ANALYZING LARGE DATASETS WITH HADOOP AND SPARK: AN INTEGRATED APPROACH

Harsha Vardhan Reddy Goli

Abstract

The exponential growth of data generated by modern digital systems poses significant challenges for traditional data processing architectures. While Hadoop and its MapReduce paradigm offer scalable storage and batch processing, their disk-based execution model limits efficiency, especially in iterative and real-time applications. Conversely, Apache Spark provides high-performance, in-memory processing and supports a wide range of analytics tasks, including streaming, machine learning, and interactive queries. This paper proposes an integrated Hadoop-Spark architecture that combines the storage resilience of the Hadoop Distributed File System (HDFS) with Spark’s advanced in-memory computation capabilities. We present a case study involving a 100 GB semi-structured web log dataset to evaluate the performance, scalability, and resource efficiency of the integrated approach. Benchmark results show that the hybrid model significantly reduces execution time and improves CPU utilizati

International Scientific Journal of Contemporary Research in Engineering Science and Management

| Monthly, Peer-Reviewed, Refereed, Scholarly, Multidisciplinary and Open Access Journal | Impact factor 7.521 (Calculated by Google Scholar and Semantic Scholar | AI-Powered Research Tool | Indexing) in all Major Database & Metadata, Citation Generator
