Choon Lin Tan
Real-time data processing has become essential in domains such as finance, e-commerce, and IoT, where high-velocity data must be ingested, processed, and analyzed with minimal delay. This paper explores the design and optimization of big data pipelines using Apache Kafka for data ingestion and Apache Spark Streaming for distributed processing. We implement a prototype pipeline that simulates a stock market feed handling millions of events per hour across a 10-node cluster. Key optimizations include Kafka topic partition tuning, Spark batch interval adjustments, memory and executor configuration, and fault-tolerant checkpointing. Performance is evaluated using throughput, end-to-end latency, and resource utilization as metrics. Our results demonstrate that aligning Kafka’s partition-to-consumer mapping with Spark’s task parallelism, along with fine-tuned micro-batching, yields a 27% increase in throughput and a 30% reduction in latency. We also analyze the pipeline’s fault recovery behavior under the checkpointing configuration.
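To make the optimizations in the abstract concrete, the sketch below wires a Kafka direct stream into a Spark Streaming job with an explicit micro-batch interval, executor sizing, and a checkpoint directory. This is a minimal illustrative sketch, not the paper's implementation: the topic name (stock-ticks), broker addresses, checkpoint path, and all tuning values are assumptions. With the Kafka 0.10 direct stream, each Kafka topic partition maps to exactly one Spark partition, which is the partition-to-task alignment the abstract describes.

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StockFeedPipeline {
  def main(args: Array[String]): Unit = {
    // Executor sizing is one of the tuning levers mentioned in the abstract.
    // Values here are placeholders for a hypothetical 10-node cluster.
    val conf = new SparkConf()
      .setAppName("stock-feed-pipeline")
      .set("spark.executor.instances", "10") // one executor per node (assumption)
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")

    // The batch interval is the micro-batching knob: shorter batches lower
    // latency but raise per-batch scheduling overhead.
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("hdfs:///checkpoints/stock-feed") // hypothetical path; enables fault recovery

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker1:9092,broker2:9092", // placeholder brokers
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "stock-feed-consumers",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest"
    )

    // Direct stream: one Spark partition per Kafka topic partition, so the
    // topic's partition count sets the ingest parallelism.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("stock-ticks"), kafkaParams)
    )

    // Trivial stand-in for the per-batch processing stage: count events.
    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Under these assumptions, raising the topic's partition count (while keeping total executor cores at or above it) is the primary lever for ingest parallelism, and the batch interval trades end-to-end latency against scheduling overhead per micro-batch.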