Choon Lin Tan
Real-time data processing has become essential in domains such as finance, e-commerce, and IoT, where high-velocity data must be ingested, processed, and analyzed with minimal delay. This paper explores the design and optimization of big data pipelines using Apache Kafka for data ingestion and Apache Spark Streaming for distributed processing. We implement a prototype pipeline that simulates a stock market feed handling millions of events per hour across a 10-node cluster. Key optimizations include Kafka topic partition tuning, Spark batch interval adjustments, memory and executor configuration, and fault-tolerant checkpointing. Performance is evaluated using throughput, end-to-end latency, and resource utilization as metrics. Our results demonstrate that aligning Kafka’s partition-to-consumer mapping with Spark’s task parallelism, along with fine-tuned micro-batching, yields a 27% increase in throughput and a 30% reduction in latency. We also analyze the pipeline’s fault recovery behavior under the checkpointing configuration.
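To make the optimizations in the abstract concrete, the sketch below wires a Kafka direct stream into a Spark Streaming job with an explicit micro-batch interval, executor sizing, and a checkpoint directory. This is a minimal illustrative sketch, not the paper's implementation: the topic name (stock-ticks), broker addresses, checkpoint path, and all tuning values are assumptions. With the Kafka 0.10 direct stream, each Kafka topic partition maps to exactly one Spark partition, which is the partition-to-task alignment the abstract describes.

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StockFeedPipeline {
  def main(args: Array[String]): Unit = {
    // Executor sizing is one of the tuning levers mentioned in the abstract.
    // Values here are placeholders for a hypothetical 10-node cluster.
    val conf = new SparkConf()
      .setAppName("stock-feed-pipeline")
      .set("spark.executor.instances", "10") // one executor per node (assumption)
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")

    // The batch interval is the micro-batching knob: shorter batches lower
    // latency but raise per-batch scheduling overhead.
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("hdfs:///checkpoints/stock-feed") // hypothetical path; enables fault recovery

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker1:9092,broker2:9092", // placeholder brokers
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "stock-feed-consumers",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest"
    )

    // Direct stream: one Spark partition per Kafka topic partition, so the
    // topic's partition count sets the ingest parallelism.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("stock-ticks"), kafkaParams)
    )

    // Trivial stand-in for the per-batch processing stage: count events.
    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Under these assumptions, raising the topic's partition count (while keeping total executor cores at or above it) is the primary lever for ingest parallelism, and the batch interval trades end-to-end latency against scheduling overhead per micro-batch.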