This ETL (Extract, Transform, Load) Data Pipeline project is designed to process streaming data using Apache Kafka and Apache Spark. The pipeline captures data from a Kafka topic, processes it in real-time, and allows for easy data access for machine learning applications. This project leverages Docker to create isolated environments for different services, ensuring smooth deployment and management.
- Apache Kafka: For managing real-time data streams.
- Apache Spark: For processing streaming data.
- Docker: For containerization of services.
- Python: For scripting the ETL process.
- Git: For version control.
```
ETL-Data-Pipeline/
│
├── docker-compose.yml
├── Dockerfile
├── spark_kafka_streaming.py
├── kafka_producer.py
└── kafka_consumer.py
```
Clone the project repository to your local machine:

```bash
git clone https://github.com/your-username/etl-data-pipeline.git
cd etl-data-pipeline
```

Install Docker: ensure you have Docker and Docker Compose installed on your machine. You can download Docker from the official Docker website.
Build the Docker Images: navigate to the project directory and build the Docker images:

```bash
docker-compose build
```

Run the Docker Containers: start the ETL pipeline and its dependencies:

```bash
docker-compose up
```

Step 1: Start the Docker Services. Open your terminal and navigate to the project directory. Run the following command to start the services:

```bash
docker-compose up
```

This command starts the Zookeeper, Kafka, and Spark services defined in the docker-compose.yml file.
Step 2: Open a New Terminal for Data Production. In a new terminal window (leave the first one running), run the Kafka producer:

```bash
docker exec -it etl-data-pipeline_etl_1 python kafka_producer.py
```

This simulates data production, sending messages to the Kafka topic defined in your producer script.
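A minimal producer along these lines could look like the sketch below. It assumes the `kafka-python` package; the topic name `sensor-events`, the broker address `kafka:9092`, and the event fields are all illustrative placeholders, not the project's actual values.

```python
import json
import random
import time

def make_event(sensor_id: int) -> dict:
    """Build one synthetic event; this schema is illustrative only."""
    return {
        "sensor_id": sensor_id,
        "value": round(random.uniform(0.0, 100.0), 2),
        "ts": time.time(),
    }

def main():
    # Requires the kafka-python package and a broker reachable at kafka:9092
    # (hypothetical address; match your docker-compose.yml).
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for i in range(100):
        producer.send("sensor-events", make_event(i % 5))  # assumed topic name
        time.sleep(1)
    producer.flush()

# main()  # uncomment to run inside the etl container
```

Serializing each event to JSON on the producer side keeps the Spark job's parsing logic simple and language-agnostic.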
Step 3: Start the Spark Streaming Job In the terminal where you started the Docker services, the Spark streaming job (spark_kafka_streaming.py) will automatically start consuming messages from Kafka once the containers are up.
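The core of such a streaming job might look like the sketch below. The topic, broker address, and message schema are assumptions carried over from the producer sketch; `parse_event` shows in plain Python the same decoding the Spark job performs with `from_json`.

```python
import json

# Hypothetical JSON payload shape; adjust to your producer's schema.
EVENT_FIELDS = ("sensor_id", "value", "ts")

def parse_event(raw: str) -> dict:
    """Decode one message value into a dict, keeping only the expected fields."""
    event = json.loads(raw)
    return {k: event.get(k) for k in EVENT_FIELDS}

def build_stream():
    # Requires pyspark; intended to run inside the Spark container.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, LongType, StructField, StructType

    spark = SparkSession.builder.appName("etl-streaming").getOrCreate()
    schema = StructType([
        StructField("sensor_id", LongType()),
        StructField("value", DoubleType()),
        StructField("ts", DoubleType()),
    ])
    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
          .option("subscribe", "sensor-events")             # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))
    return df.writeStream.format("console").start()

# build_stream().awaitTermination()  # uncomment inside the Spark container
```

Writing to the `console` sink is a convenient way to verify the pipeline end to end before swapping in a real sink for downstream machine learning consumers.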
Step 4: Monitor Spark UI Access the Spark UI at http://localhost:8080 to monitor the Spark job and see the streaming data processing in real-time.
The pipeline will continuously process streaming data from Kafka. You can modify the producer and consumer scripts to change the data being sent and processed.
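As a starting point for such modifications, a minimal consumer could follow the sketch below. Again the `kafka-python` package, topic name, and broker address are assumptions; `decode_message` mirrors the JSON serialization assumed on the producer side.

```python
import json

def decode_message(raw: bytes) -> dict:
    """Deserialize one Kafka message value (inverse of the producer's json.dumps)."""
    return json.loads(raw.decode("utf-8"))

def main():
    # Requires the kafka-python package and the broker from docker-compose
    # (hypothetical address and topic; match your setup).
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "sensor-events",
        bootstrap_servers="kafka:9092",
        auto_offset_reset="earliest",  # replay the topic from the beginning
    )
    for msg in consumer:
        print(decode_message(msg.value))

# main()  # uncomment to run inside the container
```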