This ETL (Extract, Transform, Load) Data Pipeline project is designed to process streaming data using Apache Kafka and Apache Spark. The pipeline captures data from a Kafka topic, processes it in real-time, and allows for easy data access for machine learning applications. This project leverages Docker to create isolated environments for different services, ensuring smooth deployment and management.
- Apache Kafka: For managing real-time data streams.
- Apache Spark: For processing streaming data.
- Docker: For containerization of services.
- Python: For scripting the ETL process.
- Git: For version control.
```
ETL-Data-Pipeline/
│
├── docker-compose.yml
├── Dockerfile
├── spark_kafka_streaming.py
├── kafka_producer.py
└── kafka_consumer.py
```
Clone the project repository to your local machine:

```bash
git clone https://github.com/your-username/etl-data-pipeline.git
cd etl-data-pipeline
```

Install Docker: ensure you have Docker and Docker Compose installed on your machine. You can download Docker from the official Docker website.
Build the Docker Images: navigate to the project directory and build the Docker images:

```bash
docker-compose build
```

Run the Docker Containers: start the ETL pipeline and its dependencies:

```bash
docker-compose up
```

Step 1: Start the Docker Services. Open your terminal and navigate to the project directory. Run the following command to start the services:

```bash
docker-compose up
```

This command starts the Zookeeper, Kafka, and Spark services defined in the docker-compose.yml file.
Step 2: Open a New Terminal for Data Production. In a new terminal window (leave the first one running), run the Kafka producer:

```bash
docker exec -it etl-data-pipeline_etl_1 python kafka_producer.py
```

This simulates data production, sending messages to the Kafka topic defined in your producer script.
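A minimal producer along these lines could look like the sketch below. It assumes the `kafka-python` package; the topic name `sensor-events`, the broker address `kafka:9092`, and the event fields are all illustrative placeholders, not the project's actual values.

```python
import json
import random
import time

def make_event(sensor_id: int) -> dict:
    """Build one synthetic event; this schema is illustrative only."""
    return {
        "sensor_id": sensor_id,
        "value": round(random.uniform(0.0, 100.0), 2),
        "ts": time.time(),
    }

def main():
    # Requires the kafka-python package and a broker reachable at kafka:9092
    # (hypothetical address; match your docker-compose.yml).
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for i in range(100):
        producer.send("sensor-events", make_event(i % 5))  # assumed topic name
        time.sleep(1)
    producer.flush()

# main()  # uncomment to run inside the etl container
```

Serializing each event to JSON on the producer side keeps the Spark job's parsing logic simple and language-agnostic.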
Step 3: Start the Spark Streaming Job In the terminal where you started the Docker services, the Spark streaming job (spark_kafka_streaming.py) will automatically start consuming messages from Kafka once the containers are up.
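The core of such a streaming job might look like the sketch below. The topic, broker address, and message schema are assumptions carried over from the producer sketch; `parse_event` shows in plain Python the same decoding the Spark job performs with `from_json`.

```python
import json

# Hypothetical JSON payload shape; adjust to your producer's schema.
EVENT_FIELDS = ("sensor_id", "value", "ts")

def parse_event(raw: str) -> dict:
    """Decode one message value into a dict, keeping only the expected fields."""
    event = json.loads(raw)
    return {k: event.get(k) for k in EVENT_FIELDS}

def build_stream():
    # Requires pyspark; intended to run inside the Spark container.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, LongType, StructField, StructType

    spark = SparkSession.builder.appName("etl-streaming").getOrCreate()
    schema = StructType([
        StructField("sensor_id", LongType()),
        StructField("value", DoubleType()),
        StructField("ts", DoubleType()),
    ])
    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
          .option("subscribe", "sensor-events")             # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))
    return df.writeStream.format("console").start()

# build_stream().awaitTermination()  # uncomment inside the Spark container
```

Writing to the `console` sink is a convenient way to verify the pipeline end to end before swapping in a real sink for downstream machine learning consumers.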
Step 4: Monitor Spark UI Access the Spark UI at http://localhost:8080 to monitor the Spark job and see the streaming data processing in real-time.
The pipeline will continuously process streaming data from Kafka. You can modify the producer and consumer scripts to change the data being sent and processed.
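As a starting point for such modifications, a minimal consumer could follow the sketch below. Again the `kafka-python` package, topic name, and broker address are assumptions; `decode_message` mirrors the JSON serialization assumed on the producer side.

```python
import json

def decode_message(raw: bytes) -> dict:
    """Deserialize one Kafka message value (inverse of the producer's json.dumps)."""
    return json.loads(raw.decode("utf-8"))

def main():
    # Requires the kafka-python package and the broker from docker-compose
    # (hypothetical address and topic; match your setup).
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "sensor-events",
        bootstrap_servers="kafka:9092",
        auto_offset_reset="earliest",  # replay the topic from the beginning
    )
    for msg in consumer:
        print(decode_message(msg.value))

# main()  # uncomment to run inside the container
```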