Module 08: Data Streaming with Apache Kafka¶
Introduction¶
Data streaming is the continuous processing of data in real time, as opposed to batch processing where data is accumulated and processed periodically. In modern applications such as fraud detection, IoT sensor monitoring, or seismic alerts, response time is critical.
Apache Kafka is a distributed streaming platform that lets you publish, store, and process data streams in real time. Originally developed at LinkedIn and later donated to the Apache Software Foundation, Kafka has become the de facto standard for streaming data architectures.
Kafka Architecture¶
┌─────────────┐      ┌─────────────────────────────────────┐      ┌─────────────┐
│  PRODUCER   │─────>│            KAFKA CLUSTER            │─────>│  CONSUMER   │
│  (Python)   │      │   ┌─────────────────────────────┐   │      │  (Python)   │
└─────────────┘      │   │        Topic: sismos        │   │      └─────────────┘
                     │   │   ├── Partition 0           │   │
┌─────────────┐      │   │   ├── Partition 1           │   │      ┌─────────────┐
│  PRODUCER   │─────>│   │   └── Partition 2           │   │─────>│  CONSUMER   │
│  (API)      │      │   └─────────────────────────────┘   │      │  (Spark)    │
└─────────────┘      └─────────────────────────────────────┘      └─────────────┘
                                        │
                                        ▼
                               ┌─────────────────┐
                               │   KRaft Mode    │
                               │ (no ZooKeeper)  │
                               └─────────────────┘
Key Concepts¶
| Concept | Description |
|---|---|
| Broker | Kafka server that stores and serves messages |
| Topic | Logical channel where messages are published (like a table) |
| Partition | Division of a topic for parallelism and scalability |
| Producer | Application that sends messages to a topic |
| Consumer | Application that reads messages from a topic |
| Consumer Group | Set of consumers that share the load of a topic |
| Offset | Position of a message within a partition |
KRaft Mode vs ZooKeeper¶
Since Kafka 3.x, KRaft (Kafka Raft) mode replaces ZooKeeper for metadata management, simplifying the architecture:
# Before (with ZooKeeper) - 2 services
kafka + zookeeper
# Now (KRaft mode) - 1 self-contained service
kafka
Delivery Guarantees¶
Kafka offers three delivery semantics:
| Semantics | Description | Typical use |
|---|---|---|
| At-most-once | Message may be lost, never duplicated | Non-critical logs |
| At-least-once | Message may be duplicated, never lost | Processing with idempotency |
| Exactly-once | Message is delivered exactly once | Financial transactions |
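In the confluent-kafka client used throughout this module, these guarantees are driven mostly by producer configuration. The snippet below is a minimal sketch, assuming the local broker from Challenge 1; adapt the settings to your own requirements:

from confluent_kafka import Producer

# At-least-once style: wait for acknowledgement from all in-sync replicas
# and let the client retry on transient failures (duplicates are possible).
at_least_once = Producer({
    'bootstrap.servers': 'localhost:9092',
    'acks': 'all',
})

# Idempotent producer: retried sends are deduplicated by the broker, which is
# the building block Kafka uses (together with transactions) for exactly-once pipelines.
idempotent = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,
})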
Required Tools¶
- Docker and Docker Compose: To set up the infrastructure
- Python 3.9+: Main language
- confluent-kafka: Official Python client for Kafka
- requests: To consume external APIs
Dependency Installation¶
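The packages match the tool list above; a typical installation in a virtual environment looks like this (add pyspark only if you plan to run the Challenge 6 job outside a Spark container):

python -m venv .venv
source .venv/bin/activate
pip install confluent-kafka requests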
Challenge 1: Set Up Kafka with Docker Compose¶
Objective: Create a functional Kafka cluster on your local machine using Docker.
Difficulty: Basic
Instructions¶
- Create a directory for the project.
- Create a docker-compose.yml file with a Kafka broker in KRaft mode.
- The broker must meet the following requirements:
  - Image: apache/kafka:latest
  - Port 9092 exposed
  - KRaft mode enabled (no ZooKeeper)
  - Environment variables correctly configured
- Start the broker and verify that it is running.
Success Criteria¶
- The Kafka container is running without errors
- Logs show "Kafka Server started"
- Port 9092 responds
Hints¶
- Check the documentation for the apache/kafka image on Docker Hub
- Key variables: KAFKA_NODE_ID, KAFKA_PROCESS_ROLES, KAFKA_LISTENERS
- KRaft mode requires KAFKA_CONTROLLER_QUORUM_VOTERS
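If you get stuck, the following is a minimal single-node sketch rather than a reference solution; the node ID, listener names, and ports are assumptions you should check against the apache/kafka image documentation:

# docker-compose.yml (sketch) - single Kafka broker in KRaft mode
services:
  kafka:
    image: apache/kafka:latest
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

Start it with docker compose up -d and check docker compose logs kafka for the "Kafka Server started" line.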
Resources¶
Challenge 2: Your First Producer¶
Objective: Create a Python script that sends messages to Kafka.
Difficulty: Basic
Instructions¶
- Create productor_simple.py
- The script must:
  - Connect to localhost:9092
  - Send 10 messages to the topic mensajes-test
  - Each message must be JSON with: id, texto, timestamp
  - Confirm each successful send
- Base structure:
from confluent_kafka import Producer
import json
from datetime import datetime

def delivery_callback(err, msg):
    """Callback to confirm delivery"""
    # Implement: print the error or a confirmation
    pass

config = {
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'productor-simple'
}

producer = Producer(config)

# Implement: send 10 messages
# Use: producer.produce(topic, key, value, callback=delivery_callback)
# Don't forget: producer.flush()
Success Criteria¶
- Script runs without errors
- 10 messages sent successfully
- Each message has a valid JSON structure
Hints¶
- json.dumps() to serialize to a string
- The key can be the message ID
- flush() ensures all messages are sent before exiting
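One way to complete the skeleton, shown as a sketch (the message text is illustrative; only the topic name and the id/texto/timestamp fields come from the challenge statement):

from confluent_kafka import Producer
import json
from datetime import datetime

def delivery_callback(err, msg):
    """Print a delivery error or a confirmation with topic, partition, and offset."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

config = {'bootstrap.servers': 'localhost:9092', 'client.id': 'productor-simple'}
producer = Producer(config)

for i in range(10):
    mensaje = {'id': i, 'texto': f'mensaje {i}', 'timestamp': datetime.now().isoformat()}
    producer.produce(
        'mensajes-test',
        key=str(i),
        value=json.dumps(mensaje),
        callback=delivery_callback,
    )
    producer.poll(0)  # serve pending delivery callbacks

producer.flush()  # block until every message has been delivered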
Challenge 3: Your First Consumer¶
Objective: Create a script that reads messages from Kafka in real time.
Difficulty: Basic
Instructions¶
- Create consumidor_simple.py
- The script must:
  - Subscribe to the topic mensajes-test
  - Read messages in an infinite loop
  - Print each message with its offset and partition
  - Exit cleanly with Ctrl+C
- Base structure:
from confluent_kafka import Consumer, KafkaError
import json

config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'mi-grupo-consumidor',
    'auto.offset.reset': 'earliest'  # Read from the beginning
}

consumer = Consumer(config)
consumer.subscribe(['mensajes-test'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        # Implement: check for errors and process the message
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
Success Criteria¶
- Consumer connects successfully
- Reads the producer's messages
- Displays: partition, offset, key, value
- Exits cleanly with Ctrl+C
Hints¶
- Check msg is None (timeout with no message)
- Check msg.error() before processing
- Use msg.partition(), msg.offset(), msg.key(), msg.value()
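A possible completion of the skeleton, following the hints above (it assumes the messages were produced as JSON strings, as in Challenge 2):

from confluent_kafka import Consumer
import json

config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'mi-grupo-consumidor',
    'auto.offset.reset': 'earliest',
}

consumer = Consumer(config)
consumer.subscribe(['mensajes-test'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue  # poll timed out, nothing to read
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        key = msg.key().decode('utf-8') if msg.key() else None
        value = json.loads(msg.value().decode('utf-8'))
        print(f"partition={msg.partition()} offset={msg.offset()} "
              f"key={key} value={value}")
except KeyboardInterrupt:
    pass  # Ctrl+C: fall through to close()
finally:
    consumer.close()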
Challenge 4: Connect to a Real API (USGS Earthquakes)¶
Objective: Create a producer that consumes earthquake data in real time.
Difficulty: Intermediate
Instructions¶
- Create productor_sismos.py
- The producer must:
  - Query the USGS API every 30 seconds
  - Parse the GeoJSON response
  - Publish each new earthquake to the topic sismos
  - Avoid duplicates (maintain a set of processed IDs)
- API to use: the USGS GeoJSON feed (see Earthquake GeoJSON in the Resources section)
- Message structure to publish:
Success Criteria¶
- Queries the API every 30 seconds
- Publishes earthquakes to the topic sismos
- Does not publish duplicate earthquakes
- Handles network errors gracefully
- Runs continuously
Hints¶
- Earthquakes are in response['features']
- The ID is in feature['id']
- Properties are in feature['properties']
- Coordinates are in feature['geometry']['coordinates'] (lon, lat, depth)
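A sketch of one possible implementation; the published field names (magnitud, lugar, tiempo, and so on) are assumptions, since the exact message structure is yours to define:

from confluent_kafka import Producer
import json
import time
import requests

# USGS feed for the last hour (see the Resources section for other time windows)
URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"

producer = Producer({'bootstrap.servers': 'localhost:9092'})
vistos = set()  # IDs already published, to avoid duplicates

while True:
    try:
        response = requests.get(URL, timeout=10).json()
        for feature in response['features']:
            sismo_id = feature['id']
            if sismo_id in vistos:
                continue
            vistos.add(sismo_id)
            props = feature['properties']
            lon, lat, depth = feature['geometry']['coordinates']
            # Assumed message structure - adapt it to whatever you define above
            mensaje = {
                'id': sismo_id,
                'magnitud': props['mag'],
                'lugar': props['place'],
                'tiempo': props['time'],
                'longitud': lon,
                'latitud': lat,
                'profundidad': depth,
            }
            producer.produce('sismos', key=sismo_id, value=json.dumps(mensaje))
        producer.flush()
    except requests.RequestException as exc:
        print(f"Network error, will retry: {exc}")
    time.sleep(30)  # query the API every 30 seconds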
Challenge 5: Alert System¶
Objective: Create a consumer that detects significant earthquakes.
Difficulty: Intermediate
Instructions¶
- Create consumidor_alertas.py
- The consumer must:
  - Read from the topic sismos
  - Filter earthquakes with magnitude >= 4.5
  - Display a highlighted alert in the console
  - Save alerts to alertas.log
- Alert format:
Success Criteria¶
- Correctly filters by magnitude >= 4.5
- Alerts visible in the console
- Alerts saved to log file
- System runs continuously
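A sketch of the core logic; the alert text below is an assumption and should be replaced by the alert format required in the instructions:

from confluent_kafka import Consumer
import json

UMBRAL = 4.5  # minimum magnitude that triggers an alert

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'grupo-alertas',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['sismos'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        sismo = json.loads(msg.value().decode('utf-8'))
        magnitud = sismo.get('magnitud') or 0
        if magnitud >= UMBRAL:
            # Assumed alert text - replace with the format required by the challenge
            alerta = f"*** ALERT: M{magnitud} earthquake - {sismo.get('lugar')} ***"
            print(alerta)
            with open('alertas.log', 'a') as f:
                f.write(alerta + '\n')
except KeyboardInterrupt:
    pass
finally:
    consumer.close()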
Challenge 6: Aggregations with Spark Structured Streaming¶
Objective: Process the earthquake stream with Spark to calculate statistics.
Difficulty: Advanced
Instructions¶
- Add a Spark service to your docker-compose.yml
- Create spark_streaming_sismos.py
- The job must:
  - Read from the topic sismos as a stream
  - Parse the JSON messages
  - Calculate per 5-minute window:
    - Earthquake count
    - Average magnitude
    - Maximum magnitude
  - Write results to the console
- Base structure:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("SismosStreaming") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0") \
    .getOrCreate()

# Define the message schema
schema = StructType([
    StructField("id", StringType()),
    StructField("magnitud", DoubleType()),
    # ... complete the schema
])

# Read from the topic
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "sismos") \
    .load()

# Parse the JSON
sismos = df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Implement: aggregations per time window
# Use: window(), avg(), max(), count()
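One possible continuation of the base structure, shown as a sketch: it uses the Kafka record timestamp as the event time (an assumption; you could instead parse a time field from the message itself) and a 5-minute tumbling window with a watermark:

# Continuation of spark_streaming_sismos.py above (sketch)
eventos = df.select(
    col("timestamp").alias("event_time"),  # Kafka record timestamp as event time
    from_json(col("value").cast("string"), schema).alias("data")
).select("event_time", "data.*")

agregados = (
    eventos
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"))
    .agg(
        count("*").alias("num_sismos"),
        avg("magnitud").alias("magnitud_promedio"),
        max("magnitud").alias("magnitud_maxima"),
    )
)

query = (
    agregados.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()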
Success Criteria¶
- Spark reads from the Kafka topic
- Correctly parses the messages
- Calculates aggregations per window
- Displays results in the console
FINAL Challenge: Visualization Dashboard¶
Objective: Create a web visualization that displays earthquakes in real time.
Difficulty: Advanced
Evaluation Criteria¶
| Criterion | Points |
|---|---|
| Interactive map with earthquake locations | 20 |
| Automatic updates without reloading | 20 |
| Live statistics (total, max, average) | 15 |
| Magnitude filters | 15 |
| Visual differentiation by magnitude | 15 |
| Professional design | 15 |
| Total | 100 |
Technical Requirements¶
- HTML5 + vanilla JavaScript (no frameworks)
- Leaflet.js for maps
- Fetch API to retrieve data
- Modern CSS (flexbox/grid)
Suggestions¶
- Query the USGS API directly from JavaScript
- Use setInterval() for automatic updates
- Implement colors by magnitude (risk scale)
Submission¶
- Self-contained HTML file
- Screenshot showing it working
- Brief usage documentation
Reference: You can see an example dashboard at Seismic Observatory, but the challenge is to create your own version with your own style.
Resources and References¶
Official Documentation¶
- Apache Kafka Documentation
- Confluent Python Client
- Spark Structured Streaming + Kafka
- USGS Earthquake API
Earthquake GeoJSON¶
- Last hour: https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson
- Last day: https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson
- Last week: https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.geojson
---
Course: Big Data with Python - From Zero to Production
Instructor: Juan Marcelo Gutierrez Miranda | @TodoEconometria
Hash ID: 4e8d9b1a5f6e7c3d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c
Methodology: Progressive exercises with real data and professional tools
Academic References:
- Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: A distributed messaging system for log processing. Proceedings of the NetDB Workshop.
- Zaharia, M., et al. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56-65.
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. ISBN: 978-1449373320.
- Narkhede, N., Shapira, G., & Palino, T. (2017). Kafka: The Definitive Guide. O'Reilly Media. ISBN: 978-1491936160.
- USGS (2024). Earthquake Hazards Program - Real-time Feeds. United States Geological Survey.