# Capstone Project: Big Data Pipeline with Docker Infrastructure
Course: Big Data with Python - Prof. Juan Marcelo Gutierrez Miranda (@TodoEconometria)
## Objective
Build from scratch a data processing infrastructure using Docker, Apache Spark, and PostgreSQL. Starting from the Quality of Government (QoG) dataset, design and execute an ETL + analysis pipeline that answers a research question formulated by you.
What is evaluated: Not just the code, but your learning process. You may use AI tools (ChatGPT, Copilot, Claude, etc.) but you must document how you used them and what you learned.
## Dataset
Quality of Government Standard Dataset (QoG) - January 2024
- ~15,500 rows (countries x years) x ~1,990 columns
- Variables: democracy, corruption, GDP, health, education, political stability...
- Documentation: QoG Data
## Structure: 4 Blocks
### Block A: Docker Infrastructure (30%)
Write a `docker-compose.yml` that launches a mini-cluster:
| Service | Minimum requirement |
|---|---|
| PostgreSQL | Database to store results |
| Spark Master | Cluster coordinator |
| Spark Worker | At least 1 processing node |
Steps:
- Research what Docker Compose is and how a YAML file is structured
- Write your `docker-compose.yml` with the 3 minimum services
- Add a `healthcheck` at least for PostgreSQL
- Run `docker compose up -d` and verify everything starts
- Open the Spark UI and take a screenshot showing the connected worker
- Write `02_INFRAESTRUCTURA.md` explaining each section of your YAML in your own words
Hints:
- Spark image: `apache/spark:3.5.4-python3` (or `bitnami/spark:3.5`)
- PostgreSQL image: `postgres:15-alpine`
- The Spark Master uses port 7077 for cluster communication and 8080 for the web UI (the sketch below puts these hints together)
Deliverables: `docker-compose.yml` + `02_INFRAESTRUCTURA.md`
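A minimal sketch of what such a `docker-compose.yml` could look like. The service names, credentials, and healthcheck timings are placeholders, and the `SPARK_MODE` / `SPARK_MASTER_URL` environment variables are conventions of the `bitnami/spark` image specifically; adapt them if you use `apache/spark` instead:

```yaml
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: qog        # placeholder credentials: load real ones
      POSTGRES_PASSWORD: qog    # from a .env file, never commit them
      POSTGRES_DB: qog
    ports:
      - "5432:5432"
    healthcheck:                # lets Compose report when the DB is ready
      test: ["CMD-SHELL", "pg_isready -U qog"]
      interval: 10s
      timeout: 5s
      retries: 5

  spark-master:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=master       # bitnami-specific: run this container as master
    ports:
      - "8080:8080"             # web UI
      - "7077:7077"             # cluster communication

  spark-worker:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker       # bitnami-specific: run as worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
```

Your own file will differ; the point of `02_INFRAESTRUCTURA.md` is to explain every section of whatever you end up with, in your own words.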
### Block B: ETL Pipeline with Spark (25%)
Write a Python script that processes the QoG dataset using Apache Spark.
Steps:
- Choose 5 countries that interest you (they cannot be the ones from the instructor's example: KAZ, UZB, TKM, KGZ, TJK)
- Choose 5 numerical variables from the QoG dataset
- Formulate a research question
- Write `pipeline.py` (a hedged sketch follows after this block) that:
  - Creates a SparkSession
  - Reads the CSV with `spark.read.csv()`
  - Selects your countries and variables
  - Filters a range of years (e.g., 2000-2023)
  - Creates at least 1 derived variable
  - Saves the result as Parquet
Important: Your selection of countries and variables must be UNIQUE. If two students submit the same 5 countries, it will be considered plagiarism.
Deliverable: `pipeline.py`
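A minimal sketch of the shape `pipeline.py` could take, assuming the QoG time-series layout with `cname`, `ccodealp`, and `year` identifier columns. The file path, country codes, and variable names (`wdi_gdpcapcon2015`, `vdem_polyarchy`) are illustrative placeholders: pick your own from the QoG codebook, since duplicated selections are penalized.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# When running against the Docker cluster you may point the session at the
# master, e.g. .master("spark://localhost:7077"); the default works for local tests.
spark = SparkSession.builder.appName("qog-pipeline").getOrCreate()

# header=True keeps column names; inferSchema=True detects numeric types
df = spark.read.csv("data/qog_std_ts_jan24.csv", header=True, inferSchema=True)

countries = ["ARG", "BRA", "CHL", "PER", "URY"]      # placeholders: use your own 5
variables = ["wdi_gdpcapcon2015", "vdem_polyarchy"]  # placeholder QoG variables

result = (
    df.select("cname", "ccodealp", "year", *variables)
      .filter(F.col("ccodealp").isin(countries))
      .filter(F.col("year").between(2000, 2023))
      # derived variable: log GDP per capita, as one possible transformation
      .withColumn("log_gdpcap", F.log(F.col("wdi_gdpcapcon2015")))
)

# Save as Parquet for the analysis stage
result.write.mode("overwrite").parquet("output/qog_subset.parquet")
spark.stop()
```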
### Block C: Analysis and Visualization (25%)
Analyze your processed data and answer your research question.
Choose ONE option:
| Option | What to do | Example |
|---|---|---|
| Clustering | K-Means on your countries | "Which countries are similar based on democracy + GDP?" |
| Time series | Evolution chart by country | "How did corruption change between 2000-2023?" |
| Comparison | Before/after an event | "Did GDP change after the 2008 crisis?" |
Minimum requirements:
- 2 charts (matplotlib, plotly, or seaborn)
- Each chart with title, labeled axes, and legend
- Interpretation paragraph for each chart
Deliverable: Charts and interpretation in `03_RESULTADOS.md` (an illustrative chart sketch follows below)
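For the time-series option, a minimal matplotlib sketch, assuming the Parquet output of the `pipeline.py` sketch above (same placeholder column names):

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_parquet("output/qog_subset.parquet")  # needs pyarrow or fastparquet

fig, ax = plt.subplots(figsize=(8, 5))
for country, group in df.groupby("ccodealp"):
    g = group.sort_values("year")  # keep lines chronological
    ax.plot(g["year"], g["vdem_polyarchy"], label=country)

# Title, labeled axes, and legend are part of the minimum requirements
ax.set_title("Electoral democracy index by country, 2000-2023")
ax.set_xlabel("Year")
ax.set_ylabel("V-Dem polyarchy index (placeholder variable)")
ax.legend(title="Country")

Path("charts").mkdir(exist_ok=True)
fig.savefig("charts/democracy_trend.png", dpi=150)
```

Remember that the deliverable is not the chart alone but the interpretation paragraph that ties each chart back to your research question.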
### Block D: AI Reflection - "3 Key Moments" (20%)
Document your learning process and share your prompts.
For each block (A, B, C), answer:
| Moment | Question |
|---|---|
| Start | What was the first thing you asked the AI (or searched for)? |
| Error | What failed and how did you solve it? |
| Learning | What did you learn that you did NOT know before? |
Additionally, paste the exact text of the AI prompt that helped you the most in each block.
What is evaluated:
- That your prompts are real (pasted as-is, not made up afterward)
- That your answers are specific
- That the errors are real (documenting them does not lower your grade)
- That the process is consistent with your code
Deliverable: `04_REFLEXION_IA.md`
## Comprehension Questions (mandatory)
Answer in `05_RESPUESTAS.md`:
- Infrastructure: If your worker has 2 GB of RAM and the CSV weighs 3 GB, what happens? How would you solve it?
- ETL: Why does `spark.read.csv()` not execute anything until you call `.count()` or `.show()`? (A small illustration follows after these questions.)
- Analysis: Interpret your main chart: what pattern do you see and why do you think it occurs?
- Scalability: If you had to repeat this with a 50 GB dataset, what would you change in your infrastructure?
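As a hint for the ETL question, a tiny self-contained sketch of the transformation/action split (the file path is a placeholder); explaining why Spark is designed this way is still up to you:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Transformations only build an execution plan; Spark peeks at the header
# line for column names but does not process the data yet
df = spark.read.csv("data/qog_std_ts_jan24.csv", header=True)
subset = df.filter(df["year"] > 2000)  # still lazy: the plan just grows

# An action forces Spark to execute the accumulated plan end to end
print(subset.count())
```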
## Submission Format
```text
entregas/trabajo_final/apellido_nombre/
    PROMPTS.md              <- THE MOST IMPORTANT (your AI prompts)
    01_README.md            <- Your data + research question
    02_INFRAESTRUCTURA.md   <- YAML explanation
    03_RESULTADOS.md        <- Charts + interpretation
    04_REFLEXION_IA.md      <- 3 Key Moments x 3 blocks
    05_RESPUESTAS.md        <- 4 comprehension questions
    docker-compose.yml      <- Your working YAML
    pipeline.py             <- ETL + Analysis
    requirements.txt        <- Dependencies (pip freeze)
    .gitignore              <- Exclude data, venv, __pycache__
```
Copy the template from `trabajo_final/plantilla/` to your submission folder.
## Process (NO Pull Request)
- Sync your fork: `git fetch upstream && git merge upstream/main`
- Copy the template: `cp -r trabajo_final/plantilla/ entregas/trabajo_final/apellido_nombre/`
- Fill in `PROMPTS.md` as you work: this file is what gets evaluated
- Complete the files (01 through 05) + `docker-compose.yml` + `pipeline.py`
- Push to your fork: `git add . && git commit -m "Trabajo Final" && git push`
- Done! The system evaluates your `PROMPTS.md` automatically
You do not need to create a Pull Request: the automated system evaluates your `PROMPTS.md` file directly in your fork. Just make sure to upload your work with `git push`.
## Prohibited items
- Data files (`.csv`, `.parquet`, `.db`)
- Virtual environments (`venv/`, `.venv/`)
- `.env` files with real credentials
- `__pycache__/` folders
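A minimal `.gitignore` covering exactly these items (extend it as your project needs):

```gitignore
*.csv
*.parquet
*.db
venv/
.venv/
.env
__pycache__/
```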
## Evaluation
| Block | Weight | What is evaluated |
|---|---|---|
| A. Infrastructure | 30% | Working YAML + explanation in your own words |
| B. ETL Pipeline | 25% | Spark API + your own countries/variables + question |
| C. Analysis | 25% | Charts + interpretation that answers your question |
| D. AI Reflection | 20% | Real and specific learning process |
Penalties:
- Copying the same countries/variables as another student: -50%
- Copying the countries from the instructor's example (Central Asia): -30%
- YAML that does not work without an explanation of why: -15%
- Absent or generic AI reflection: -20%
## Resources
- Spark Documentation: spark.apache.org
- Docker Compose: docs.docker.com/compose
- QoG Codebook: qog.pol.gu.se (download codebook to see variables)
- Quick Start Guide: `trabajo_final/GUIA_INICIO_RAPIDO.md`
---
Course: Big Data with Python - From Zero to Production
Instructor: Juan Marcelo Gutierrez Miranda | @TodoEconometria
Hash ID: 4e8d9b1a5f6e7c3d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c
Methodology: Progressive exercises with real data and professional tools
Academic references:
- Zaharia, M., et al. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56-65.
- Teorell, J., et al. (2024). The Quality of Government Standard Dataset. University of Gothenburg.
- Merkel, D. (2014). Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239), 2.