# Exercises

Complete list of all available exercises in the course.

## Exercise Roadmap

### Module 1: Databases
| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 1.1 | Introduction to SQLite | SQLite + Pandas | Basic | Available |
| 2.1 | PostgreSQL HR | PostgreSQL | Intermediate | Available |
| 2.2 | PostgreSQL Gardening | PostgreSQL | Intermediate | Available |
| 2.3 | SQLite to PostgreSQL Migration | PostgreSQL + Python | Intermediate | Available |
| 3.1 | Oracle HR | Oracle Database | Advanced | Available |
| 5.1 | Excel/Python Analysis | Pandas + Excel | Basic | Available |

### Module 2: Data Cleaning and ETL

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 02 | ETL Pipeline QoG | PostgreSQL + Pandas | Advanced | Available |

### Module 3: Distributed Processing

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 03 | Distributed Processing with Dask | Dask + Parquet | Intermediate | Available |

### Module 4: Machine Learning

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 04 | Machine Learning (PCA, K-Means) | Scikit-Learn, PCA, K-Means | Advanced | Available |
| 04.2 | Transfer Learning Flowers | TensorFlow, MobileNetV2 | Advanced | Available |
| ARIMA | Time Series ARIMA/SARIMA | statsmodels, Box-Jenkins | Advanced | Available |

### Module 5: NLP and Text Mining

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 05 | NLP and Text Mining | NLTK, TF-IDF, Jaccard, Sentiment | Advanced | Available |

### Module 6: Panel Data Analysis

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 06 | Panel Data Analysis | linearmodels, Panel OLS, Altair | Advanced | Available |

### Module 7: Big Data Infrastructure

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 07 | Big Data Infrastructure | Docker Compose, Apache Spark | Intermediate-Advanced | Available |

### Module 8: Streaming with Kafka

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 08 | Streaming with Kafka | Apache Kafka, Spark Streaming, KRaft | Advanced | Available |

### Module 9: Cloud with LocalStack

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| 09 | Cloud with LocalStack | LocalStack, Terraform, AWS | Advanced | Available |

### Capstone Project

| # | Exercise | Technology | Level | Status |
|---|---|---|---|---|
| TF | Capstone Integrative Project | Docker + Spark + PostgreSQL + QoG | Advanced | Available |

## MODULE 1: Databases

### Exercise 1.1: Introduction to SQLite

Details
- Level: Basic
- Dataset: NYC Taxi (10MB sample)
- Technologies: SQLite, Pandas
What you'll learn:
- Load CSV data into a SQLite database
- Basic SQL queries (SELECT, WHERE, GROUP BY)
- Optimization with indexes
- Export results to CSV
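The full round trip above (load, index, query, export) fits in a few lines. A minimal sketch with a tiny inline DataFrame standing in for the NYC Taxi CSV; column names here are illustrative, not the dataset's actual schema:

```python
import sqlite3
import pandas as pd

# Tiny stand-in for the NYC Taxi sample (columns are illustrative)
trips = pd.DataFrame({
    "pickup_zone": ["Midtown", "Midtown", "Harlem", "Soho"],
    "fare": [12.5, 9.0, 20.0, 7.5],
})

conn = sqlite3.connect(":memory:")        # a file path in the real exercise
trips.to_sql("trips", conn, index=False)  # load the data into SQLite

# An index on the filter/group column speeds up repeated lookups
conn.execute("CREATE INDEX idx_zone ON trips(pickup_zone)")

# Basic SELECT / GROUP BY query, back into pandas
result = pd.read_sql_query(
    "SELECT pickup_zone, AVG(fare) AS avg_fare "
    "FROM trips GROUP BY pickup_zone ORDER BY avg_fare DESC",
    conn,
)
result.to_csv("avg_fare_by_zone.csv", index=False)  # export results
```

`DataFrame.to_sql` and `pd.read_sql_query` both accept a plain `sqlite3` connection, so no extra driver is needed for this exercise.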

### Exercise 2.1: PostgreSQL with HR Database

Details
- Level: Intermediate
- Database: HR (Human Resources) from Oracle
- Technologies: PostgreSQL, SQL
What you'll learn:
- Install and configure PostgreSQL
- Load databases from SQL scripts
- Complex queries with multiple JOINs
- PostgreSQL-specific functions
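The multi-JOIN pattern at the heart of this exercise can be sketched without a PostgreSQL server. The snippet below uses SQLite so it is self-contained, but the SQL itself is standard and runs unchanged in PostgreSQL; the three tables mirror the HR schema's `employees` → `departments` → `locations` chain with made-up rows:

```python
import sqlite3

# Self-contained stand-in: same JOIN syntax as PostgreSQL against the HR schema
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE departments (department_id INTEGER PRIMARY KEY,
                          department_name TEXT, location_id INTEGER);
CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, last_name TEXT,
                        salary REAL, department_id INTEGER);
INSERT INTO locations VALUES (1, 'Seattle'), (2, 'Toronto');
INSERT INTO departments VALUES (10, 'IT', 1), (20, 'Marketing', 2);
INSERT INTO employees VALUES (100, 'King', 24000, 10), (101, 'Kochhar', 17000, 20);
""")

# Two JOINs chained: employee -> department -> location
rows = conn.execute("""
    SELECT e.last_name, d.department_name, l.city
    FROM employees e
    JOIN departments d ON d.department_id = e.department_id
    JOIN locations   l ON l.location_id  = d.location_id
    ORDER BY e.salary DESC
""").fetchall()
```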

### Exercise 2.2: PostgreSQL Gardening

Details
- Level: Intermediate
- Database: Gardening sales system
- Technologies: PostgreSQL, Window Functions
What you'll learn:
- Sales analysis with SQL
- Complex aggregations (GROUP BY, HAVING)
- Window Functions for rankings
- Materialized views
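Window-function rankings are the centerpiece of this exercise. The `RANK() OVER (PARTITION BY ...)` syntax below is the same in PostgreSQL; SQLite (3.25+) is used only so the sketch runs anywhere, and the sales rows are invented:

```python
import sqlite3

# Rank products by amount within each region (illustrative data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES
  ('rose', 'north', 300), ('tulip', 'north', 500),
  ('rose', 'south', 400), ('tulip', 'south', 100);
""")

rows = conn.execute("""
    SELECT region, product, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()
```

Unlike `GROUP BY`, the window function keeps every row and attaches the rank alongside it, which is what makes per-region "top seller" queries one-liners.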

### Exercise 2.3: SQLite to PostgreSQL Migration

Details
- Level: Intermediate
- Technologies: SQLite, PostgreSQL, Python
What you'll learn:
- Differences between database engines
- Migrate schemas and data
- Adapt data types
- Validate integrity
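The schema-adaptation step can be sketched without a PostgreSQL target: read the SQLite schema via `PRAGMA table_info` and emit PostgreSQL DDL with mapped types. The type map below is illustrative and far from exhaustive:

```python
import sqlite3

# Illustrative SQLite -> PostgreSQL type mapping (not exhaustive)
TYPE_MAP = {"INTEGER": "BIGINT", "REAL": "DOUBLE PRECISION",
            "TEXT": "TEXT", "BLOB": "BYTEA"}

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE trips (id INTEGER, fare REAL, note TEXT)")

def pg_ddl(conn, table):
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    defs = ", ".join(f"{name} {TYPE_MAP.get(ctype.upper(), 'TEXT')}"
                     for _, name, ctype, *_ in cols)
    return f"CREATE TABLE {table} ({defs});"

ddl = pg_ddl(src, "trips")
# Integrity validation step: after copying rows, compare COUNT(*) in both engines
```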

### Exercise 3.1: Oracle with HR Database

Advanced
- Level: Advanced
- Database: HR on native Oracle
- Technologies: Oracle Database, PL/SQL
What you'll learn:
- Install Oracle Database XE
- Oracle-specific syntax
- PL/SQL (procedures, functions)
- Sequences and triggers

### Exercise 5.1: Excel/Python Analysis

Details
- Level: Basic-Intermediate
- Technologies: Python, Pandas, Excel
What you'll learn:
- Read Excel files with Python
- Exploratory Data Analysis (EDA)
- Visualizations with matplotlib/seaborn
- Automate analyses
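A minimal EDA sketch of the pandas side. In the exercise the data would come from `pd.read_excel("sales.xlsx")` (which needs the `openpyxl` engine); inline data and invented column names keep this example self-contained, and plotting is omitted so it runs headless:

```python
import pandas as pd

# Inline stand-in for an Excel sheet (columns are illustrative)
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["north", "south", "north", "south"],
    "sales": [120, 90, 150, 80],
})

summary = df["sales"].describe()                      # quick distribution overview
by_month = df.groupby("month", sort=False)["sales"].sum()  # simple aggregation

# matplotlib/seaborn visualizations would go here, e.g. by_month.plot(kind="bar")
```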

## MODULE 2: Data Cleaning and ETL

### Professional ETL Pipeline - Quality of Government

Details
- Level: Advanced
- Dataset: QoG (1289 variables, 194+ countries)
- Technologies: PostgreSQL, Pandas, psycopg2
What you'll learn:
- Design a modular ETL architecture
- Work with PostgreSQL for longitudinal analysis
- Clean complex datasets (>1000 variables)
- Prepare panel data for econometrics
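"Modular ETL architecture" means each stage lives behind its own function so it can be tested and swapped independently. A minimal sketch of that shape, assuming invented column names; SQLite stands in for the PostgreSQL target so the example runs anywhere, and the real pipeline would use `psycopg2`:

```python
import sqlite3
import pandas as pd

def extract() -> pd.DataFrame:
    # In the exercise: read the QoG CSV (>1000 variables); tiny stand-in here
    return pd.DataFrame({"ccode": ["SWE", "SWE", None],
                         "year": [2019, 2020, 2020],
                         "gdp": [1.1, None, 2.0]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["ccode"])   # drop rows missing the country key
    df["gdp"] = df["gdp"].ffill()      # simplistic imputation, for the sketch only
    return df

def load(df: pd.DataFrame, conn) -> int:
    df.to_sql("qog_panel", conn, index=False, if_exists="replace")
    return conn.execute("SELECT COUNT(*) FROM qog_panel").fetchone()[0]

conn = sqlite3.connect(":memory:")
n_rows = load(transform(extract()), conn)
```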

## MODULE 3: Distributed Processing

### Distributed Processing with Dask

Details
- Level: Intermediate
- Technologies: Dask, Parquet, LocalCluster
What you'll learn:
- Set up a Local Cluster with Dask
- Read Parquet files in a partitioned manner
- Execute complex aggregations in parallel
- Compare performance vs Pandas
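The pattern Dask automates is split-apply-combine: aggregate each partition independently, then merge the partial results. A hand-rolled, standard-library sketch of that idea (not Dask itself; in the exercise the partitions are Parquet row groups and the workers come from `LocalCluster`):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for partitions of a Parquet dataset
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def partial_sum(part):
    # "Apply" step: each worker aggregates only its own partition
    return sum(part)

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)  # "Combine" step: merge the partial aggregates
```

Dask builds exactly this kind of task graph for you (lazily, via `dd.read_parquet(...).sum().compute()`), and scales it from threads to a cluster without changing the user code.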

## MODULE 4: Machine Learning

### Machine Learning in Big Data

Details
- Level: Advanced
- Technologies: Scikit-Learn, PCA, K-Means
- Scripts: PCA Iris, FactoMineR, Breast Cancer, Wine, TF-IDF
What you'll learn:
- Dimensionality reduction with PCA
- Clustering with K-Means and Hierarchical Clustering
- Principal component interpretation
- Cluster profiling
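The PCA-then-K-Means pipeline from the Iris script fits in a few lines. A sketch of that flow (hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA is sensitive to feature scale
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)      # 4 features -> 2 principal components
explained = pca.explained_variance_ratio_.sum()

# Cluster in the reduced space; k=3 matches the three Iris species
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X2)
```

Cluster profiling then means characterizing each `km.labels_` group by the original feature means, and interpretation means reading `pca.components_` to see which features load on each component.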

### Transfer Learning: Flower Classification

Details
- Level: Advanced
- Technologies: TensorFlow, MobileNetV2, Scikit-Learn
- Dataset: TensorFlow Flowers (3,670 images, 5 classes)
What you'll learn:
- Transfer Learning with pre-trained networks (ImageNet)
- Embedding extraction with CNNs
- Image classification with traditional ML (KNN, SVM, Random Forest)
- t-SNE visualization of high-dimensional spaces
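The second half of the pipeline (embeddings in, traditional classifier out) can be sketched without TensorFlow. Below, random vectors with shifted class means stand in for the 1280-dimensional MobileNetV2 embeddings; everything except the 1280-dim shape and the five classes is invented:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MobileNetV2 embeddings: 5 classes, 1280 dims
rng = np.random.default_rng(0)
n_per_class, dim = 40, 1280
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, dim))
               for c in range(5)])
y = np.repeat(np.arange(5), n_per_class)

# Once images are embeddings, classic ML takes over (KNN here; SVM/RF work too)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
```

This is the whole point of transfer learning here: the expensive representation is borrowed from ImageNet, so the classifier you train yourself stays cheap.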

### Time Series: ARIMA/SARIMA

Details
- Level: Advanced
- Dataset: AirPassengers (144 observations, 1949-1960)
- Technologies: statsmodels, Box-Jenkins Methodology
What you'll learn:
- Complete Box-Jenkins methodology (Identification, Estimation, Diagnostics, Forecasting)
- ARIMA and SARIMA models with seasonality
- ACF/PACF for order identification
- Residual diagnostics and forecasts

## MODULE 5: NLP and Text Mining

### NLP and Text Mining

Details
- Level: Advanced
- Technologies: NLTK, TF-IDF, Jaccard, Sentiment Analysis
- Scripts: Counting, Cleaning, Sentiment, Similarity
What you'll learn:
- Tokenization and text cleaning
- Stopword removal
- Jaccard similarity between documents
- Lexicon-based sentiment analysis
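Tokenization, stopword removal, and Jaccard similarity can be shown with the standard library alone; the exercise does the same with NLTK's tokenizers and stopword lists. The tiny stopword set and example sentences below are illustrative:

```python
import re

STOPWORDS = {"the", "a", "of", "and", "is"}  # tiny illustrative list

def tokens(text: str) -> set[str]:
    # Lowercase, keep word characters, drop stopwords
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a: str, b: str) -> float:
    # |intersection| / |union| of the two token sets
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

sim = jaccard("the quality of government", "government quality index")
```

Here the token sets are `{quality, government}` and `{government, quality, index}`, so the similarity is 2/3.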

## MODULE 6: Panel Data Analysis

### Panel Data Analysis

Details
- Level: Advanced
- Datasets: Guns (gun laws), Fatalities (traffic mortality)
- Technologies: linearmodels, Panel OLS, Altair
What you'll learn:
- Panel data: country x year structure
- Fixed Effects vs Random Effects
- Two-Way Fixed Effects
- Hausman test for model selection
- Odds Ratios and Marginal Effects
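The intuition behind Fixed Effects can be shown by hand: demean `y` and `x` within each entity (the "within" transformation), then run OLS on the demeaned data. `linearmodels`' `PanelOLS(..., entity_effects=True)` automates exactly this; the sketch below uses synthetic data with a known coefficient of 2.0:

```python
import numpy as np
import pandas as pd

# Synthetic panel: 20 countries x 10 years, true beta = 2.0,
# plus a country-specific effect that pooled OLS would confound
rng = np.random.default_rng(1)
n_countries, n_years, beta = 20, 10, 2.0
df = pd.DataFrame({
    "country": np.repeat(np.arange(n_countries), n_years),
    "x": rng.normal(size=n_countries * n_years),
})
country_effect = rng.normal(scale=5, size=n_countries)[df["country"]]
df["y"] = beta * df["x"] + country_effect + rng.normal(size=len(df))

# Within transformation: subtract each country's mean, killing the fixed effect
demeaned = df.groupby("country")[["y", "x"]].transform(lambda s: s - s.mean())

# OLS slope on demeaned data recovers beta despite the country effects
beta_hat = (demeaned["x"] @ demeaned["y"]) / (demeaned["x"] @ demeaned["x"])
```

Two-Way Fixed Effects adds the same demeaning over years; the Hausman test then arbitrates between this estimator and Random Effects.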

## MODULE 7: Big Data Infrastructure

### Big Data Infrastructure: Docker and Spark

Details
- Level: Intermediate-Advanced
- Type: Theoretical-Conceptual with practical examples
- Technologies: Docker, Docker Compose, Apache Spark
What you'll learn:
- Docker: containers, images, Dockerfile, orchestration with Compose
- Networks, volumes, healthchecks, production patterns
- Apache Spark: Master-Worker architecture, cluster with Docker
- SparkSession, Lazy Evaluation, DAG, Catalyst optimizer
- Spark + PostgreSQL via JDBC
- From Standalone to production (Kubernetes, EMR, Dataproc)
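As a rough sketch of the orchestration piece, a standalone Spark cluster in Compose comes down to one master service and one or more workers pointed at it. Image, service names, and ports below are illustrative; the exercise's own compose file is the reference:

```yaml
# Minimal sketch of a Spark standalone cluster (illustrative names/versions)
services:
  spark-master:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # master web UI
  spark-worker:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
```

Scaling out is then `docker compose up --scale spark-worker=3`, which is the conceptual bridge to the Kubernetes/EMR/Dataproc deployments mentioned above.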

## MODULE 8: Streaming with Kafka

### Streaming with Apache Kafka

Details
- Level: Advanced
- Technologies: Apache Kafka (KRaft), Python, Spark Streaming
- API: USGS Earthquakes (real-time)
What you'll learn:
- Kafka architecture: Brokers, Topics, Partitions
- KRaft mode (no ZooKeeper)
- Producers and Consumers in Python
- Spark Structured Streaming
- Real-time alert system
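The alert logic itself is independent of the transport: a consumer pulls USGS GeoJSON messages off the topic and flags quakes above a magnitude threshold. Below, one hard-coded message stands in for the stream, and the threshold and message shape are illustrative (the real USGS feed nests `mag` and `place` under `properties`, as here):

```python
import json

ALERT_THRESHOLD = 5.0  # illustrative magnitude cutoff

# Stand-in for a record consumed from the Kafka topic
message = json.dumps({"properties": {"mag": 6.1, "place": "off the coast"}})

def check_alert(raw: str, threshold: float = ALERT_THRESHOLD):
    # Parse the GeoJSON feature and flag strong earthquakes
    quake = json.loads(raw)["properties"]
    if quake["mag"] >= threshold:
        return f"ALERT: M{quake['mag']} {quake['place']}"
    return None

alert = check_alert(message)
```

In the exercise, this function body sits inside the consumer's poll loop (or inside a Spark Structured Streaming `foreachBatch`), with Kafka delivering the messages.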

## MODULE 9: Cloud with LocalStack

### Cloud with LocalStack and Terraform

Details
- Level: Advanced
- Technologies: LocalStack, Terraform, AWS (S3, Lambda, DynamoDB)
- API: ISS Tracker (real-time)
What you'll learn:
- Cloud Computing: IaaS, PaaS, SaaS
- Simulate AWS locally with LocalStack
- Infrastructure as Code with Terraform
- Serverless Lambda functions
- Data Lake architecture (Medallion)
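The Infrastructure-as-Code step against LocalStack boils down to pointing the AWS provider at the local endpoint and declaring resources as usual. A sketch for one S3 bucket (the Medallion bronze layer); bucket name and credentials are illustrative, and port 4566 is LocalStack's default edge port:

```hcl
# Terraform provider aimed at LocalStack instead of real AWS (illustrative names)
provider "aws" {
  region                      = "us-east-1"
  access_key                  = "test"
  secret_key                  = "test"
  skip_credentials_validation = true
  s3_use_path_style           = true
  endpoints {
    s3 = "http://localhost:4566"   # LocalStack edge port
  }
}

resource "aws_s3_bucket" "bronze" {
  bucket = "iss-tracker-bronze"
}
```

The same `terraform apply` workflow then carries over unchanged to real AWS by dropping the `endpoints` block.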

## CAPSTONE PROJECT

### Capstone Project: Big Data Pipeline with Docker

Integrative Project
- Level: Advanced
- Technologies: Docker, Apache Spark, PostgreSQL, QoG
- Evaluation: Infrastructure 30% + ETL 25% + Analysis 25% + AI Reflection 20%
What you'll do:
- Build Docker infrastructure (Spark + PostgreSQL)
- Design and execute an ETL pipeline with Apache Spark
- Analyze QoG data with your own research question
- Document your learning process with AI

## Datasets Used

### NYC Taxi & Limousine Commission (TLC)

- Source: NYC Open Data
- Period: 2021
- Records: 10M+ trips

### Quality of Government (QoG)

- Source: University of Gothenburg
- Variables: 1289 institutional quality indicators
- Countries: 194+ with data since 1946

### AirPassengers

- Source: Box & Jenkins (1976)
- Period: 1949-1960 (144 monthly observations)
- Use: ARIMA/SARIMA time series

## How to Work Through Exercises

### Recommended Workflow

1. Read the full assignment - Do not start coding without reading everything
2. Understand the objectives - What are you expected to achieve?
3. Create a working branch - `git checkout -b your-lastname-exercise-XX`
4. Work in small steps - Do not try to do everything at once
5. Test frequently - Run your code each time you complete a section
6. Make regular commits - Save your progress frequently
7. Push with `git push` - When you finish, the system evaluates your `PROMPTS.md`
---

## Next Steps

Start with the first exercise: Exercise 1.1: Introduction to SQLite

Or jump to the capstone project: Capstone Project: Big Data Pipeline with Docker