Exercises

Complete list of all available exercises in the course.


Exercise Roadmap

Module 1: Databases

| #   | Exercise                        | Technology           | Level              | Status    |
|-----|---------------------------------|----------------------|--------------------|-----------|
| 1.1 | Introduction to SQLite          | SQLite + Pandas      | Basic              | Available |
| 2.1 | PostgreSQL HR                   | PostgreSQL           | Intermediate       | Available |
| 2.2 | PostgreSQL Gardening            | PostgreSQL           | Intermediate       | Available |
| 2.3 | SQLite to PostgreSQL Migration  | PostgreSQL + Python  | Intermediate       | Available |
| 3.1 | Oracle HR                       | Oracle Database      | Advanced           | Available |
| 5.1 | Excel/Python Analysis           | Pandas + Excel       | Basic-Intermediate | Available |

Module 2: Data Cleaning and ETL

| #  | Exercise         | Technology           | Level    | Status    |
|----|------------------|----------------------|----------|-----------|
| 02 | ETL Pipeline QoG | PostgreSQL + Pandas  | Advanced | Available |

Module 3: Distributed Processing

| #  | Exercise                         | Technology     | Level        | Status    |
|----|----------------------------------|----------------|--------------|-----------|
| 03 | Distributed Processing with Dask | Dask + Parquet | Intermediate | Available |

Module 4: Machine Learning

| #     | Exercise                          | Technology                 | Level    | Status    |
|-------|-----------------------------------|----------------------------|----------|-----------|
| 04    | Machine Learning (PCA, K-Means)   | Scikit-Learn, PCA, K-Means | Advanced | Available |
| 04.2  | Transfer Learning Flowers         | TensorFlow, MobileNetV2    | Advanced | Available |
| ARIMA | Time Series ARIMA/SARIMA          | statsmodels, Box-Jenkins   | Advanced | Available |

Module 5: NLP and Text Mining

| #  | Exercise            | Technology                         | Level    | Status    |
|----|---------------------|------------------------------------|----------|-----------|
| 05 | NLP and Text Mining | NLTK, TF-IDF, Jaccard, Sentiment   | Advanced | Available |

Module 6: Panel Data Analysis

| #  | Exercise            | Technology                      | Level    | Status    |
|----|---------------------|---------------------------------|----------|-----------|
| 06 | Panel Data Analysis | linearmodels, Panel OLS, Altair | Advanced | Available |

Module 7: Big Data Infrastructure

| #  | Exercise                | Technology                   | Level                 | Status    |
|----|-------------------------|------------------------------|-----------------------|-----------|
| 07 | Big Data Infrastructure | Docker Compose, Apache Spark | Intermediate-Advanced | Available |

Module 8: Streaming with Kafka

| #  | Exercise             | Technology                            | Level    | Status    |
|----|----------------------|---------------------------------------|----------|-----------|
| 08 | Streaming with Kafka | Apache Kafka, Spark Streaming, KRaft  | Advanced | Available |

Module 9: Cloud with LocalStack

| #  | Exercise              | Technology                 | Level    | Status    |
|----|-----------------------|----------------------------|----------|-----------|
| 09 | Cloud with LocalStack | LocalStack, Terraform, AWS | Advanced | Available |

Capstone Project

| #  | Exercise                      | Technology                        | Level    | Status    |
|----|-------------------------------|-----------------------------------|----------|-----------|
| TF | Capstone Integrative Project  | Docker + Spark + PostgreSQL + QoG | Advanced | Available |

MODULE 1: Databases

Exercise 1.1: Introduction to SQLite

Details

  • Level: Basic
  • Dataset: NYC Taxi (10MB sample)
  • Technologies: SQLite, Pandas

What you'll learn:

  • Load CSV data into a SQLite database
  • Basic SQL queries (SELECT, WHERE, GROUP BY)
  • Optimization with indexes
  • Export results to CSV
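
A minimal sketch of this workflow, assuming hypothetical file and column names (`taxi_sample.csv`, `pickup_datetime`, `fare_amount`):

```python
import sqlite3

import pandas as pd

# Load the CSV sample into SQLite (file and column names are hypothetical)
df = pd.read_csv("taxi_sample.csv")
con = sqlite3.connect("taxi.db")
df.to_sql("trips", con, if_exists="replace", index=False)

# An index speeds up filtered and grouped queries
con.execute("CREATE INDEX IF NOT EXISTS idx_pickup ON trips (pickup_datetime)")

# SELECT / WHERE / GROUP BY, then export the result to CSV
summary = pd.read_sql_query(
    "SELECT passenger_count, COUNT(*) AS n FROM trips "
    "WHERE fare_amount > 10 GROUP BY passenger_count", con)
summary.to_csv("summary.csv", index=False)
```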

View Full Exercise


Exercise 2.1: PostgreSQL with HR Database

Details

  • Level: Intermediate
  • Database: HR (Human Resources) from Oracle
  • Technologies: PostgreSQL, SQL

What you'll learn:

  • Install and configure PostgreSQL
  • Load databases from SQL scripts
  • Complex queries with multiple JOINs
  • PostgreSQL-specific functions
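
For a taste of what the queries look like, here is a hedged psycopg2 sketch over the classic HR schema (connection parameters are placeholders):

```python
import psycopg2

# Placeholder connection parameters for a local PostgreSQL install
con = psycopg2.connect(dbname="hr", user="postgres",
                       password="postgres", host="localhost")
with con, con.cursor() as cur:
    # A multi-JOIN over the HR schema: headcount per department and city
    cur.execute("""
        SELECT d.department_name, l.city, COUNT(e.employee_id) AS headcount
        FROM employees e
        JOIN departments d ON e.department_id = d.department_id
        JOIN locations  l ON d.location_id  = l.location_id
        GROUP BY d.department_name, l.city
        ORDER BY headcount DESC
    """)
    for row in cur.fetchall():
        print(row)
```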

View Full Exercise


Exercise 2.2: PostgreSQL Gardening

Details

  • Level: Intermediate
  • Database: Gardening sales system
  • Technologies: PostgreSQL, Window Functions

What you'll learn:

  • Sales analysis with SQL
  • Complex aggregations (GROUP BY, HAVING)
  • Window Functions for rankings
  • Materialized views
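
A small sketch of a ranking query with a window function; the `sales` table and its columns are illustrative, not the exercise's actual schema:

```python
import psycopg2

# Placeholder connection; table and column names are illustrative
con = psycopg2.connect(dbname="gardening", user="postgres",
                       password="postgres", host="localhost")
with con, con.cursor() as cur:
    # Rank products by revenue within each category
    cur.execute("""
        SELECT category, product, SUM(amount) AS revenue,
               RANK() OVER (PARTITION BY category
                            ORDER BY SUM(amount) DESC) AS revenue_rank
        FROM sales
        GROUP BY category, product
    """)
    print(cur.fetchall())
```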

View Full Exercise


Exercise 2.3: SQLite to PostgreSQL Migration

Details

  • Level: Intermediate
  • Technologies: SQLite, PostgreSQL, Python

What you'll learn:

  • Differences between database engines
  • Migrate schemas and data
  • Adapt data types
  • Validate integrity
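
A minimal migration sketch with an integrity check, assuming a hypothetical `customers` table on both sides:

```python
import sqlite3

import psycopg2

# Paths, credentials and the customers table are hypothetical
src = sqlite3.connect("source.db")
dst = psycopg2.connect(dbname="target", user="postgres",
                       password="postgres", host="localhost")

rows = src.execute("SELECT id, name, created_at FROM customers").fetchall()
with dst, dst.cursor() as cur:
    # SQLite stores dates as TEXT; cast them to TIMESTAMP on the way in
    cur.executemany(
        "INSERT INTO customers (id, name, created_at) "
        "VALUES (%s, %s, %s::timestamp)", rows)
    # Validate integrity: row counts must match after the copy
    cur.execute("SELECT COUNT(*) FROM customers")
    assert cur.fetchone()[0] == len(rows)
```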

View Full Exercise


Exercise 3.1: Oracle with HR Database

Details

  • Level: Advanced
  • Database: HR on native Oracle
  • Technologies: Oracle Database, PL/SQL

What you'll learn:

  • Install Oracle Database XE
  • Oracle-specific syntax
  • PL/SQL (procedures, functions)
  • Sequences and triggers
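
A hedged sketch of calling an anonymous PL/SQL block from Python with the python-oracledb driver; the credentials and the salary update are placeholders (XEPDB1 is XE's default pluggable database):

```python
import oracledb  # python-oracledb in thin mode, no Oracle client needed

# Placeholder credentials for a local Oracle XE instance
con = oracledb.connect(user="hr", password="hr", dsn="localhost/XEPDB1")
with con.cursor() as cur:
    # Anonymous PL/SQL block with a bind variable
    cur.execute("""
        BEGIN
            UPDATE employees SET salary = salary * 1.05
            WHERE department_id = :dept;
        END;""", dept=60)
con.commit()
```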

View Full Exercise


Exercise 5.1: Excel/Python Analysis

Details

  • Level: Basic-Intermediate
  • Technologies: Python, Pandas, Excel

What you'll learn:

  • Read Excel files with Python
  • Exploratory Data Analysis (EDA)
  • Visualizations with matplotlib/seaborn
  • Automate analyses
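
A small sketch of the read-explore-plot loop; the workbook name and its columns are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Workbook name, sheet and columns are hypothetical
df = pd.read_excel("sales.xlsx", sheet_name=0)

# Quick EDA: summary statistics and missing values
print(df.describe())
print(df.isna().sum())

# A first visualization, saved to disk so the analysis can run unattended
df.groupby("region")["total"].sum().plot(kind="bar")
plt.tight_layout()
plt.savefig("sales_by_region.png")
```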

View Full Exercise


MODULE 2: Data Cleaning and ETL

Professional ETL Pipeline - Quality of Government

Details

  • Level: Advanced
  • Dataset: QoG (1289 variables, 194+ countries)
  • Technologies: PostgreSQL, Pandas, psycopg2

What you'll learn:

  • Design a modular ETL architecture
  • Work with PostgreSQL for longitudinal analysis
  • Clean complex datasets (>1000 variables)
  • Prepare panel data for econometrics
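
In outline, the Extract-Transform-Load steps could look like this sketch; the file name, variable pick, and connection string are assumptions (`cname`/`year` are QoG's identifier columns):

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: the QoG time-series CSV (file name is a placeholder)
raw = pd.read_csv("qog_std_ts.csv", low_memory=False)

# Transform: keep identifiers plus a placeholder pick among the 1289 variables
panel = raw[["cname", "year", "vdem_polyarchy"]].dropna()

# Load: write the country-year panel into PostgreSQL (placeholder connection)
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost/qog")
panel.to_sql("qog_panel", engine, if_exists="replace", index=False)
```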

View Full Exercise


MODULE 3: Distributed Processing

Distributed Processing with Dask

Details

  • Level: Intermediate
  • Technologies: Dask, Parquet, LocalCluster

What you'll learn:

  • Set up a Local Cluster with Dask
  • Read Parquet files in a partitioned manner
  • Execute complex aggregations in parallel
  • Compare performance vs Pandas
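
A minimal sketch of the Dask workflow (worker counts, the Parquet path, and column names are illustrative):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Stand up a local cluster; worker/thread counts are illustrative
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# Lazily read a partitioned Parquet dataset (path and columns are hypothetical)
df = dd.read_parquet("data/trips.parquet")

# Building the aggregation only creates a task graph; .compute() runs it in parallel
result = df.groupby("pickup_zone")["fare"].mean().compute()
print(result.head())
```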

View Full Exercise


MODULE 4: Machine Learning

Machine Learning in Big Data

Details

  • Level: Advanced
  • Technologies: Scikit-Learn, PCA, K-Means
  • Scripts: PCA Iris, FactoMineR, Breast Cancer, Wine, TF-IDF

What you'll learn:

  • Dimensionality reduction with PCA
  • Clustering with K-Means and Hierarchical Clustering
  • Principal component interpretation
  • Cluster profiling
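
For a flavor of the PCA Iris script, a minimal PCA + K-Means sketch on the built-in Iris data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then project the 4 features onto 2 principal components
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print("explained variance:", pca.explained_variance_ratio_)

# Cluster in the reduced space; k=3 matches the three iris species
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
```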

View Full Exercise


Transfer Learning: Flower Classification

Details

  • Level: Advanced
  • Technologies: TensorFlow, MobileNetV2, Scikit-Learn
  • Dataset: TensorFlow Flowers (3,670 images, 5 classes)

What you'll learn:

  • Transfer Learning with pre-trained networks (ImageNet)
  • Embedding extraction with CNNs
  • Image classification with traditional ML (KNN, SVM, Random Forest)
  • t-SNE visualization of high-dimensional spaces
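
The core trick is using a frozen ImageNet backbone as a feature extractor and training a cheap classifier on top. A hedged sketch, where `X_train`/`X_test` are hypothetical batches of 224×224 RGB flower images:

```python
import tensorflow as tf
from sklearn.neighbors import KNeighborsClassifier

# Frozen MobileNetV2 backbone: global-average pooling yields 1280-d embeddings
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    pooling="avg", weights="imagenet")

def embed(images):
    # images: float array of shape (n, 224, 224, 3) with values in [0, 255]
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images)
    return base.predict(x, verbose=0)

# X_train/y_train and X_test/y_test are hypothetical image batches and labels:
# knn = KNeighborsClassifier(n_neighbors=5).fit(embed(X_train), y_train)
# print("accuracy:", knn.score(embed(X_test), y_test))
```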

View Interactive Dashboard


Time Series: ARIMA/SARIMA

Details

  • Level: Advanced
  • Dataset: AirPassengers (144 observations, 1949-1960)
  • Technologies: statsmodels, Box-Jenkins Methodology

What you'll learn:

  • Complete Box-Jenkins methodology (Identification, Estimation, Diagnostics, Forecasting)
  • ARIMA and SARIMA models with seasonality
  • ACF/PACF for order identification
  • Residual diagnostics and forecasts
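
A compact sketch of the estimation step, assuming the series is loaded from a hypothetical air.csv; SARIMA(0,1,1)(0,1,1,12) is the classic "airline model" for this dataset:

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Monthly AirPassengers-style series; the file path is a placeholder
y = pd.read_csv("air.csv", index_col=0, parse_dates=True).squeeze()

# The "airline model": SARIMA(0,1,1)(0,1,1,12)
model = SARIMAX(y, order=(0, 1, 1), seasonal_order=(0, 1, 1, 12))
res = model.fit(disp=False)
print(res.summary())

# 12-month-ahead forecast
forecast = res.get_forecast(steps=12).predicted_mean
```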

View Full Exercise



MODULE 5: NLP and Text Mining

NLP and Text Mining

Details

  • Level: Advanced
  • Technologies: NLTK, TF-IDF, Jaccard, Sentiment Analysis
  • Scripts: Counting, Cleaning, Sentiment, Similarity

What you'll learn:

  • Tokenization and text cleaning
  • Stopword removal
  • Jaccard similarity between documents
  • Lexicon-based sentiment analysis
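
A self-contained sketch of the cleaning and Jaccard-similarity steps with NLTK:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stopword lists

def clean(text):
    # Tokenize, lowercase, keep alphabetic tokens, drop stopwords
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    return {t for t in tokens if t not in stopwords.words("english")}

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the two documents' token sets
    return len(a & b) / len(a | b)

print(jaccard(clean("The cat sat on the mat."),
              clean("A cat lay on the mat.")))
```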

View Full Exercise


MODULE 6: Panel Data Analysis

Panel Data Analysis

Details

  • Level: Advanced
  • Datasets: Guns (gun laws), Fatalities (traffic mortality)
  • Technologies: linearmodels, Panel OLS, Altair

What you'll learn:

  • Panel data structure: entity × year (state × year in the Guns and Fatalities datasets)
  • Fixed Effects vs Random Effects
  • Two-Way Fixed Effects
  • Hausman test for model selection
  • Odds Ratios and Marginal Effects
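
A minimal two-way fixed-effects sketch with linearmodels, using a tiny synthetic state × year panel in place of the real Guns data:

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

# Tiny synthetic panel standing in for the Guns dataset
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [["AL", "AK", "AZ"], range(1990, 2000)], names=["state", "year"])
df = pd.DataFrame({"shall": rng.integers(0, 2, len(idx)),
                   "violent_rate": rng.normal(500, 50, len(idx))}, index=idx)

# Two-way fixed effects: entity and time effects absorbed
mod = PanelOLS.from_formula(
    "violent_rate ~ shall + EntityEffects + TimeEffects", data=df)
res = mod.fit(cov_type="clustered", cluster_entity=True)
print(res.summary)
```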

View Full Exercise


MODULE 7: Big Data Infrastructure

Big Data Infrastructure: Docker and Spark

Details

  • Level: Intermediate-Advanced
  • Type: Theoretical-Conceptual with practical examples
  • Technologies: Docker, Docker Compose, Apache Spark

What you'll learn:

  • Docker: containers, images, Dockerfile, orchestration with Compose
  • Networks, volumes, healthchecks, production patterns
  • Apache Spark: Master-Worker architecture, cluster with Docker
  • SparkSession, Lazy Evaluation, DAG, Catalyst optimizer
  • Spark + PostgreSQL via JDBC
  • From Standalone to production (Kubernetes, EMR, Dataproc)
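
A hedged PySpark sketch of connecting to a Dockerized standalone cluster and reading a PostgreSQL table over JDBC (host names, the table, and the driver version are assumptions):

```python
from pyspark.sql import SparkSession

# Master URL assumes a Compose service named spark-master
spark = (SparkSession.builder
         .appName("demo")
         .master("spark://spark-master:7077")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
         .getOrCreate())

# Transformations are lazy: Catalyst plans the DAG, nothing runs yet
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://postgres:5432/demo")  # placeholder DB
      .option("dbtable", "trips")
      .option("user", "postgres")
      .option("password", "postgres")
      .load())

df.groupBy("vendor_id").count().show()  # the action triggers execution
```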

View Full Exercise


MODULE 8: Streaming with Kafka

Streaming with Apache Kafka

Details

  • Level: Advanced
  • Technologies: Apache Kafka (KRaft), Python, Spark Streaming
  • API: USGS Earthquakes (real-time)

What you'll learn:

  • Kafka architecture: Brokers, Topics, Partitions
  • KRaft mode (no ZooKeeper)
  • Producers and Consumers in Python
  • Spark Structured Streaming
  • Real-time alert system
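
A minimal producer sketch with kafka-python; the broker address, topic name, and payload are placeholders for the USGS feed:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumes a KRaft-mode broker on localhost:9092; topic and payload are placeholders
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

producer.send("earthquakes", {"mag": 4.8, "place": "example"})
producer.flush()
```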

View Full Exercise


MODULE 9: Cloud with LocalStack

Cloud with LocalStack and Terraform

Details

  • Level: Advanced
  • Technologies: LocalStack, Terraform, AWS (S3, Lambda, DynamoDB)
  • API: ISS Tracker (real-time)

What you'll learn:

  • Cloud Computing: IaaS, PaaS, SaaS
  • Simulate AWS locally with LocalStack
  • Infrastructure as Code with Terraform
  • Serverless Lambda functions
  • Data Lake architecture (Medallion)
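
A quick boto3 sketch against LocalStack's edge endpoint (port 4566 is LocalStack's default; bucket and key names are placeholders):

```python
import boto3

# LocalStack's default edge endpoint; credentials can be dummy values
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1")

# Bronze layer of a Medallion-style data lake (names are placeholders)
s3.create_bucket(Bucket="iss-raw")
s3.put_object(Bucket="iss-raw", Key="iss/position.json",
              Body=b'{"lat": 0.0, "lon": 0.0}')
```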

View Full Exercise


CAPSTONE PROJECT

Capstone Project: Big Data Pipeline with Docker

Integrative Project

  • Level: Advanced
  • Technologies: Docker, Apache Spark, PostgreSQL, QoG
  • Evaluation: Infrastructure 30% + ETL 25% + Analysis 25% + AI Reflection 20%

What you'll do:

  • Build Docker infrastructure (Spark + PostgreSQL)
  • Design and execute an ETL pipeline with Apache Spark
  • Analyze QoG data with your own research question
  • Document your learning process with AI

View Full Assignment


Datasets Used

NYC Taxi & Limousine Commission (TLC)

  • Source: NYC Taxi & Limousine Commission trip record data
  • Use: Exercise 1.1, Introduction to SQLite (10MB sample)

Quality of Government (QoG)

  • Source: University of Gothenburg
  • Variables: 1289 institutional quality indicators
  • Countries: 194+ with data since 1946

AirPassengers

  • Source: Box & Jenkins (1976)
  • Period: 1949-1960 (144 monthly observations)
  • Use: ARIMA/SARIMA time series

How to Work Through Exercises

  1. Read the full assignment - Do not start coding without reading everything
  2. Understand the objectives - What are you expected to achieve?
  3. Create a working branch - `git checkout -b your-lastname-exercise-XX`
  4. Work in small steps - Do not try to do everything at once
  5. Test frequently - Run your code each time you complete a section
  6. Make regular commits - Save your progress frequently
  7. Push with `git push` - When you finish, the system evaluates your PROMPTS.md

---

Next Steps

Start with the first exercise:

Exercise 1.1: Introduction to SQLite

Or jump to the capstone project:

Capstone Project: Big Data Pipeline