COMPLETE BIG DATA COURSE

TodoEconometria

"Without experience there is no knowledge"

Live Demos

Global Seismic Observatory

Real-time earthquakes from USGS API. Interactive map, magnitude filters, tsunami alerts.

View Live

ISS Tracker

Track the International Space Station in real time. Pass predictor over your city.

View Live

These dashboards update automatically with real data from public APIs.


The Course in Numbers

  • 230 hours of content
  • 9 complete modules
  • 25+ hands-on exercises
  • 12+ interactive dashboards
  • 30+ technologies

Complete Technology Stack

Databases

Technology | Level | What You'll Learn
SQLite | Basic | SQL queries, indexes, optimization
PostgreSQL | Intermediate | Complex joins, window functions, CTEs
Oracle | Advanced | PL/SQL, stored procedures
DynamoDB | Advanced | NoSQL, key-value, serverless

Data Processing

Technology | When to Use It | Scale
Pandas | Exploratory analysis | < 5 GB
Dask | Large datasets, single machine | 5-100 GB
Apache Spark | Clusters, production | > 100 GB
Spark Streaming | Real-time data | Unlimited
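
The scale column above can be read as a rough rule of thumb. A minimal sketch, assuming the boundaries in the table are taken literally; the function name `suggest_tool` is hypothetical, and in practice the choice also depends on available memory and the kind of workload:

```python
def suggest_tool(size_gb: float, streaming: bool = False) -> str:
    """Map a dataset size in GB to the tool suggested by the table above."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if streaming:
        # Real-time data is handled by the streaming engine regardless of size.
        return "Spark Streaming"
    if size_gb < 5:
        return "Pandas"
    if size_gb <= 100:
        return "Dask"
    return "Apache Spark"


print(suggest_tool(2))    # small exploratory dataset
print(suggest_tool(40))   # large dataset on a single machine
print(suggest_tool(500))  # cluster-scale dataset
```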

Streaming and Messaging

Technology | Purpose
Apache Kafka | Distributed streaming, KRaft mode
Spark Structured Streaming | Stream processing
AWS Kinesis | Cloud streaming

Cloud and Infrastructure

Technology | What It Does
Docker | Containers, reproducible environments
Docker Compose | Multi-container orchestration
LocalStack | Local AWS simulation (free)
Terraform | Infrastructure as Code
AWS S3 | Object storage
AWS Lambda | Serverless functions
EventBridge | Task scheduling

Machine Learning and AI

Technology | Application
Scikit-learn | Classic ML, clustering, classification
PCA | Dimensionality reduction
K-Means | Segmentation, clustering
TensorFlow | Deep Learning, neural networks
MobileNetV2 | Transfer Learning, Computer Vision
ARIMA/SARIMA | Time series, forecasting

NLP and Text Mining

Technology | Use
NLTK | Natural language processing
TF-IDF | Text vectorization
Sentiment Analysis | Classifying text as positive, negative, or neutral
Jaccard Similarity | Document similarity

Visualization

Technology | Type
Plotly | Interactive dashboards
Matplotlib | Static charts
Seaborn | Statistical visualization
Leaflet.js | Interactive maps
Altair | Declarative charts

Econometrics

Technology | Application
linearmodels | Panel data
Panel OLS | Fixed and random effects
Hausman Test | Model selection

Course Modules

Module 1: Databases

SQLite, PostgreSQL, Oracle, migrations

From your first SELECT query to Oracle stored procedures. You will learn to design schemas, optimize queries, and migrate between engines.
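
As a taste of the starting point, here is a minimal sketch of a first query using Python's built-in `sqlite3` module. The `trips` table, its columns, and the sample rows are invented for illustration and are not taken from the course exercises:

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips (city, fare) VALUES (?, ?)",
    [("Madrid", 12.5), ("Lima", 8.0), ("Madrid", 20.0), ("Lima", 10.0)],
)

# An index on the grouped/filtered column is a typical first optimization.
conn.execute("CREATE INDEX idx_trips_city ON trips (city)")

# Aggregate query: number of trips and average fare per city.
rows = conn.execute(
    "SELECT city, COUNT(*) AS n, AVG(fare) AS avg_fare "
    "FROM trips GROUP BY city ORDER BY city"
).fetchall()
conn.close()

for city, n, avg_fare in rows:
    print(city, n, avg_fare)
```

The same SQL runs largely unchanged on PostgreSQL, which is part of what makes SQLite a convenient first step before moving to a server engine.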

View Exercises


Module 2: Data Cleaning and ETL

Professional ETL pipeline, QoG Dataset, PostgreSQL

Build a modular ETL pipeline that processes the Quality of Government dataset (1,289 variables, 194+ countries). The pipeline covers cleaning, transformation, and loading into PostgreSQL.
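
The "modular" part usually means one small function per stage. The sketch below illustrates that shape under loud assumptions: the CSV sample and its column names are made up, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server.

```python
import csv
import io
import sqlite3

# Tiny made-up sample standing in for the real dataset file.
RAW_CSV = """cname,year,score
Spain,2020,0.71
Peru,2020,
Spain,2021,0.74
"""


def extract(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing score and cast fields to proper types."""
    clean = []
    for row in rows:
        if not row["score"].strip():
            continue  # skip incomplete observations
        clean.append((row["cname"].strip(), int(row["year"]), float(row["score"])))
    return clean


def load(conn, records):
    """Create the target table if needed and insert the cleaned records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicators (country TEXT, year INTEGER, score REAL)"
    )
    conn.executemany("INSERT INTO indicators VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
loaded = load(conn, transform(extract(RAW_CSV)))
print(f"Loaded {loaded} rows")
```

Keeping each stage as a separate function makes it possible to test the cleaning rules without touching a database.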

View Exercises


Module 3: Distributed Processing

Dask, Parquet, Local Cluster

Process large datasets without needing a cluster. Dask lets you scale pandas to data that doesn't fit in memory, using Parquet and local parallelism.

View Exercises


Module 4: Machine Learning

PCA, K-Means, Transfer Learning, ARIMA/SARIMA

Dimensionality reduction, clustering, image classification with TensorFlow, and time series forecasting with the Box-Jenkins methodology, all with real datasets.
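
To make the clustering idea concrete, here is a from-scratch sketch of K-Means (Lloyd's algorithm) in plain Python. It is an illustration only: the technology stack above lists Scikit-learn for clustering, the sample points are invented, and the initialization (first `k` points) is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Deterministic start: use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Two visually separated groups of made-up 2D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```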

View Exercises


Module 5: NLP and Text Mining

NLTK, TF-IDF, Jaccard, Sentiment Analysis

Tokenization, text cleaning, document similarity, sentiment analysis, and vectorization with TF-IDF.
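
Two of these techniques fit in a few lines of standard-library Python. The sketch below is a simplified illustration, not the course's NLTK-based code: the tokenizer keeps only ASCII letters, the TF-IDF variant is the plain `tf * ln(N / df)` form (libraries often apply smoothing), and the sample documents are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["big data with spark", "big data with dask", "text mining with nltk"]
similarity = jaccard(docs[0], docs[1])
vectors = tf_idf(docs)
print(similarity)
print(vectors[0])
```

A term that appears in every document, such as "with" here, gets a weight of zero, which is exactly why TF-IDF is useful for down-weighting uninformative words.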

View Exercises


Module 6: Panel Data Analysis

Fixed Effects, Random Effects, Hausman Test

Analyze longitudinal data (country × year). Replicate real academic studies on gun laws and traffic mortality.
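
The core idea behind fixed effects can be sketched without any library: subtract each entity's own mean from its observations, then run OLS on the demeaned data. The example below is a from-scratch illustration with a single regressor and synthetic data; the function name `within_slope` is hypothetical, and the course itself lists linearmodels for this kind of estimation.

```python
from collections import defaultdict


def within_slope(rows):
    """Fixed-effects (within) slope for y = a_i + b*x with one regressor.

    rows: iterable of (entity, x, y) tuples.
    """
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))
    num = den = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        # Demeaning within each entity removes its fixed effect a_i.
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    if den == 0:
        raise ValueError("x has no within-entity variation")
    return num / den


# Synthetic panel: y = a_i + 2*x, with a different intercept per entity.
panel = [
    ("A", 1.0, 12.0), ("A", 2.0, 14.0), ("A", 3.0, 16.0),
    ("B", 1.0, 52.0), ("B", 2.0, 54.0), ("B", 3.0, 56.0),
]
slope = within_slope(panel)
print(slope)
```

Because the two entities differ only in their intercepts, the within estimator recovers the common slope even though a pooled regression would be distorted by the level difference whenever the entities also differ in their typical x values.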

View Exercises


Module 7: Big Data Infrastructure

Docker, Docker Compose, Apache Spark, Cluster Computing

Understand how Big Data infrastructure is built: containers, orchestration with Docker Compose, and Spark clusters with a master-worker architecture. This module is the foundation for the Capstone Project.

View Exercises


Module 8: Streaming with Kafka

Apache Kafka, KRaft, Spark Structured Streaming

Real-time streaming with Kafka (KRaft mode, no ZooKeeper). Producers, consumers, Spark Structured Streaming, and a seismic alert system.
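
The producer/consumer pattern behind a seismic alert system can be sketched without a broker. In the example below, a standard-library `queue.Queue` stands in for a Kafka topic, so no Kafka API is involved. The event fields, the place names, and the 6.0 alert threshold are all assumptions made for illustration. A real pipeline would use a Kafka client library and a running broker.

```python
import json
import queue

ALERT_MAGNITUDE = 6.0  # assumed threshold, not taken from the course


def produce(topic, events):
    """Serialize each event to JSON bytes and append it to the 'topic'."""
    for event in events:
        topic.put(json.dumps(event).encode("utf-8"))


def consume_alerts(topic, threshold=ALERT_MAGNITUDE):
    """Drain the 'topic' and return events at or above the threshold."""
    alerts = []
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            break  # nothing left to read
        event = json.loads(message.decode("utf-8"))
        if event["mag"] >= threshold:
            alerts.append(event)
    return alerts


topic = queue.Queue()  # in-process stand-in for a Kafka topic
produce(topic, [
    {"place": "Region A", "mag": 4.2},
    {"place": "Region B", "mag": 6.8},
    {"place": "Region C", "mag": 6.0},
])
alerts = consume_alerts(topic)
for alert in alerts:
    print("ALERT:", alert["place"], alert["mag"])
```

Messages travel as bytes and are decoded by the consumer, which mirrors how producers and consumers stay decoupled: they only need to agree on the message format.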

View Exercises


Module 9: Cloud with LocalStack

LocalStack, Terraform, AWS (S3, Lambda, DynamoDB)

Simulate AWS on your machine at no cost. Infrastructure as Code with Terraform, serverless Lambda functions, and Data Lake architecture.

View Exercises


Capstone Project

Docker + Spark + PostgreSQL + Full Analysis

Integrate everything you have learned into an end-to-end project: infrastructure with Docker, ETL with Spark, and analysis driven by your own research question.

View Assignment


All of these dashboards were created during the course:

  • ARIMA PRO: Bloomberg-style time series. View Dashboard
  • PCA + K-Means: Clustering and dimensionality reduction. View Dashboard
  • Transfer Learning: Flower classification with CNN. View Dashboard
  • Panel Data QoG: Spark + PostgreSQL + ML. View Dashboard

View All Dashboards


Who Is This Course For?

  • All content is free and open source
  • Learn at your own pace with progressive exercises
  • Build a professional portfolio of projects
  • Real dashboards you can showcase in interviews
  • Update your skills to modern technologies
  • From Excel to Spark in weeks, not years
  • Streaming, Cloud, ML - all in a single course
  • Immediately applicable to your job
  • In-company training available
  • Material tested in 230+ hours of in-person classes
  • Real industry use cases
  • Consulting for specific projects

How to Get Started

In-Person Course Students

Read the Submission Guide first to learn how to submit your assignments.

Step 1: Fork and Clone

# Fork on GitHub (button at the top right)
# Then clone YOUR fork:
git clone https://github.com/YOUR_USERNAME/ejercicios-bigdata.git
cd ejercicios-bigdata

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Choose Your Path

If you are... | Start with...
Beginner | Exercise 1.1: SQLite
Intermediate | ETL Pipeline QoG
Advanced | Streaming with Kafka

Instructor

Juan Marcelo Gutierrez Miranda
@TodoEconometria

10+ years in data analysis and Big Data. I have trained hundreds of professionals at companies across Latin America and Spain.

Professional Services

  • In-Company Training: Courses tailored to your team and technologies
  • Big Data Consulting: Pipeline design, data architecture
  • Dashboard Development: Interactive visualizations for your business

Contact:


Contributions

This repository is open source. If you find errors or want to contribute:

  1. Fork the repository
  2. Create a branch for your change
  3. Submit a Pull Request

Your Big Data Career Starts Here

230 hours of content, 30+ technologies, real-time dashboards

Start Now

---

Course: Big Data with Python - From Zero to Production
Instructor: Juan Marcelo Gutierrez Miranda | @TodoEconometria
Hash ID: 4e8d9b1a5f6e7c3d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c