COMPLETE BIG DATA COURSE¶

"Without experience there is no knowledge"

Live Demos¶

Global Seismic Observatory

Real-time earthquakes from USGS API. Interactive map, magnitude filters, tsunami alerts.

View Live
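The magnitude filter and tsunami flag mentioned above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the dashboard's actual code: the function names and the 4.5 default threshold are assumptions, while the URL and the `mag`, `place`, and `tsunami` properties come from the public USGS GeoJSON summary feed.

```python
import json
from urllib.request import urlopen

# Public USGS GeoJSON summary feed (all earthquakes, past day).
USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"


def fetch_feed(url=USGS_FEED, timeout=10):
    """Download and parse the GeoJSON feed (requires network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def filter_quakes(feed, min_magnitude=4.5, tsunami_only=False):
    """Return (magnitude, place) pairs from a GeoJSON FeatureCollection.

    Events without a magnitude are skipped; `tsunami` is the feed's 0/1 flag.
    The 4.5 default is an illustrative threshold, not a course setting.
    """
    results = []
    for feature in feed.get("features", []):
        props = feature.get("properties", {})
        mag = props.get("mag")
        if mag is None or mag < min_magnitude:
            continue
        if tsunami_only and props.get("tsunami") != 1:
            continue
        results.append((mag, props.get("place")))
    # Strongest events first.
    return sorted(results, key=lambda pair: pair[0], reverse=True)
```

Calling `filter_quakes(fetch_feed(), min_magnitude=5.0)` would list the day's stronger events. Keeping the download separate from the filtering makes the filter easy to test with a small hand-written dictionary.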

ISS Tracker

Track the International Space Station in real time. Pass predictor over your city.

View Live

These dashboards update automatically with real data from public APIs.

The Course in Numbers¶

230

Hours of content

9

Complete modules

25+

Hands-on exercises

12+

Interactive dashboards

30+

Technologies

Complete Technology Stack¶

Databases¶

Technology	Level	What You'll Learn
SQLite	Basic	SQL queries, indexes, optimization
PostgreSQL	Intermediate	Complex joins, Window Functions, CTEs
Oracle	Advanced	PL/SQL, stored procedures
DynamoDB	Advanced	NoSQL, key-value, serverless

Data Processing¶

Technology	When to Use It	Scale
Pandas	Exploratory analysis	< 5 GB
Dask	Large datasets, single machine	5-100 GB
Apache Spark	Clusters, production	> 100 GB
Spark Streaming	Real-time data	Unbounded streams
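As a rough rule of thumb, the table can be read as a decision function. The sketch below is only an illustration: `suggest_tool` is a hypothetical helper, and treating the 5 GB and 100 GB boundaries as belonging to Dask is an assumption, since real choices also depend on available memory and cluster access.

```python
def suggest_tool(size_gb, streaming=False):
    """Map a dataset size in GB to the tool suggested by the table above.

    Boundary handling (5 GB and 100 GB both map to Dask) is an assumption.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if streaming:
        # Continuous data has no fixed size, so size_gb is ignored here.
        return "Spark Streaming"
    if size_gb < 5:
        return "Pandas"
    if size_gb <= 100:
        return "Dask"
    return "Apache Spark"


for size in (0.5, 40, 500):
    print(f"{size} GB -> {suggest_tool(size)}")
```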

Streaming and Messaging¶

Technology	Purpose
Apache Kafka	Distributed streaming, KRaft mode
Spark Structured Streaming	Stream processing
AWS Kinesis	Cloud streaming

Cloud and Infrastructure¶

Technology	What It Does
Docker	Containers, reproducible environments
Docker Compose	Multi-container orchestration
LocalStack	Local AWS simulation (free)
Terraform	Infrastructure as Code
AWS S3	Object storage
AWS Lambda	Serverless functions
EventBridge	Task scheduling

Machine Learning and AI¶

Technology	Application
Scikit-learn	Classic ML, clustering, classification
PCA	Dimensionality reduction
K-Means	Segmentation, clustering
TensorFlow	Deep Learning, neural networks
MobileNetV2	Transfer Learning, Computer Vision
ARIMA/SARIMA	Time series, forecasting

NLP and Text Mining¶

Technology	Use
NLTK	Natural language processing
TF-IDF	Text vectorization
Sentiment Analysis	Text polarity classification
Jaccard Similarity	Document similarity

Visualization¶

Technology	Type
Plotly	Interactive dashboards
Matplotlib	Static charts
Seaborn	Statistical visualization
Leaflet.js	Interactive maps
Altair	Declarative charts

Econometrics¶

Technology	Application
linearmodels	Panel data
Panel OLS	Fixed and random effects
Hausman Test	Model selection

Course Modules¶

Module 1: Databases¶

SQLite, PostgreSQL, Oracle, migrations

From your first SELECT query to Oracle stored procedures. You will learn to design schemas, optimize queries, and migrate between engines.

View Exercises

Module 2: Data Cleaning and ETL¶

Professional ETL pipeline, QoG Dataset, PostgreSQL

Build a modular ETL pipeline that processes the Quality of Government dataset (1,289 variables, 194+ countries). Cleaning, transformation, and loading into PostgreSQL.

View Exercises

Module 3: Distributed Processing¶

Dask, Parquet, Local Cluster

Process large datasets without needing a cluster. Dask lets you scale pandas to data that doesn't fit in memory, using Parquet and local parallelism.

View Exercises

Module 4: Machine Learning¶

PCA, K-Means, Transfer Learning, ARIMA/SARIMA

Dimensionality reduction, clustering, image classification with TensorFlow, and time series with Box-Jenkins methodology. All with real datasets.

View Exercises

Module 5: NLP and Text Mining¶

NLTK, TF-IDF, Jaccard, Sentiment Analysis

Tokenization, text cleaning, document similarity, sentiment analysis, and vectorization with TF-IDF.

View Exercises
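To make two of these techniques concrete, here is a minimal pure-Python sketch of Jaccard similarity and TF-IDF weighting. It is an illustration under stated assumptions, not the course's exercise code: the helper names are hypothetical, the tokenizer keeps only ASCII letters, and the IDF is the plain textbook form ln(N / df), whereas libraries such as scikit-learn apply smoothing and normalization by default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf = term count / document length; idf = ln(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document gets weight 0 under this formula, which is why TF-IDF highlights words that distinguish one document from the rest.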

Module 6: Panel Data Analysis¶

Fixed Effects, Random Effects, Hausman Test

Analyze longitudinal data (country × year). Replicate real academic studies on gun laws and traffic mortality.

View Exercises
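The idea behind fixed effects can be sketched without any library. Assuming the simple model y_it = alpha_i + beta * x_it + e_it with a single regressor, subtracting each entity's mean removes alpha_i, and beta is the OLS slope on the demeaned data. The function name and the toy panel below are hypothetical; for real analyses a package such as linearmodels would handle multiple regressors and standard errors.

```python
from collections import defaultdict


def within_estimator(rows):
    """Estimate beta in y_it = alpha_i + beta * x_it + e_it.

    `rows` is an iterable of (entity, x, y) tuples. Each variable is demeaned
    by entity, which removes the entity-specific intercept alpha_i.
    """
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))

    num = 0.0
    den = 0.0
    for obs in groups.values():
        x_mean = sum(x for x, _ in obs) / len(obs)
        y_mean = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - x_mean) * (y - y_mean)
            den += (x - x_mean) ** 2
    if den == 0:
        raise ValueError("x has no within-entity variation")
    return num / den


# Toy panel: both entities share slope 2 but have different intercepts.
panel = [
    ("A", 1.0, 12.0), ("A", 2.0, 14.0), ("A", 3.0, 16.0),  # alpha_A = 10
    ("B", 1.0, 52.0), ("B", 2.0, 54.0), ("B", 3.0, 56.0),  # alpha_B = 50
]
print(within_estimator(panel))  # 2.0
```

A pooled regression on the same toy data would be distorted by the gap between the two intercepts if x differed across entities; demeaning by entity is what isolates the common slope.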

Module 7: Big Data Infrastructure¶

Docker, Docker Compose, Apache Spark, Cluster Computing

Understand how infrastructure is built. Containers, orchestration with Docker Compose, Spark clusters with Master-Worker architecture. The foundation for the Capstone Project.

View Exercises

Module 8: Streaming with Kafka¶

Apache Kafka, KRaft, Spark Structured Streaming

Real-time streaming with Kafka (KRaft mode, no ZooKeeper). Producers, consumers, Spark Structured Streaming, and a seismic alert system.

View Exercises

Module 9: Cloud with LocalStack¶

LocalStack, Terraform, AWS (S3, Lambda, DynamoDB)

Simulate AWS on your machine at no cost. Infrastructure as Code with Terraform, serverless Lambda functions, and Data Lake architecture.

View Exercises

Capstone Project¶

Docker + Spark + PostgreSQL + Full Analysis

Integrate everything you have learned into an end-to-end project: infrastructure with Docker, ETL with Spark, and an analysis driven by your own research question.

View Assignment

Dashboard Gallery¶

All of these dashboards were created during the course:

ARIMA PRO
Bloomberg-style time series
View Dashboard

PCA + K-Means
Clustering and dimensionality reduction
View Dashboard

Transfer Learning
Flower classification with CNN
View Dashboard

Panel Data QoG
Spark + PostgreSQL + ML
View Dashboard

View All Dashboards

Who Is This Course For?¶

Students and Self-Learners

All content is free and open source
Learn at your own pace with progressive exercises
Build a professional portfolio of projects
Real dashboards you can showcase in interviews

Working Professionals

Update your skills to modern technologies
From Excel to Spark in weeks, not years
Streaming, Cloud, ML - all in a single course
Immediately applicable to your job

Companies

In-company training available
Material proven in 230+ hours of in-person classes
Real industry use cases
Consulting for specific projects

How to Get Started¶

In-Person Course Students

Read the Submission Guide first to learn how to submit your assignments.

Step 1: Fork and Clone¶

# Fork on GitHub (button at the top right)
# Then clone YOUR fork:
git clone https://github.com/YOUR_USERNAME/ejercicios-bigdata.git
cd ejercicios-bigdata

Step 2: Install Dependencies¶

pip install -r requirements.txt

Step 3: Choose Your Path¶

If you are...	Start with...
Beginner	Exercise 1.1: SQLite
Intermediate	ETL Pipeline QoG
Advanced	Streaming with Kafka

Instructor¶

Juan Marcelo Gutierrez Miranda
@TodoEconometria

10+ years in data analysis and Big Data. I have trained hundreds of professionals at companies across Latin America and Spain.

Professional Services¶

In-Company Training: Courses tailored to your team and technologies
Big Data Consulting: Pipeline design, data architecture
Dashboard Development: Interactive visualizations for your business

Contact:

Contributions¶

This repository is open source. If you find errors or want to contribute:

Fork the repository
Create a branch for your change
Submit a Pull Request

Your Big Data Career Starts Here

230 hours of content, 30+ technologies, real-time dashboards

Start Now

---

Course: Big Data with Python - From Zero to Production
Instructor: Juan Marcelo Gutierrez Miranda | @TodoEconometria
Hash ID: 4e8d9b1a5f6e7c3d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c