COMPLETE BIG DATA COURSE
"Without experience there is no knowledge"
Live Demos
Global Seismic Observatory
Real-time earthquakes from the USGS API, with an interactive map, magnitude filters, and tsunami alerts.
ISS Tracker
Track the International Space Station in real time and predict its passes over your city.
These dashboards update automatically with real data from public APIs.
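As a rough illustration of the magnitude filter and tsunami flag behind a seismic dashboard, the sketch below parses a GeoJSON feed and keeps only the stronger events. It assumes the USGS hourly GeoJSON summary feed at the URL shown and its `mag`, `place`, and `tsunami` properties. The function names and the sample records are hypothetical and are not taken from the course dashboards.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: USGS GeoJSON summary feed covering the past hour.
FEED_URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"


def fetch_feed(url=FEED_URL, timeout=10):
    """Download and parse the GeoJSON feed (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def filter_quakes(feed, min_magnitude=4.0):
    """Return simplified records for events at or above min_magnitude."""
    quakes = []
    for feature in feed.get("features", []):
        props = feature["properties"]
        mag = props.get("mag")
        if mag is None or mag < min_magnitude:
            continue
        # GeoJSON order is longitude, latitude, then depth in km.
        lon, lat, depth_km = feature["geometry"]["coordinates"]
        quakes.append({
            "place": props.get("place"),
            "magnitude": mag,
            "tsunami_flag": props.get("tsunami") == 1,
            "lat": lat,
            "lon": lon,
            "depth_km": depth_km,
        })
    return quakes


# Hand-written sample in the same shape as the feed (made-up values, not real events).
sample_feed = {
    "features": [
        {"properties": {"mag": 5.8, "place": "Sample location A", "tsunami": 1},
         "geometry": {"coordinates": [-72.5, -35.1, 20.0]}},
        {"properties": {"mag": 2.1, "place": "Sample location B", "tsunami": 0},
         "geometry": {"coordinates": [140.2, 38.3, 10.0]}},
    ]
}

strong_quakes = filter_quakes(sample_feed, min_magnitude=4.0)
```

To try it against live data, replace `sample_feed` with `fetch_feed()`.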
The Course in Numbers
Complete Technology Stack
Databases
| Technology | Level | What You'll Learn |
|---|---|---|
| SQLite | Basic | SQL queries, indexes, optimization |
| PostgreSQL | Intermediate | Complex joins, Window Functions, CTEs |
| Oracle | Advanced | PL/SQL, stored procedures |
| DynamoDB | Advanced | NoSQL, key-value, serverless |
Data Processing
| Technology | When to Use It | Scale |
|---|---|---|
| Pandas | Exploratory analysis | < 5 GB |
| Dask | Large datasets, single machine | 5-100 GB |
| Apache Spark | Clusters, production | > 100 GB |
| Spark Streaming | Real-time data | Unbounded streams |
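The thresholds in the table can be read as a simple rule of thumb. The helper below is a hypothetical sketch that encodes them. The function name is an assumption, and in practice the cut-offs depend on available memory and the workload.

```python
def suggest_tool(size_gb, streaming=False):
    """Map a dataset size in GB to the tool suggested by the table above."""
    if streaming:
        return "Spark Streaming"
    if size_gb < 5:
        return "Pandas"
    if size_gb <= 100:
        return "Dask"
    return "Apache Spark"


suggestions = {size: suggest_tool(size) for size in (1, 40, 500)}
```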
Streaming and Messaging
| Technology | Purpose |
|---|---|
| Apache Kafka | Distributed streaming, KRaft mode |
| Spark Structured Streaming | Stream processing |
| AWS Kinesis | Cloud streaming |
Cloud and Infrastructure
| Technology | What It Does |
|---|---|
| Docker | Containers, reproducible environments |
| Docker Compose | Multi-container orchestration |
| LocalStack | Local AWS simulation (free) |
| Terraform | Infrastructure as Code |
| AWS S3 | Object storage |
| AWS Lambda | Serverless functions |
| EventBridge | Task scheduling |
Machine Learning and AI
| Technology | Application |
|---|---|
| Scikit-learn | Classic ML, clustering, classification |
| PCA | Dimensionality reduction |
| K-Means | Segmentation, clustering |
| TensorFlow | Deep Learning, neural networks |
| MobileNetV2 | Transfer Learning, Computer Vision |
| ARIMA/SARIMA | Time series, forecasting |
NLP and Text Mining
| Technology | Use |
|---|---|
| NLTK | Natural language processing |
| TF-IDF | Text vectorization |
| Sentiment Analysis | Classifying text as positive, negative, or neutral |
| Jaccard Similarity | Document similarity |
Visualization
| Technology | Type |
|---|---|
| Plotly | Interactive dashboards |
| Matplotlib | Static charts |
| Seaborn | Statistical visualization |
| Leaflet.js | Interactive maps |
| Altair | Declarative charts |
Econometrics
| Technology | Application |
|---|---|
| linearmodels | Panel data |
| Panel OLS | Fixed and random effects |
| Hausman Test | Model selection |
Course Modules
Module 1: Databases
SQLite, PostgreSQL, Oracle, migrations
From your first SELECT query to Oracle stored procedures. You will learn to design schemas, optimize queries, and migrate between engines.
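To make the link between indexes and query optimization concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its rows, and the helper `query_plan` are hypothetical. The sketch compares the query plan SQLite reports before and after creating an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5), ("east", 60.0)],
)

query = "SELECT SUM(amount) FROM sales WHERE region = ?"


def query_plan(connection, sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, query, ("north",))

# With an index on the filtered column, the plan can use an index search.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(conn, query, ("north",))

total_north = conn.execute(query, ("north",)).fetchone()[0]
```

Printing `plan_before` and `plan_after` shows the change from a full scan to an index search. The exact wording of the plan varies between SQLite versions.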
Module 2: Data Cleaning and ETL
Professional ETL pipeline, QoG Dataset, PostgreSQL
Build a modular ETL pipeline that processes the Quality of Government dataset (1,289 variables, 194+ countries). Cleaning, transformation, and loading into PostgreSQL.
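A modular pipeline usually separates extraction, transformation, and loading into independent functions. The sketch below shows that shape under several assumptions: the column names and the fictional rows are invented for illustration (they are not the QoG schema), and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server.

```python
import csv
import io
import sqlite3

# Made-up sample with fictional countries, only to exercise the pipeline.
RAW_CSV = """country,year,indicator
Aland,2020,0.71
Aland,2021,
Borduria,2020,0.43
borduria ,2021,0.47
"""


def extract(text):
    """Read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing indicator and normalize names and types."""
    clean = []
    for row in rows:
        value = row["indicator"].strip()
        if not value:
            continue
        clean.append((row["country"].strip().title(), int(row["year"]), float(value)))
    return clean


def load(records, connection):
    """Insert cleaned records and return the resulting row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS indicators (country TEXT, year INTEGER, value REAL)"
    )
    connection.executemany("INSERT INTO indicators VALUES (?, ?, ?)", records)
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM indicators").fetchone()[0]


def run_pipeline(text, connection):
    return load(transform(extract(text)), connection)


etl_conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
rows_loaded = run_pipeline(RAW_CSV, etl_conn)
```

Keeping each stage as a plain function makes it possible to test the cleaning rules without touching a database.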
Module 3: Distributed Processing
Dask, Parquet, Local Cluster
Process large datasets without needing a cluster. Dask lets you scale pandas to data that doesn't fit in memory, using Parquet and local parallelism.
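The idea behind partitioned processing can be sketched with the standard library alone: split the data into partitions, compute a small partial result per partition in parallel, then combine the partials. This is not the Dask API, and all names are hypothetical. A real Dask workflow would typically read each partition from a Parquet file instead of slicing an in-memory list.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_stats(chunk):
    """Per-partition result: small enough to combine cheaply."""
    return sum(chunk), len(chunk)


def parallel_mean(values, partition_size=1000, workers=4):
    """Compute a mean from per-partition sums and counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, partition(values, partition_size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(10_000))
mean_value = parallel_mean(data)
```

Only the `(sum, count)` pairs need to be held together at the end, which is why the same pattern works when the full dataset does not fit in memory.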
Module 4: Machine Learning
PCA, K-Means, Transfer Learning, ARIMA/SARIMA
Dimensionality reduction, clustering, image classification with TensorFlow, and time series with Box-Jenkins methodology. All with real datasets.
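To show what clustering does at its core, here is a minimal pure-Python sketch of the K-Means loop (assign each point to its nearest centroid, then recompute centroids). The points and starting centroids are invented for illustration. In the course setting a library such as Scikit-learn would normally be used instead.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain K-Means with fixed starting centroids (no random initialization)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```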
Module 5: NLP and Text Mining
NLTK, TF-IDF, Jaccard, Sentiment Analysis
Tokenization, text cleaning, document similarity, sentiment analysis, and vectorization with TF-IDF.
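Two of the techniques listed here fit in a few lines of plain Python. The sketch below uses a simple regular-expression tokenizer (an assumption, not NLTK's tokenizer), Jaccard similarity on token sets, and the textbook TF-IDF weighting tf × ln(N / df). Libraries such as Scikit-learn apply a smoothed variant, so their numbers will differ. The example sentences are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the quake shook the coast",
    "the quake was strong",
    "markets closed higher",
]
similarity = jaccard(docs[0], docs[1])
weights = tf_idf(docs)
```

Terms that appear in only one document, such as "markets", receive the highest weights, while terms shared across documents are discounted.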
Module 6: Panel Data Analysis
Fixed Effects, Random Effects, Hausman Test
Analyze longitudinal data (country × year). Replicate real academic studies on gun laws and traffic mortality.
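The fixed-effects idea can be illustrated without any library: subtract each entity's own mean from its observations (the within transformation) and fit a slope on what remains. The sketch below does this for a single regressor on synthetic data built as y = entity effect + 2·x. The data and function name are invented for illustration and do not come from the studies mentioned above.

```python
from collections import defaultdict


def within_estimator(rows):
    """rows: iterable of (entity, x, y). Returns the fixed-effects slope."""
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))
    num = den = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        # Demeaning removes anything constant within the entity.
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    return num / den


# Synthetic panel: entity A has effect +10, entity B has effect -5, true slope is 2.
panel = [
    ("A", 1, 12), ("A", 2, 14), ("A", 3, 16),
    ("B", 4, 3), ("B", 5, 5), ("B", 6, 7),
]

fe_slope = within_estimator(panel)
# Pooled OLS ignores entities: treat every row as one group.
pooled_slope = within_estimator([("all", x, y) for _, x, y in panel])
```

On this data the pooled slope comes out negative because the entity effects are correlated with x, while the within estimator recovers the slope of 2.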
Module 7: Big Data Infrastructure
Docker, Docker Compose, Apache Spark, Cluster Computing
Understand how infrastructure is built. Containers, orchestration with Docker Compose, Spark clusters with Master-Worker architecture. The foundation for the Capstone Project.
Module 8: Streaming with Kafka
Apache Kafka, KRaft, Spark Structured Streaming
Real-time streaming with Kafka (KRaft mode, no ZooKeeper). Producers, consumers, Spark Structured Streaming, and a seismic alert system.
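One common building block of a streaming alert system is the tumbling window: events are grouped into fixed, non-overlapping time buckets and an alert fires when a bucket crosses a threshold. The sketch below shows that logic on an in-memory list that stands in for messages consumed from a topic. It uses neither the Kafka nor the Spark API, and the event values, thresholds, and function names are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60, min_magnitude=5.0):
    """Count strong events per window. events: iterable of (timestamp_seconds, magnitude)."""
    counts = Counter()
    for timestamp, magnitude in events:
        if magnitude >= min_magnitude:
            # Align the timestamp to the start of its window.
            window_start = int(timestamp // window_seconds) * window_seconds
            counts[window_start] += 1
    return dict(sorted(counts.items()))


def windows_to_alert(counts, threshold=2):
    """Return the start times of windows with at least `threshold` strong events."""
    return [start for start, n in counts.items() if n >= threshold]


# Made-up stream: (seconds since start, magnitude).
stream = [(5, 5.2), (42, 4.1), (58, 6.0), (75, 5.5), (130, 3.9)]
window_counts = tumbling_window_counts(stream)
alerts = windows_to_alert(window_counts)
```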
Module 9: Cloud with LocalStack
LocalStack, Terraform, AWS (S3, Lambda, DynamoDB)
Simulate AWS on your machine at no cost. Infrastructure as Code with Terraform, serverless Lambda functions, and Data Lake architecture.
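A serverless function in a data lake is often triggered when a file lands in a bucket. The sketch below shows a handler with the `(event, context)` signature that AWS Lambda uses for Python functions, reading the bucket and key from an S3 notification event. The bucket name, key, and return structure are hypothetical, and the handler only inspects the event rather than calling any AWS service.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the bucket and key of every S3 object referenced in the event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        objects.append({
            "bucket": s3_info["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3_info["object"]["key"]),
            "size": s3_info["object"].get("size"),
        })
    return {"statusCode": 200, "processed": len(objects), "objects": objects}


# Hand-written event in the shape of an S3 notification (hypothetical names).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/quakes+2024.json", "size": 1024}}}
    ]
}
response = handler(sample_event, None)
```

Because the handler is a plain function, it can be tested locally with a sample event before it is deployed to LocalStack or AWS.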
Capstone Project
Docker + Spark + PostgreSQL + Full Analysis
Integrate everything you have learned into an end-to-end project. Infrastructure with Docker, ETL with Spark, analysis with your own research question.
Dashboard Gallery
All of these dashboards were created during the course:
Who Is This Course For?
- All content is free and open source
- Learn at your own pace with progressive exercises
- Build a professional portfolio of projects
- Real dashboards you can showcase in interviews
- Update your skills to modern technologies
- From Excel to Spark in weeks, not years
- Streaming, Cloud, ML - all in a single course
- Immediately applicable to your job
- In-company training available
- Material proven in 230+ hours of in-person classes
- Real industry use cases
- Consulting for specific projects
How to Get Started
In-Person Course Students
Read the Submission Guide first to learn how to submit your assignments.
Step 1: Fork and Clone
# Fork on GitHub (button at the top right)
# Then clone YOUR fork:
git clone https://github.com/YOUR_USERNAME/ejercicios-bigdata.git
cd ejercicios-bigdata
Step 2: Install Dependencies
Step 3: Choose Your Path
| If you are... | Start with... |
|---|---|
| Beginner | Exercise 1.1: SQLite |
| Intermediate | ETL Pipeline QoG |
| Advanced | Streaming with Kafka |
Instructor
@TodoEconometria
10+ years in data analysis and Big Data. I have trained hundreds of professionals at companies across Latin America and Spain.
Professional Services
- In-Company Training: Courses tailored to your team and technologies
- Big Data Consulting: Pipeline design, data architecture
- Dashboard Development: Interactive visualizations for your business
Contact:
- Email: cursos@todoeconometria.com
- LinkedIn: Juan Gutierrez
- Web: www.todoeconometria.com
Contributions
This repository is open source. If you find errors or want to contribute:
- Fork the repository
- Create a branch for your change
- Submit a Pull Request
Your Big Data Career Starts Here
230 hours of content, 30+ technologies, real-time dashboards
Course: Big Data with Python - From Zero to Production
Instructor: Juan Marcelo Gutierrez Miranda | @TodoEconometria