Course Roadmap

Complete overview of all exercises, technologies, and the recommended learning plan.

Learning Levels

graph TD
    A[LEVEL 1: Fundamentals<br/>2-3 weeks] --> B[LEVEL 2: Scaling Up<br/>3-4 weeks]
    B --> C[LEVEL 3: Real Big Data<br/>4-5 weeks]
    C --> D[LEVEL 4: Visualization<br/>3-4 weeks]

    A1[SQLite<br/>Pandas<br/>Git/GitHub] --> A
    B1[Dask<br/>Parquet<br/>Optimization] --> B
    C1[PySpark<br/>Advanced SQL<br/>Pipelines] --> C
    D1[Dashboards<br/>APIs<br/>Deploy] --> D

LEVEL 1: Fundamentals

Duration: 2-3 weeks | Difficulty: Basic

Objectives

  • Master relational databases with SQLite
  • Learn data analysis with Pandas
  • Understand version control with Git/GitHub

Technologies

| Technology | Purpose | Resources |
|------------|---------|-----------|
| SQLite | Embedded database | Official docs |
| Pandas | In-memory data analysis | Pandas docs |
| Git | Version control | Git handbook |

Exercises

Exercise 01: Data Loading with SQLite

Details

  • Estimated time: 2-3 hours
  • Dataset: NYC Taxi (10MB sample)
  • Level: Basic

What you will learn:

  • Load data from CSV into a database
  • Basic SQL queries (SELECT, WHERE, GROUP BY)
  • Optimization with indexes
  • Export results

Skills:

  • Load CSV in chunks
  • Create SQLite database
  • Run SQL queries
  • Create indexes
  • Export results to CSV
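
A minimal sketch of that workflow, assuming a hypothetical `taxi_sample.csv` with the usual NYC Taxi columns (`fare_amount`, `passenger_count`):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("taxi.db")

# Load the CSV in chunks so memory use stays bounded
for chunk in pd.read_csv("taxi_sample.csv", chunksize=50_000):
    chunk.to_sql("trips", conn, if_exists="append", index=False)

# Index the column we filter on most often
conn.execute("CREATE INDEX IF NOT EXISTS idx_fare ON trips(fare_amount)")

# Run a basic aggregation and export the result to CSV
result = pd.read_sql_query(
    "SELECT passenger_count, AVG(fare_amount) AS avg_fare "
    "FROM trips GROUP BY passenger_count",
    conn,
)
result.to_csv("avg_fare_by_passengers.csv", index=False)
conn.close()
```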

Exercise 02: Cleaning and Transformation

Details

  • Estimated time: 3-4 hours
  • Dataset: NYC Taxi (dirty data)
  • Level: Basic

What you will learn:

  • Detect and handle null values
  • Identify outliers
  • Data transformations
  • Type validation
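
A short Pandas sketch of those steps; the file and column names are placeholders for the dirty sample:

```python
import pandas as pd

df = pd.read_csv("taxi_dirty.csv")  # hypothetical filename

# Detect and handle null values
print(df.isna().sum())
df = df.dropna(subset=["fare_amount"])         # drop rows missing the key metric
df["tip_amount"] = df["tip_amount"].fillna(0)  # impute a sensible default

# Identify outliers with the IQR rule and keep only plausible fares
q1, q3 = df["fare_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["fare_amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Transform and validate types
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], errors="coerce")
df["passenger_count"] = df["passenger_count"].astype("int64")
```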

LEVEL 2: Scaling Up

Duration: 3-4 weeks | Difficulty: Intermediate

Objectives

  • Process data larger than your RAM
  • Understand parallel processing
  • Optimize performance

Technologies

| Technology | Purpose | When to Use |
|------------|---------|-------------|
| Dask | Parallel processing | Data > RAM (5-100 GB) |
| Parquet | Columnar format | Efficient storage |
| Optimization | Performance | Always |

Exercises

Exercise 03: Processing with Parquet and Dask

Details

  • Estimated time: 4-5 hours
  • Dataset: Full NYC Taxi (121MB)
  • Level: Intermediate

What you will learn:

  • Why Parquet is better than CSV
  • Parallel processing with Dask
  • Lazy evaluation
  • Memory optimization

Format Comparison:

| Metric | CSV | Parquet |
|--------|-----|---------|
| Disk size | 121 MB | 45 MB |
| Read time | 8.5 s | 1.2 s |
| Compression | No | Yes |
| Data types | Not preserved | Preserved |
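
A sketch of the one-time conversion and a lazy Dask computation (paths are placeholders; exact timings vary by machine):

```python
import dask.dataframe as dd

# One-time conversion: Parquet is compressed, columnar, and keeps dtypes
dd.read_csv("yellow_tripdata.csv").to_parquet("yellow_tripdata.parquet")

# Lazy evaluation: nothing is read or computed yet
trips = dd.read_parquet("yellow_tripdata.parquet")
avg_fare = trips.groupby("passenger_count")["fare_amount"].mean()

# .compute() triggers parallel execution across partitions
print(avg_fare.compute())
```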

LEVEL 3: Real Big Data

Duration: 4-5 weeks | Difficulty: Advanced

Objectives

  • Master distributed processing
  • Work with massive data (>100GB)
  • Build production pipelines

Technologies

| Technology | Purpose | Scale |
|------------|---------|-------|
| PySpark | Distributed processing | > 100 GB |
| Advanced SQL | Complex queries | Any size |
| ETL Pipelines | Automation | Production |

Exercises

Exercise 04: Complex Queries with PySpark

Details

  • Estimated time: 5-6 hours
  • Dataset: NYC Taxi + Weather (multiple sources)
  • Level: Advanced

What you will learn:

  • Introduction to Spark
  • Distributed DataFrames
  • SQL in Spark
  • Multi-source joins
  • Data partitioning
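
A minimal sketch of a multi-source join in PySpark; the weather file and its columns are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("taxi-weather").getOrCreate()

trips = spark.read.parquet("yellow_tripdata.parquet")
weather = spark.read.csv("weather.csv", header=True, inferSchema=True)

# Join the two sources on the trip date
joined = (
    trips.withColumn("date", F.to_date("pickup_datetime"))
         .join(weather, on="date", how="left")
)

# SQL in Spark: register a view and query it
joined.createOrReplaceTempView("trips_weather")
spark.sql("""
    SELECT precipitation, AVG(fare_amount) AS avg_fare
    FROM trips_weather
    GROUP BY precipitation
""").show()

# Partition the output by date for efficient downstream reads
joined.write.mode("overwrite").partitionBy("date").parquet("trips_weather.parquet")
```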

Exercise 06: Complete ETL Pipeline

Details

  • Estimated time: 10-12 hours
  • Dataset: Multiple sources
  • Level: Advanced

Pipeline Architecture:

graph LR
    A[CSV 100GB] -->|Extract| B[Dask]
    B -->|Transform| C[PySpark]
    C -->|Load| D[Parquet 10GB]
    D -->|Serve| E[API Flask]
    E -->|Visualize| F[Dashboard]
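
One way to express that architecture as a runnable skeleton; every path and function name here is illustrative, not the course's reference solution:

```python
import dask.dataframe as dd
from pyspark.sql import SparkSession


def extract(csv_path: str, staging_path: str) -> None:
    """Extract: read the raw CSV with Dask and stage it as Parquet."""
    dd.read_csv(csv_path).to_parquet(staging_path)


def transform(staging_path: str, output_path: str) -> None:
    """Transform + Load: clean with PySpark and write compact Parquet."""
    spark = SparkSession.builder.appName("etl").getOrCreate()
    df = spark.read.parquet(staging_path)
    df.filter(df.fare_amount > 0).write.mode("overwrite").parquet(output_path)


if __name__ == "__main__":
    extract("raw/trips.csv", "staging/trips")      # hypothetical paths
    transform("staging/trips", "warehouse/trips")  # served later via the Flask API
```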

LEVEL 4: Visualization and Deployment

Duration: 3-4 weeks | Difficulty: Advanced

Objectives

  • Create professional dashboards
  • Serve data via API
  • Deploy to production

Technologies

| Technology | Purpose | Use |
|------------|---------|-----|
| Flask | Web backend | APIs and dashboards |
| Chart.js | Visualizations | Interactive charts |
| Docker | Containers | Deployment |

Exercises

Exercise 05: Interactive Dashboard

Details

  • Estimated time: 8-10 hours
  • Project: NYC Taxi EDA Dashboard
  • Level: Advanced

Features:

  • 📊 Visualization of 10M+ records
  • 📆 Dynamic filters by date/time
  • 🗺 Heat maps
  • 📈 Trend analysis

Tech Stack:

Frontend: HTML + Bootstrap + Chart.js
Backend:  Flask + Pandas/Dask
Data:     SQLite/Parquet
Deploy:   Docker
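
As a sketch of where the backend meets Chart.js, a hypothetical Flask endpoint that returns pre-aggregated JSON:

```python
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/api/trips-per-hour")
def trips_per_hour():
    # Aggregate in SQLite and return JSON that Chart.js can plot directly
    conn = sqlite3.connect("taxi.db")  # hypothetical database file
    rows = conn.execute(
        "SELECT strftime('%H', pickup_datetime) AS hour, COUNT(*) AS trips "
        "FROM trips GROUP BY hour ORDER BY hour"
    ).fetchall()
    conn.close()
    return jsonify({"labels": [r[0] for r in rows], "data": [r[1] for r in rows]})


if __name__ == "__main__":
    app.run(debug=True)
```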

For Beginners (10-12 weeks)

gantt
    title Study Plan - Beginners
    dateFormat YYYY-MM-DD
    section Fundamentals
    Exercise 01           :2024-01-01, 1w
    Exercise 02           :2024-01-08, 1w
    Fundamentals Practice :2024-01-15, 1w
    section Scaling Up
    Exercise 03           :2024-01-22, 2w
    Personal Project      :2024-02-05, 1w
    section Big Data
    Exercise 04           :2024-02-12, 2w
    Exercise 06           :2024-02-26, 2w
    section Visualization
    Exercise 05           :2024-03-11, 2w

Dedication: 10-15 hours/week

For Intermediate Learners (6-8 weeks)

Recommendation

If you already know Python and Pandas, you can start directly at LEVEL 2.

Dedication: 8-10 hours/week

For Advanced Learners (4-5 weeks)

Recommendation

If you have already worked with Big Data, focus on the PySpark exercises and the final project.

Dedication: 5-8 hours/week


Technologies by Exercise

| Exercise | SQLite | Pandas | Dask | PySpark | Flask | Level |
|----------|--------|--------|------|---------|-------|-------|
| 01 - SQLite | Yes | Yes | - | - | - | Basic |
| 02 - Cleaning | - | Yes | - | - | - | Basic |
| 03 - Dask | - | Yes | Yes | - | - | Intermediate |
| 04 - PySpark | - | - | Yes | Yes | - | Advanced |
| 05 - Dashboard | Yes | Yes | - | - | Yes | Advanced |
| 06 - Pipeline | - | - | Yes | Yes | Yes | Advanced |

Technology Comparison

When to use each tool?

graph TD
    A[Do you have data?] --> B{How large is it?}
    B -->|< 5GB| C[Pandas]
    B -->|5-100GB| D[Dask]
    B -->|> 100GB| E[PySpark]

    C --> F{Need a DB?}
    D --> F
    E --> F

    F -->|Yes, local| G[SQLite]
    F -->|Yes, production| H[PostgreSQL/MySQL]
    F -->|No| I[Parquet]

Comparison Table

| Data Size | Tool | Processing Time | RAM Required |
|-----------|------|-----------------|--------------|
| < 1 GB | Pandas | Seconds | 2-4x data size |
| 1-5 GB | Pandas | Minutes | 2-4x data size |
| 5-50 GB | Dask | Minutes | Any (processes out-of-core) |
| 50-500 GB | Dask/PySpark | Minutes-hours | Any (out-of-core) or cluster |
| > 500 GB | PySpark | Hours | Cluster |

Certification and Evaluation

For In-Person Course Students

  • ✅ 230-hour certificate
  • ✅ Automatic evaluation via PROMPTS.md
  • ✅ Integrative final project
  • ✅ Direct instructor support

For Self-Learners

  • ✅ Project portfolio on GitHub
  • ✅ Code reviewable by employers
  • ✅ Experience with real data
  • ✅ Learn at your own pace

Your GitHub Is Your Certificate

Employers value seeing your code and projects more than a PDF. Make sure to:

  • Make clear and professional commits
  • Document your code
  • Complete exercises with quality
  • Add a personalized README to your fork

Additional Resources

Official Documentation

Complementary Courses

Communities

---

Next Steps

Now that you know the complete roadmap:

  1. Install Tools - If you don't have them yet
  2. Your First Exercise - Start practicing
  3. Fork and Clone - Set up your work environment