Exercise 01: Introduction to SQLite‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‍‌‌‌‍‌‌‌‌‍‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌¶

Learn how to load and query data using SQLite and Pandas.‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‍‌‌‌‍‌‌‌‌‍‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌

General Information¶

Field	Value
Level	🟢 Basic
Estimated time	2-3 hours
Technologies	Python, SQLite, Pandas
Dataset	NYC Taxi (10MB sample)
Prerequisites	Basic Python, basic SQL knowledge

Learning Objectives¶

Upon completing this exercise, you will be able to:

Load large CSV files into SQLite using chunks
Create and manage SQLite databases
Execute basic and advanced SQL queries
Optimize performance with indexes
Export query results to CSV

The Problem¶

You have a CSV file with 100,000 records of NYC taxi trips. If you try to load it all into memory with Pandas, your computer may run out of memory.

Your mission: Load this data into a SQLite database efficiently and perform analysis.

Dataset¶

NYC Taxi Trip Records¶

File: datos/muestra_taxi.csv
Size: ~10 MB
Records: ~100,000 trips
Period: January 2021

Data Structure¶

Columnas:
- tpep_pickup_datetime    # Fecha/hora de inicio del viaje
- tpep_dropoff_datetime   # Fecha/hora de fin del viaje
- passenger_count         # Numero de pasajeros
- trip_distance           # Distancia en millas
- pickup_longitude        # Longitud de origen
- pickup_latitude         # Latitud de origen
- dropoff_longitude       # Longitud de destino
- dropoff_latitude        # Latitud de destino
- payment_type            # Tipo de pago (1=Credit, 2=Cash, ...)
- fare_amount             # Tarifa base
- tip_amount              # Propina
- total_amount            # Total pagado

Tasks¶

Task 1: Load CSV into SQLite in Chunks¶

Objective: Load the complete CSV into SQLite without running out of memory.

Requirements:

Use pandas.read_csv() with the chunksize parameter
Process the CSV in chunks of 10,000 records
Insert each chunk into the trips table
Display loading progress

Initial code example:

import sqlite3
import pandas as pd

def cargar_datos_sqlite(csv_path, db_path, chunksize=10000):
    """
    Carga un CSV grande a SQLite en chunks

    Args:
        csv_path: Ruta al archivo CSV
        db_path: Ruta a la base de datos SQLite
        chunksize: Numero de registros por chunk
    """
    conn = sqlite3.connect(db_path)

    # TODO: Implementar carga por chunks
    # Pista: usa pd.read_csv con chunksize
    # y itera sobre los chunks

    conn.close()

Hint

chunks = pd.read_csv(csv_path, chunksize=chunksize)
for i, chunk in enumerate(chunks):
    # Insertar chunk en SQLite
    chunk.to_sql('trips', conn, if_exists='append', index=False)
    print(f"Chunk {i+1} cargado")

Task 2: Create Indexes¶

Objective: Optimize queries by adding indexes to frequently queried columns.

Requirements:

Create an index on tpep_pickup_datetime
Create an index on payment_type
Measure query time before and after creating indexes

Example:

def crear_indices(db_path):
    """
    Crea indices para optimizar queries
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # TODO: Crear indices
    # Ejemplo:
    # cursor.execute("CREATE INDEX idx_pickup ON trips(tpep_pickup_datetime)")

    conn.commit()
    conn.close()

Task 3: Analysis Queries¶

Objective: Extract insights from the data using SQL.

Query 1: Average Revenue by Hour of Day¶

Calculate the average revenue for each hour of the day.

Expected SQL:

SELECT
    strftime('%H', tpep_pickup_datetime) as hora,
    AVG(total_amount) as promedio_ingreso,
    COUNT(*) as num_viajes
FROM trips
GROUP BY hora
ORDER BY hora

Expected result:

hora  promedio_ingreso  num_viajes
00    15.23             2340
01    14.89             1982
02    16.45             1657
...

Query 2: Payment Method Distribution¶

Calculate the percentage of each payment method.

SELECT
    payment_type,
    COUNT(*) as total,
    ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM trips), 2) as porcentaje
FROM trips
GROUP BY payment_type
ORDER BY total DESC

Query 3: Top 10 Most Profitable Routes¶

Find the routes (pickup → dropoff) with the highest average revenue.

Challenge

This query requires grouping by rounded coordinates. Think about how to do it.

Task 4: Export Results¶

Objective: Save your analysis results to CSV files.

def exportar_resultados(db_path, query, output_path):
    """
    Ejecuta una query y exporta a CSV

    Args:
        db_path: Ruta a la BD SQLite
        query: Query SQL a ejecutar
        output_path: Ruta del CSV de salida
    """
    conn = sqlite3.connect(db_path)

    # TODO: Ejecutar query y exportar
    # Pista: usa pd.read_sql_query() y df.to_csv()

    conn.close()

Evaluation Criteria¶

Functionality (40 points)¶

Complete data loading without errors (10 pts)
Indexes created correctly (10 pts)
Queries execute and return correct results (15 pts)
Export works (5 pts)

Clean Code (30 points)¶

Well-documented functions (10 pts)
Readable and organized code (10 pts)
Error handling (10 pts)

Performance (20 points)¶

Efficient loading with chunks (10 pts)
Indexes improve performance measurably (10 pts)

Analysis (10 points)¶

Interpretation of results (5 pts)
Additional insights (5 pts)

Deliverables¶

You must submit by pushing to your fork (git push):

Python code: 01_cargar_sqlite.py
Database: datos/taxi.db (DO NOT upload to GitHub, too large)
Results: resultados/ folder with exported CSVs
Analysis: ANALISIS.md with your findings

Folder structure¶

entregas/01_sqlite/tu_apellido_nombre/
├── 01_cargar_sqlite.py
├── resultados/
│   ├── ingresos_por_hora.csv
│   ├── distribucion_pagos.csv
│   └── top_rutas.csv
└── ANALISIS.md

Partial Solution Example¶

Sample Code

import sqlite3
import pandas as pd
import time

def cargar_datos_sqlite(csv_path, db_path, chunksize=10000):
    """Carga CSV a SQLite en chunks"""
    conn = sqlite3.connect(db_path)

    chunks = pd.read_csv(csv_path, chunksize=chunksize)

    for i, chunk in enumerate(chunks):
        chunk.to_sql('trips', conn, if_exists='append', index=False)
        print(f"Chunk {i+1} cargado ({len(chunk)} registros)")

    conn.close()
    print("Carga completa!")

def crear_indices(db_path):
    """Crea indices para optimizacion"""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    print("Creando indices...")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_pickup ON trips(tpep_pickup_datetime)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_payment ON trips(payment_type)")

    conn.commit()
    conn.close()
    print("Indices creados!")

def analizar_ingresos_por_hora(db_path):
    """Query: Ingresos promedio por hora"""
    conn = sqlite3.connect(db_path)

    query = """
        SELECT
            strftime('%H', tpep_pickup_datetime) as hora,
            ROUND(AVG(total_amount), 2) as promedio_ingreso,
            COUNT(*) as num_viajes
        FROM trips
        GROUP BY hora
        ORDER BY hora
    """

    df = pd.read_sql_query(query, conn)
    conn.close()

    return df

if __name__ == "__main__":
    # Rutas
    csv_path = "datos/muestra_taxi.csv"
    db_path = "datos/taxi.db"

    # 1. Cargar datos
    print("=== CARGANDO DATOS ===")
    start = time.time()
    cargar_datos_sqlite(csv_path, db_path)
    print(f"Tiempo: {time.time() - start:.2f} segundos\n")

    # 2. Crear indices
    print("=== CREANDO INDICES ===")
    crear_indices(db_path)
    print()

    # 3. Analisis
    print("=== ANALISIS: INGRESOS POR HORA ===")
    df = analizar_ingresos_por_hora(db_path)
    print(df)

    # Exportar
    df.to_csv("resultados/ingresos_por_hora.csv", index=False)
    print("\nResultados exportados!")

Reflection Questions¶

Answer in your ANALISIS.md:

Performance:
- How long did it take to load 100,000 records?
- How much did the indexes improve query time?
Insights:
- At what time of day do taxis earn the most?
- What is the most common payment method?
- What pattern do you observe in the data?
Improvements:
- How would you further optimize the loading?
- What other analyses could be done?
- What limitations does SQLite have for this dataset?

Additional Resources¶

Documentation¶

Tutorials¶

Common Issues¶

Error: MemoryError when loading CSV

Cause: Trying to load the entire file at once.

Solution: Use chunks:

chunks = pd.read_csv(csv_path, chunksize=10000)

Very slow query

Cause: Missing index on queried columns.

Solution: Create indexes:

cursor.execute("CREATE INDEX idx_nombre ON tabla(columna)")

Error: database is locked

Cause: Another process has the database open.

Solution: - Close all connections: conn.close() - Make sure the database is not open in another program

The Big Picture: Data Ecosystems¶

SQLite is just the first step. In this course you will explore complete database ecosystems:

SQL vs NoSQL¶

The Data Schism: Relational vs NoSQL

Key-Value and Columnar Models¶

Speed and Scale: Redis, DynamoDB, Cassandra

The MongoDB Ecosystem¶

Document Flexibility: MongoDB Atlas

‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‍‌‌‌‍‌‌‌‌‍‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‍‌‌‌‌‍‌‌‌‍‌‌‌‌---

Next Steps¶

Once you have completed this exercise:

Exercise 02: Data Cleaning - Next exercise
Submission Guide - How to submit your work
Roadmap - See all exercises