# Exercise 01: Introduction to SQLite
Learn how to load and query data using SQLite and Pandas.
## General Information
| Field | Value |
|---|---|
| Level | 🟢 Basic |
| Estimated time | 2-3 hours |
| Technologies | Python, SQLite, Pandas |
| Dataset | NYC Taxi (10MB sample) |
| Prerequisites | Basic Python, basic SQL knowledge |
## Learning Objectives
Upon completing this exercise, you will be able to:
- Load large CSV files into SQLite using chunks
- Create and manage SQLite databases
- Execute basic and advanced SQL queries
- Optimize performance with indexes
- Export query results to CSV
## The Problem
You have a CSV file with 100,000 records of NYC taxi trips. If you try to load it all into memory with Pandas, your computer may run out of memory.
Your mission: Load this data into a SQLite database efficiently and perform analysis.
## Dataset

### NYC Taxi Trip Records

- File: `datos/muestra_taxi.csv`
- Size: ~10 MB
- Records: ~100,000 trips
- Period: January 2021
### Data Structure

Columns:

```text
tpep_pickup_datetime   # Trip start date/time
tpep_dropoff_datetime  # Trip end date/time
passenger_count        # Number of passengers
trip_distance          # Distance in miles
pickup_longitude       # Origin longitude
pickup_latitude        # Origin latitude
dropoff_longitude      # Destination longitude
dropoff_latitude       # Destination latitude
payment_type           # Payment type (1=Credit, 2=Cash, ...)
fare_amount            # Base fare
tip_amount             # Tip
total_amount           # Total paid
```
## Tasks

### Task 1: Load CSV into SQLite in Chunks
Objective: Load the complete CSV into SQLite without running out of memory.
Requirements:
- Use `pandas.read_csv()` with the `chunksize` parameter
- Process the CSV in chunks of 10,000 records
- Insert each chunk into the `trips` table
- Display loading progress
Initial code example:
```python
import sqlite3

import pandas as pd


def cargar_datos_sqlite(csv_path, db_path, chunksize=10000):
    """
    Load a large CSV into SQLite in chunks.

    Args:
        csv_path: Path to the CSV file
        db_path: Path to the SQLite database
        chunksize: Number of records per chunk
    """
    conn = sqlite3.connect(db_path)
    # TODO: Implement chunked loading
    # Hint: use pd.read_csv with chunksize
    # and iterate over the chunks
    conn.close()
```
### Task 2: Create Indexes
Objective: Optimize queries by adding indexes to frequently queried columns.
Requirements:
- Create an index on `tpep_pickup_datetime`
- Create an index on `payment_type`
- Measure query time before and after creating indexes
Example:
```python
import sqlite3


def crear_indices(db_path):
    """Create indexes to optimize queries."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # TODO: Create indexes
    # Example:
    # cursor.execute("CREATE INDEX idx_pickup ON trips(tpep_pickup_datetime)")
    conn.commit()
    conn.close()
```
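To measure query time before and after creating the indexes, `time.perf_counter()` around a repeated query works well. A minimal sketch, using an in-memory database with synthetic rows so it runs standalone (in the exercise you would point it at `datos/taxi.db` instead):

```python
import sqlite3
import time

# Stand-in for datos/taxi.db: a small synthetic trips table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (tpep_pickup_datetime TEXT, payment_type INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(f"2021-01-01 {h:02d}:00:00", h % 4 + 1, 10.0 + h) for h in range(24) for _ in range(500)],
)

query = "SELECT COUNT(*) FROM trips WHERE payment_type = 1"


def time_query(conn, query, repeats=50):
    """Return the average wall-clock time of a query, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(query).fetchall()
    return (time.perf_counter() - start) / repeats


before = time_query(conn, query)
conn.execute("CREATE INDEX IF NOT EXISTS idx_payment ON trips(payment_type)")
after = time_query(conn, query)
print(f"before index: {before:.6f}s, after index: {after:.6f}s")
```

On a table this small the difference may be negligible; with the full 100,000-row dataset the gap should be measurable.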
### Task 3: Analysis Queries

Objective: Extract insights from the data using SQL.

#### Query 1: Average Revenue by Hour of Day
Calculate the average revenue for each hour of the day.
Expected SQL:
```sql
SELECT
    strftime('%H', tpep_pickup_datetime) AS hora,
    AVG(total_amount) AS promedio_ingreso,
    COUNT(*) AS num_viajes
FROM trips
GROUP BY hora
ORDER BY hora;
```
Expected result: one row per hour (00 to 23) with the average revenue and the number of trips for that hour.
#### Query 2: Payment Method Distribution
Calculate the percentage of each payment method.
```sql
SELECT
    payment_type,
    COUNT(*) AS total,
    ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM trips), 2) AS porcentaje
FROM trips
GROUP BY payment_type
ORDER BY total DESC;
```
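When presenting the distribution, the numeric codes can be mapped to readable labels in Pandas. A sketch on toy data; the labels for codes 3 and 4 are an assumption based on the NYC TLC data dictionary, so verify them against your copy of the dataset:

```python
import pandas as pd

# Toy stand-in for the query result; the real payment_type column
# comes from the trips table (1=Credit, 2=Cash, ...).
df = pd.DataFrame({"payment_type": [1, 1, 2, 1, 2, 3]})

# Codes 3 and 4 are assumed from the TLC data dictionary.
labels = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute"}
df["payment_label"] = df["payment_type"].map(labels)

# Percentage of each payment method, rounded to two decimals.
dist = (df["payment_label"].value_counts(normalize=True) * 100).round(2)
print(dist)
```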
#### Query 3: Top 10 Most Profitable Routes
Find the routes (pickup → dropoff) with the highest average revenue.
Challenge
This query requires grouping by rounded coordinates. Think about how to do it.
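One possible approach (a sketch, not the only answer): snap each coordinate to a grid by rounding to two decimals, which is roughly a 1 km cell at NYC latitude, so nearby pickups and dropoffs fall into the same group. Shown here on a few synthetic rows:

```python
import sqlite3

# Synthetic stand-in for datos/taxi.db with the same column names.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trips (
    pickup_latitude REAL, pickup_longitude REAL,
    dropoff_latitude REAL, dropoff_longitude REAL,
    total_amount REAL)""")
conn.executemany("INSERT INTO trips VALUES (?,?,?,?,?)", [
    (40.75012, -73.99341, 40.64151, -73.77972, 52.0),  # Midtown -> airport
    (40.75088, -73.99287, 40.64178, -73.78011, 58.0),  # same route, slight offset
    (40.73061, -73.93524, 40.74488, -73.94817, 12.5),  # unrelated trip
])

query = """
SELECT
    ROUND(pickup_latitude, 2)   AS origin_lat,
    ROUND(pickup_longitude, 2)  AS origin_lon,
    ROUND(dropoff_latitude, 2)  AS dest_lat,
    ROUND(dropoff_longitude, 2) AS dest_lon,
    ROUND(AVG(total_amount), 2) AS avg_revenue,
    COUNT(*)                    AS num_trips
FROM trips
GROUP BY origin_lat, origin_lon, dest_lat, dest_lon
ORDER BY avg_revenue DESC
LIMIT 10
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The two near-identical trips collapse into one route with an averaged revenue, while the unrelated trip stays its own group.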
### Task 4: Export Results
Objective: Save your analysis results to CSV files.
```python
import sqlite3

import pandas as pd


def exportar_resultados(db_path, query, output_path):
    """
    Run a query and export the result to CSV.

    Args:
        db_path: Path to the SQLite database
        query: SQL query to run
        output_path: Path of the output CSV
    """
    conn = sqlite3.connect(db_path)
    # TODO: Run the query and export
    # Hint: use pd.read_sql_query() and df.to_csv()
    conn.close()
```
## Evaluation Criteria

### Functionality (40 points)
- Complete data loading without errors (10 pts)
- Indexes created correctly (10 pts)
- Queries execute and return correct results (15 pts)
- Export works (5 pts)
### Clean Code (30 points)
- Well-documented functions (10 pts)
- Readable and organized code (10 pts)
- Error handling (10 pts)
### Performance (20 points)
- Efficient loading with chunks (10 pts)
- Indexes improve performance measurably (10 pts)
### Analysis (10 points)
- Interpretation of results (5 pts)
- Additional insights (5 pts)
## Deliverables
You must submit by pushing to your fork (git push):
- Python code: `01_cargar_sqlite.py`
- Database: `datos/taxi.db` (DO NOT upload to GitHub; it is too large)
- Results: a `resultados/` folder with the exported CSVs
- Analysis: `ANALISIS.md` with your findings
### Folder structure

```text
entregas/01_sqlite/tu_apellido_nombre/
├── 01_cargar_sqlite.py
├── resultados/
│   ├── ingresos_por_hora.csv
│   ├── distribucion_pagos.csv
│   └── top_rutas.csv
└── ANALISIS.md
```
## Partial Solution Example
Sample Code
```python
import os
import sqlite3
import time

import pandas as pd


def cargar_datos_sqlite(csv_path, db_path, chunksize=10000):
    """Load a CSV into SQLite in chunks."""
    conn = sqlite3.connect(db_path)
    chunks = pd.read_csv(csv_path, chunksize=chunksize)
    for i, chunk in enumerate(chunks):
        chunk.to_sql('trips', conn, if_exists='append', index=False)
        print(f"Chunk {i+1} loaded ({len(chunk)} records)")
    conn.close()
    print("Load complete!")


def crear_indices(db_path):
    """Create indexes for optimization."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    print("Creating indexes...")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_pickup ON trips(tpep_pickup_datetime)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_payment ON trips(payment_type)")
    conn.commit()
    conn.close()
    print("Indexes created!")


def analizar_ingresos_por_hora(db_path):
    """Query: average revenue per hour."""
    conn = sqlite3.connect(db_path)
    query = """
        SELECT
            strftime('%H', tpep_pickup_datetime) AS hora,
            ROUND(AVG(total_amount), 2) AS promedio_ingreso,
            COUNT(*) AS num_viajes
        FROM trips
        GROUP BY hora
        ORDER BY hora
    """
    df = pd.read_sql_query(query, conn)
    conn.close()
    return df


if __name__ == "__main__":
    # Paths
    csv_path = "datos/muestra_taxi.csv"
    db_path = "datos/taxi.db"

    # 1. Load data
    print("=== LOADING DATA ===")
    start = time.time()
    cargar_datos_sqlite(csv_path, db_path)
    print(f"Time: {time.time() - start:.2f} seconds\n")

    # 2. Create indexes
    print("=== CREATING INDEXES ===")
    crear_indices(db_path)
    print()

    # 3. Analysis
    print("=== ANALYSIS: REVENUE PER HOUR ===")
    df = analizar_ingresos_por_hora(db_path)
    print(df)

    # Export (create the output folder first so to_csv does not fail)
    os.makedirs("resultados", exist_ok=True)
    df.to_csv("resultados/ingresos_por_hora.csv", index=False)
    print("\nResults exported!")
```
## Reflection Questions
Answer in your ANALISIS.md:
1. Performance:
    - How long did it take to load 100,000 records?
    - How much did the indexes improve query time?
2. Insights:
    - At what time of day do taxis earn the most?
    - What is the most common payment method?
    - What pattern do you observe in the data?
3. Improvements:
    - How would you further optimize the loading?
    - What other analyses could be done?
    - What limitations does SQLite have for this dataset?
## Common Issues

**Error: `MemoryError` when loading the CSV**

Cause: Trying to load the entire file at once.

Solution: Use chunks:
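A minimal sketch of chunked reading (shown over an in-memory buffer so it runs standalone; in the exercise you would pass `datos/muestra_taxi.csv` instead):

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_buffer = io.StringIO("total_amount\n" + "\n".join(str(i) for i in range(25)))

# read_csv with chunksize yields DataFrames of at most `chunksize` rows,
# so only one chunk is in memory at a time.
total_rows = 0
for chunk in pd.read_csv(csv_buffer, chunksize=10):
    total_rows += len(chunk)  # process / insert each chunk here
print(total_rows)
```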
**Very slow query**

Cause: Missing indexes on the queried columns.

Solution: Create indexes:
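For example (using `:memory:` here so it runs standalone; use `datos/taxi.db` in the exercise):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "datos/taxi.db" in the exercise
conn.execute(
    "CREATE TABLE IF NOT EXISTS trips (tpep_pickup_datetime TEXT, payment_type INTEGER)"
)
# IF NOT EXISTS makes the script safe to re-run.
conn.execute("CREATE INDEX IF NOT EXISTS idx_pickup ON trips(tpep_pickup_datetime)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_payment ON trips(payment_type)")
conn.commit()

# Verify the indexes exist (PRAGMA index_list rows carry the name in column 1).
names = [row[1] for row in conn.execute("PRAGMA index_list('trips')")]
print(names)
```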
**Error: `database is locked`**

Cause: Another process has the database open.

Solution:

- Close all connections: `conn.close()`
- Make sure the database is not open in another program
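One way to guarantee the connection is always released is `contextlib.closing`, which closes it even if a query raises. A sketch (shown on `:memory:`; the same pattern applies to `datos/taxi.db`):

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, success or error.
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("CREATE TABLE IF NOT EXISTS trips (total_amount REAL)")

# After the block the connection is closed and its lock is released.
try:
    conn.execute("SELECT 1")
    connection_closed = False
except sqlite3.ProgrammingError:
    connection_closed = True
print(connection_closed)
```

Note that `with conn:` alone only manages the transaction; it does not close the connection, which is why `closing()` is wrapped around it.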
## The Big Picture: Data Ecosystems
SQLite is just the first step. In this course you will explore complete database ecosystems:
- SQL vs NoSQL
- Key-Value and Columnar Models
- The MongoDB Ecosystem
---
## Next Steps
Once you have completed this exercise:
- Exercise 02: Data Cleaning - Next exercise
- Submission Guide - How to submit your work
- Roadmap - See all exercises