Content for Chapter 1, Big Data Technologies
This chapter takes students from familiar single-machine data analysis (Python + SQL) to distributed processing of multi-gigabyte datasets, using a single running example (the fil rouge): NYC Yellow Taxi trip records (2019–2023). Each new technology is introduced when the previous one hits its limits, so students experience the wall before they get the tool.
Learning objectives
By the end of this chapter, students should be able to:
- Explain the technical reasons (memory, I/O, CPU) why single-machine data analysis breaks down on large datasets, and describe the core ideas of distributed storage and computation.
- Operate a 3-node Hadoop + Spark cluster (HDFS, YARN, Spark on YARN) provided as a Docker image.
- Design a MapReduce job from a high-level analytical question, and implement it in Python with the MrJob library.
- Process multi-gigabyte datasets with Spark — using both the DataFrame API and Spark SQL — and reason about partitioning, lazy evaluation and shuffling.
- Compare the three paradigms (pandas, MapReduce, Spark) and choose the appropriate tool for a given problem and scale.
- Conduct a small data analysis project end-to-end: pick a research question, design and implement a Spark pipeline on real-world taxi data, write a report and defend the work orally.
Prerequisites
| Topic | Expected level |
| Python | Comfortable with pandas, numpy, list/dict manipulations, virtual environments. |
| SQL | Comfortable writing SELECT … GROUP BY … JOIN queries. |
| Command line | Comfortable in a Unix shell (cd, ls, pipes, redirection). |
| Docker Desktop | Installed on the student's own machine before session S4 (free download, ~1 GB). |
| Disk space | ≈ 20 GB free for the Docker image + NYC Taxi data + HDFS volumes. |
Architecture of the chapter
| Module | Sessions | Duration | Theme |
| A — From single-machine to scale | S1–S3 | 6 h | Big Data context, pandas refresher, hitting the wall |
| B — Distributed storage and MapReduce | S4–S7 | 8 h | HDFS, MapReduce paradigm, MrJob in practice |
| C — Spark | S8–S9 | 4 h | RDDs, DataFrames, Spark SQL |
| D — Group project | S10–S11 | 4 h | Supervised project work |
| Mid-term exam | week 9 | 1 h | Covers S1–S7 (pre-Spark) |
| Final restitution | week 12 | 2 h | Project defenses, 2 parallel juries |
| Total | | 25 h | |
Sessions are 2 hours each. The fil rouge — NYC Yellow Taxi data — appears from S1 to S11.
Module A — From single-machine to scale
S1 — What is “Big Data”? (Lecture, 2 h)
- Type Lecture, no lab.
- Objectives
- Define Big Data through the 5Vs (Volume, Velocity, Variety, Veracity, Value).
- Map the modern Big Data ecosystem (storage, processing, NoSQL, cloud).
- Identify when Big Data tools are the right answer — and when they are overkill.
- Discover the NYC Yellow Taxi dataset that will serve as fil rouge.
- Outline
- Brief history: from BI/data warehouses to distributed systems.
- The 5Vs, with examples.
- Ecosystem map: HDFS, MapReduce, Spark, NoSQL, cloud data lakes.
- “Do I really need Big Data?” — decision flowchart.
- Tour of the NYC Taxi & Limousine Commission (TLC) trip-record dataset.
- Resources slides outline slides/session_01.md.
- No deliverable.
S2 — Pandas refresher on NYC Taxi (Lab, 2 h)
- Type Lab, students work on their own machine.
- Objectives
- Reactivate pandas reflexes on a real-world dataset.
- Read parquet files, type-check columns, handle missing values.
- Answer five analytical questions with idiomatic pandas.
- Outline
- Download a single month of yellow taxi data (~150 MB, ~3 M rows).
- Inspect schema and data quality.
- Five guided questions:
- Average fare amount per hour of the day.
- Top 10 pickup zones by trip count.
- Tip percentage by payment type.
- Trip duration distribution; flag obvious outliers.
- Daily revenue trend.
- Free exploration (10 minutes).
- Resources lab materials TP_BigData_English/lab_02_pandas/, slides outline slides/session_02.md.
- Deliverable Jupyter notebook submitted on the LMS by end of session.
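As an illustration of the first guided question, the sketch below computes the average fare per hour of the day on a tiny hand-made DataFrame. The column names tpep_pickup_datetime and fare_amount follow the TLC yellow-taxi schema; the four rows are invented for the example, and in the lab the frame would instead be loaded with pd.read_parquet from the downloaded month.

```python
import pandas as pd

# Invented sample rows; in the lab this frame would come from pd.read_parquet(...).
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            [
                "2019-01-01 08:05:00",
                "2019-01-01 08:40:00",
                "2019-01-01 17:15:00",
                "2019-01-02 17:50:00",
            ]
        ),
        "fare_amount": [10.0, 14.0, 20.0, 30.0],
    }
)

# Derive the hour of day, then average the fare within each hour.
avg_fare_per_hour = (
    trips.assign(pickup_hour=trips["tpep_pickup_datetime"].dt.hour)
    .groupby("pickup_hour")["fare_amount"]
    .mean()
)

print(avg_fare_per_hour)
```

The same assign / groupby / aggregate shape carries over to most of the other guided questions; only the derived column and the aggregation change.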
S3 — Hitting the wall (Lecture + Lab, 2 h)
- Type Mixed (45 min lecture, 75 min lab).
- Objectives
- Experience first-hand the failure modes of single-machine analytics.
- Profile memory consumption and identify bottlenecks.
- Understand sampling, chunking and partitioning as first-step mitigations.
- Build the intuition for distributed computation.
- Outline
- Lecture: vertical vs horizontal scaling; memory hierarchy; I/O is the bottleneck.
- Lab: try to load one full year of yellow taxi data (~36 M rows, ~2 GB) into pandas — observe the crash on most laptops.
- Mitigations: chunksize, sampling, column pruning, dtype downcasting.
- Bridge to next module: “what if storage and compute were spread across machines?”
- Resources lab materials TP_BigData_English/lab_03_limits/, slides outline slides/session_03.md.
- Deliverable Short markdown note (½ page) describing the bottleneck observed and one mitigation tried.
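The chunking mitigation can be sketched without any library: instead of holding every row in memory, only a running sum and a running count are kept while the data arrives one chunk at a time. The helper names iter_chunks and chunked_mean and the fare values are hypothetical; in practice the chunks would come from a chunked reader, for example the iterator pandas returns when read_csv is called with chunksize.

```python
from typing import Iterable, Iterator, List


def iter_chunks(values: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive slices of at most chunk_size items (stand-in for a chunked reader)."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunked_mean(chunks: Iterable[List[float]]) -> float:
    """Compute a mean while keeping only a running sum and count in memory."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


fares = [10.0, 14.0, 20.0, 30.0, 6.0]  # invented fares
mean_fare = chunked_mean(iter_chunks(fares, chunk_size=2))
print(mean_fare)
```

The pattern works for any aggregate that can be rebuilt from partial results (sum, count, min, max). Aggregates such as an exact median do not decompose this way, which is one reason chunking is only a first-step mitigation.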
Module B — Distributed storage and MapReduce
S4 — HDFS and the Docker cluster (Lecture + Lab, 2 h)
- Type Mixed (45 min lecture, 75 min hands-on cluster setup).
- Objectives
- Describe the HDFS architecture (NameNode, DataNodes, blocks, replication).
- Explain “schema-on-read” and why it matters at scale.
- Bring up the 3-node Hadoop/Spark cluster on the student's own laptop.
- Push and inspect data in HDFS via the CLI and the NameNode UI.
- Outline
- Lecture: distributed file systems, GFS heritage, HDFS design choices, fault tolerance.
- Hands-on: docker compose up -d, docker compose exec hadoop-master bash, check-cluster.sh.
- HDFS tour: hadoop fs -mkdir, -put, -ls, -cat; navigate the NameNode UI on http://localhost:9870.
- Push one month of yellow taxi data into HDFS, observe block distribution across the 2 DataNodes.
- Resources lab materials TP_BigData_English/lab_04_hdfs/, Docker repo, slides outline slides/session_04.md.
- Deliverable Screenshot of the NameNode UI showing taxi data correctly distributed across both DataNodes.
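To anticipate what the NameNode UI should show, a rough block count can be worked out by hand. The sketch below assumes the HDFS default block size of 128 MB and a replication factor of 2; the latter is an assumption chosen to match the 2 DataNodes, and the cluster's actual dfs.replication setting may differ. The function name hdfs_block_layout is hypothetical, and the ~150 MB figure is the monthly file size quoted for S2.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size (dfs.blocksize)
REPLICATION = 2      # assumption: one replica per DataNode; check the cluster config


def hdfs_block_layout(file_size_mb: float) -> dict:
    """Return the number of blocks, stored replicas and raw storage for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION,
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


layout = hdfs_block_layout(150)
print(layout)
```

Under these assumptions the monthly file splits into one full 128 MB block and one partial block of about 22 MB. HDFS does not pad the last block, so raw storage is the file size times the replication factor rather than the block count times the block size.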
S5 — The MapReduce paradigm (Lecture, 2 h)
- Type Lecture + pen-and-paper exercises (no machine).
- Objectives
- Explain the MapReduce computational model: map, shuffle, reduce.
- Decompose an analytical question into map and reduce functions.
- Recognise problems that fit MapReduce naturally and those that don’t.
- Outline
- The original Google MapReduce paper — key ideas.
- WordCount as the canonical example.
- Decomposition exercises (pen-and-paper): “average fare per pickup zone”, “top 10 zones by tip percentage”, “trip count by hour-of-day”.
- Limits: iterative algorithms, joins, the cost of shuffling.
- Bridge to Spark.
- Resources slides outline slides/session_05.md, exercises TP_BigData_English/exercises_05_mapreduce/.
- Deliverable Decomposition sheet (provided as a worksheet) handed in at end of session.
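The decomposition for “average fare per pickup zone” can be simulated in plain Python to check a pen-and-paper answer. This is a single-process sketch of the map, shuffle and reduce phases, not a Hadoop job; the records and the function names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Invented records: (pickup_zone, fare_amount).
TRIPS = [("A", 10.0), ("B", 8.0), ("A", 20.0), ("B", 12.0), ("C", 5.0)]

Partial = Tuple[float, int]  # (sum of fares, number of trips)


def mapper(record: Tuple[str, float]) -> Iterator[Tuple[str, Partial]]:
    """Emit (zone, (fare, 1)) so that sums and counts can be combined later."""
    zone, fare = record
    yield zone, (fare, 1)


def shuffle(pairs: Iterable[Tuple[str, Partial]]) -> Dict[str, List[Partial]]:
    """Group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[Partial]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(zone: str, values: List[Partial]) -> Iterator[Tuple[str, float]]:
    """Add up partial sums and counts, then divide once at the end."""
    total = sum(fare for fare, _ in values)
    count = sum(n for _, n in values)
    yield zone, total / count


mapped = [pair for record in TRIPS for pair in mapper(record)]
grouped = shuffle(mapped)
averages = dict(
    result
    for zone, values in sorted(grouped.items())
    for result in reducer(zone, values)
)
print(averages)
```

Emitting (fare, 1) rather than the bare fare is a deliberate choice: sums and counts can be merged in any order, whereas an average of averages is wrong when groups have different sizes.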
S6 — MrJob in practice (Lab, 2 h)
- Type Lab.
- Objectives
- Write and run a first MrJob job (WordCount).
- Understand the MrJob workflow: local run, Hadoop streaming, debugging.
- Implement a simple aggregation MrJob on a small taxi sample.
- Outline
- MrJob installation (already in the Docker image).
- WordCount on Bram Stoker’s Dracula, run locally then on Hadoop.
- First taxi job: count trips per pickup zone, on a 1-month sample.
- Multi-step MrJob: average fare per pickup zone (two reducers chained).
- Debugging tour: counters, logs, YARN UI.
- Resources lab materials TP_BigData_English/lab_06_mrjob_basics/, slides outline slides/session_06.md.
- Deliverable Two MrJob scripts (single-step and multi-step) committed to the student’s repo.
S7 — MrJob at scale on NYC Taxi (Lab, 2 h)
- Type Lab.
- Objectives
- Run MrJob on one full year of taxi data via HDFS.
- Reproduce in MapReduce the questions answered in pandas in S2.
- Compare the two approaches: code length, runtime, mental model.
- Outline
- Push 1 year of data into HDFS (~2 GB).
- Implement the five S2 questions in MrJob.
- Watch jobs in YARN UI; identify slow stages.
- Comparison table (pandas vs MapReduce) filled in by each student.
- Resources lab materials TP_BigData_English/lab_07_mrjob_taxi/, slides outline slides/session_07.md.
- Deliverable Comparison table + repo containing the 5 MrJob scripts.
Module C — Spark
S8 — Spark introduction (Lecture + Lab, 2 h)
- Type Mixed (45 min lecture, 75 min lab).
- Objectives
- Explain why Spark replaced MapReduce as the default tool: in-memory, lazy evaluation, richer API.
- Describe the SparkContext / SparkSession, RDDs and DataFrames.
- Run a first PySpark script on YARN.
- Outline
- Lecture: Spark architecture, driver vs executors, lazy evaluation, DAG.
- RDD vs DataFrame API.
- Hands-on: open a pyspark shell on the master, do basic transformations on a small RDD.
- Submit a first script with spark-submit --master yarn.
- Resources lab materials TP_BigData_English/lab_08_spark_intro/, slides outline slides/session_08.md.
- Deliverable First PySpark script counting trips per pickup zone, submitted to YARN.
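Lazy evaluation is easier to grasp with a toy model. The class below is a deliberately simplified stand-in written in plain Python, not the Spark API: the name LazyPipeline and its methods are hypothetical. It only shows the core idea that transformations record a plan and nothing runs until an action (here collect) is called. Spark additionally optimises the plan and distributes the work across executors.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded steps over the source and materialise a list.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # side effect used to observe when the function really runs
    return 2 * x


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # no element has been processed at this point
result = pipeline.collect()
print(calls_before_action, result)
```

Building the pipeline calls double zero times; only collect triggers the four calls. In Spark the same distinction separates transformations from actions.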
S9 — Spark DataFrames and Spark SQL (Lab, 2 h)
- Type Lab.
- Objectives
- Use the Spark DataFrame API and Spark SQL fluently.
- Process multi-year taxi data (3 years, ~6 GB).
- Build a side-by-side comparison: pandas vs MapReduce vs Spark for the same five questions.
- Outline
- DataFrames: read.parquet, filter, groupBy, agg, join, withColumn.
- The same queries expressed in Spark SQL via spark.sql("SELECT ... GROUP BY ...").
- Re-implement the 5 questions of S2/S7 on 3 years of data.
- Performance discussion: caching, partitioning, broadcast joins.
- Final comparison table: lines of code, runtime, ease of debugging.
- Resources lab materials TP_BigData_English/lab_09_spark_sql/, slides outline slides/session_09.md.
- Deliverable Repo with the Spark scripts + comparison table (pandas/MapReduce/Spark).
Module D — Group project
S10 — Project kickoff (Supervised, 2 h)
- Type Supervised group work.
- Objectives
- Form groups of 3 students (10 groups for a class of 30).
- Choose a research question (from the catalogue or a free topic, subject to teacher validation).
- Scope the question, design the Spark pipeline, identify required data subsets.
- Set up the project repo (Git, README, environment).
- Outline
- Group formation (15 min).
- Topic selection — multiple groups may pick the same topic (5 min).
- Scoping workshop with the teacher (30 min).
- Pipeline design and repo bootstrap (70 min).
- Deliverable One-page project proposal validated by the teacher at end of session.
S11 — Project work (Supervised, 2 h)
- Type Supervised group work.
- Objectives
- Implement the Spark pipeline.
- Produce preliminary results.
- Prepare the report skeleton and the defense slides.
- Outline Free, with the teacher rotating between groups.
- Deliverable Working pipeline + draft report + draft slides at end of session. Final deliverables are due 1 week before the restitution.
Project topic catalogue
Twelve topics are proposed. Multiple groups may choose the same topic. A group may also propose its own topic, subject to teacher approval at S10.
Theme A — Descriptive analytics (accessible)
| # | Topic | Question | Difficulty | Key technical angle |
| A1 | Demand patterns | When and where is demand highest? Hour × day × zone heatmaps. | ★ | Multi-dim groupBy, visualisation |
| A2 | Tip behaviour | What drives tip %? Card vs cash, route, time, distance, zone? | ★★ | Aggregation + statistical comparison |
| A3 | COVID impact | How did demand collapse in 2020 and recover through 2023? | ★ | Time-series aggregation, before/after |
| A4 | Yellow vs Green vs FHV | How do the 3 fleets differ in service patterns and zones served? | ★★ | Multi-dataset join (3 TLC sources) |
| # | Topic | Question | Difficulty | Key technical angle |
| B1 | Driver income optimisation | Where and when should a driver work to maximise hourly revenue? | ★★ | Multi-dim aggregation + ranking |
| B2 | Origin-destination flows | Top NYC commute flows; OD matrix; evolution over years. | ★★ | Pairwise aggregation, matrix output |
Theme C — Predictive ML at scale (advanced)
| # | Topic | Question | Difficulty | Key technical angle |
| C1 | Tip prediction | Predict tip % from pickup features. | ★★★ | Spark MLlib pipeline, feature engineering |
| C2 | Trip duration prediction | Predict ETA given zone-pair, hour, weekday. | ★★★ | Spark MLlib, regression + evaluation |
| # | Topic | Question | Difficulty | Key technical angle |
| D1 | Anomaly detection | Suspicious trips — impossible distances, weird fares, route detours. | ★★ | Statistical outlier rules, zone distance matrix |
Theme E — External joins (advanced, bonus prestige)
| # | Topic | Question | Difficulty | Key technical angle |
| E1 | Weather impact | Does weather (rain, snow, heat) affect demand, fare, tips? | ★★★ | Join with NOAA daily weather data |
| E2 | Geospatial hotspots | Hotspot evolution over time, with NYC taxi zone shapefiles. | ★★★ | Spatial join, geographic visualisation |
| E3 | Special events impact | Quantify impact of NYE, parades, sports finals on taxi flows. | ★★★ | Manually-curated event list, temporal join |
Project deliverables
- Written report (PDF) — ~10 pages, English. Structure: research question, data, methodology, Spark pipeline, results, discussion, conclusion. Due 1 week before the restitution.
- Code repository (Git) — Spark scripts, README explaining how to reproduce, requirements file. Same deadline.
- Oral defense — 10 minutes presentation + 5 minutes Q/A, during the final restitution session.
Project grading rubric (40 % of the chapter grade)
| Criterion | Weight |
| Relevance and depth of the analysis | 30 % |
| Quality of the Spark code (readability, idioms, performance) | 25 % |
| Quality of the report (structure, visualisations, conclusions) | 25 % |
| Quality of the defense (clarity, answers to questions) | 20 % |
Mid-term exam
- When week 9, 1 hour.
- Format on paper, no machine, no notes.
- Coverage sessions S1 to S7 (Big Data context, HDFS, MapReduce, MrJob). Spark is not examined.
- Weight 20 % of the chapter grade.
Sample question types: design a MapReduce decomposition for a given analytical question; explain HDFS replication on a diagram; reason about a memory bottleneck given a pandas workflow.
Final restitution
- When week 12, 2 hours, two parallel sessions (5 groups per jury).
- Format 10 min presentation + 5 min Q/A per group.
- Audience teaching team + the other 4 groups in the same session.
Total chapter assessment:
| Component | Weight |
| Mid-term exam | 20 % |
| Lab report (Module B/C) | 20 % |
| Group project (report + code + defense) | 40 % |
| Chapter 2 (NoSQL lab report), assessed separately | 20 % |
Infrastructure and dataset
Docker cluster
- A 1-master + 2-slaves Hadoop 3.5 + Spark 4.1 cluster, packaged as a Docker image, is provided for the chapter.
- It runs identically on macOS, Linux and Windows (Docker Desktop or Docker Engine + WSL2).
- Memory profile is auto-detected (8 GB / 16 GB) at startup.
- Setup instructions, web UIs and troubleshooting are documented in the Docker repo’s README.
- Students must install Docker Desktop and pull the image before session S4.
NYC Yellow Taxi dataset
- Source NYC Taxi & Limousine Commission (TLC) — monthly Parquet files, free to use.
- Coverage used in this chapter January 2019 — December 2023 (5 years, ~150 M trips, ~15 GB).
- Subsets per session
- S2 (pandas refresher): 1 month, ~150 MB.
- S3 (hitting the wall): 1 year, ~2 GB.
- S4–S7 (HDFS, MrJob): 1 year, in HDFS.
- S8–S9 (Spark): 3 years, ~6 GB.
- S10–S11 (project): 5 years, ~15 GB.
- A helper script download_taxi_data.sh is provided in the Docker repo.
References
Books
- Hadoop: The Definitive Guide, Tom White, 4th edition, O’Reilly, 2015. Available at the Centrale Lyon library.
- Learning Spark, Jules Damji et al., 2nd edition, O’Reilly, 2020. Free PDF on the Databricks website.
- Spark: The Definitive Guide, Bill Chambers and Matei Zaharia, O’Reilly, 2018.
Documentation
Videos