Content for Chapter 1, Big Data Technologies

This chapter takes students from familiar single-machine data analysis (Python + SQL) to distributed processing of multi-gigabyte datasets, using a single fil rouge: NYC Yellow Taxi trip records (2019–2023). Each new technology is introduced when the previous one hits its limits — students experience the wall before they get the tool.

Table of contents

Learning objectives
Prerequisites
Architecture of the chapter
Module A — From single-machine to scale
Module B — Distributed storage and MapReduce
Module C — Spark
Module D — Group project
Mid-term exam
Final restitution
Infrastructure and dataset
References

Learning objectives¶

By the end of this chapter, students should be able to:

Explain the technical reasons (memory, I/O, CPU) why single-machine data analysis breaks down on large datasets, and describe the core ideas of distributed storage and computation.
Operate a 3-node Hadoop + Spark cluster (HDFS, YARN, Spark on YARN) provided as a Docker image.
Design a MapReduce job from a high-level analytical question, and implement it in Python with the MrJob library.
Process multi-gigabyte datasets with Spark — using both the DataFrame API and Spark SQL — and reason about partitioning, lazy evaluation and shuffling.
Compare the three paradigms (pandas, MapReduce, Spark) and choose the appropriate tool for a given problem and scale.
Conduct a small data analysis project end-to-end: pick a research question, design and implement a Spark pipeline on real-world taxi data, write a report and defend the work orally.

Prerequisites¶

Topic	Expected level
Python	Comfortable with `pandas`, `numpy`, list/dict manipulations, virtual environments.
SQL	Comfortable writing `SELECT … GROUP BY … JOIN` queries.
Command line	Comfortable in a Unix shell (cd, ls, pipes, redirection).
Docker Desktop	Installed on the personal machine before session S4 (free download, ~1 GB).
Disk space	≈ 20 GB free for the Docker image + NYC Taxi data + HDFS volumes.

Architecture of the chapter¶

Module	Sessions	Duration	Theme
A — From single-machine to scale	S1–S3	6 h	Big Data context, pandas refresher, hitting the wall
B — Distributed storage and MapReduce	S4–S7	8 h	HDFS, MapReduce paradigm, MrJob in practice
C — Spark	S8–S9	4 h	RDDs, DataFrames, Spark SQL
D — Group project	S10–S11	4 h	Supervised project work
Mid-term exam	week 9	1 h	Covers S1–S7 (pre-Spark)
Final restitution	week 12	2 h	Project defenses, 2 parallel jurys
Total		25 h

Sessions are 2 hours each. The fil rouge — NYC Yellow Taxi data — appears from S1 to S11.

Module A — From single-machine to scale¶

S1 — What is “Big Data”? (Lecture, 2 h)¶

Type Lecture, no lab.
Objectives
- Define Big Data through the 5Vs (Volume, Velocity, Variety, Veracity, Value).
- Map the modern Big Data ecosystem (storage, processing, NoSQL, cloud).
- Identify when Big Data tools are the right answer — and when they are overkill.
- Discover the NYC Yellow Taxi dataset that will serve as fil rouge.
Outline
1. Brief history: from BI/data warehouses to distributed systems.
2. The 5Vs, with examples.
3. Ecosystem map: HDFS, MapReduce, Spark, NoSQL, cloud data lakes.
4. “Do I really need Big Data?” — decision flowchart.
5. Tour of the NYC Taxi & Limousine Commission (TLC) trip-record dataset.
Resources slides outline slides/session_01.md.
No deliverable.

S2 — Pandas refresher on NYC Taxi (Lab, 2 h)¶

Type Lab, students work on their own machine.
Objectives
- Reactivate pandas reflexes on a real-world dataset.
- Read parquet files, type-check columns, handle missing values.
- Answer five analytical questions with idiomatic pandas.
Outline
1. Download a single month of yellow taxi data (~150 MB, ~3 M rows).
2. Inspect schema and data quality.
3. Five guided questions:
  - Average fare amount per hour of the day.
  - Top 10 pickup zones by trip count.
  - Tip percentage by payment type.
  - Trip duration distribution; flag obvious outliers.
  - Daily revenue trend.
4. Free exploration (10 minutes).
Resources TP TP_BigData_English/lab_02_pandas/, slides outline slides/session_02.md.
Deliverable Jupyter notebook submitted on the LMS by end of session.

S3 — Hitting the wall (Lecture + Lab, 2 h)¶

Type Mixed (45 min lecture, 75 min lab).
Objectives
- Experience first-hand the failure modes of single-machine analytics.
- Profile memory consumption and identify bottlenecks.
- Understand sampling, chunking and partitioning as first-step mitigations.
- Build the intuition for distributed computation.
Outline
1. Lecture: vertical vs horizontal scaling; memory hierarchy; I/O is the bottleneck.
2. Lab: try to load one full year of yellow taxi data (~36 M rows, ~2 GB) into pandas — observe the crash on most laptops.
3. Mitigations: chunksize, sampling, column pruning, dtype downcasting.
4. Bridge to next module: “what if storage and compute were spread across machines?”
Resources TP TP_BigData_English/lab_03_limits/, slides outline slides/session_03.md.
Deliverable Short markdown note (½ page) describing the bottleneck observed and one mitigation tried.

Module B — Distributed storage and MapReduce¶

S4 — HDFS and the Docker cluster (Lecture + Lab, 2 h)¶

Type Mixed (45 min lecture, 75 min hands-on cluster setup).
Objectives
- Describe the HDFS architecture (NameNode, DataNodes, blocks, replication).
- Explain “schema-on-read” and why it matters at scale.
- Bring up the 3-node Hadoop/Spark cluster on the personal laptop.
- Push and inspect data in HDFS via the CLI and the NameNode UI.
Outline
1. Lecture: distributed file systems, GFS heritage, HDFS design choices, fault tolerance.
2. Hands-on: docker compose up -d, docker compose exec hadoop-master bash, check-cluster.sh.
3. HDFS tour: hadoop fs -mkdir, -put, -ls, -cat; navigate the NameNode UI on http://localhost:9870.
4. Push one month of yellow taxi data into HDFS, observe block distribution across the 2 DataNodes.
Resources TP TP_BigData_English/lab_04_hdfs/, Docker repo, slides outline slides/session_04.md.
Deliverable Screenshot of the NameNode UI showing taxi data correctly distributed across both DataNodes.

S5 — The MapReduce paradigm (Lecture, 2 h)¶

Type Lecture + pen-and-paper exercises (no machine).
Objectives
- Explain the MapReduce computational model: map, shuffle, reduce.
- Decompose an analytical question into map and reduce functions.
- Recognise problems that fit MapReduce naturally and those that don’t.
Outline
1. The original Google MapReduce paper — key ideas.
2. WordCount as the canonical example.
3. Decomposition exercises (pen-and-paper): “average fare per pickup zone”, “top 10 zones by tip percentage”, “trip count by hour-of-day”.
4. Limits: iterative algorithms, joins, the cost of shuffling.
5. Bridge to Spark.
Resources slides outline slides/session_05.md, exercises TP_BigData_English/exercises_05_mapreduce/.
Deliverable Decomposition sheet (provided as a worksheet) handed in at end of session.

S6 — MrJob in practice (Lab, 2 h)¶

Type Lab.
Objectives
- Write and run a first MrJob job (WordCount).
- Understand the MrJob workflow: local run, Hadoop streaming, debugging.
- Implement a simple aggregation MrJob on a small taxi sample.
Outline
1. MrJob installation (already in the Docker image).
2. WordCount on Bram Stoker’s Dracula, run locally then on Hadoop.
3. First taxi job: count trips per pickup zone, on a 1-month sample.
4. Multi-step MrJob: average fare per pickup zone (two reducers chained).
5. Debugging tour: counters, logs, YARN UI.
Resources TP TP_BigData_English/lab_06_mrjob_basics/, slides outline slides/session_06.md.
Deliverable Two MrJob scripts (single-step and multi-step) committed to the student’s repo.

S7 — MrJob at scale on NYC Taxi (Lab, 2 h)¶

Type Lab.
Objectives
- Run MrJob on one full year of taxi data via HDFS.
- Reproduce in MapReduce the questions answered in pandas in S2.
- Compare the two approaches: code length, runtime, mental model.
Outline
1. Push 1 year of data into HDFS (~2 GB).
2. Implement the five S2 questions in MrJob.
3. Watch jobs in YARN UI; identify slow stages.
4. Comparison table (pandas vs MapReduce) filled in by each student.
Resources TP TP_BigData_English/lab_07_mrjob_taxi/, slides outline slides/session_07.md.
Deliverable Comparison table + repo containing the 5 MrJob scripts.

Module C — Spark¶

S8 — Spark introduction (Lecture + Lab, 2 h)¶

Type Mixed (45 min lecture, 75 min lab).
Objectives
- Explain why Spark replaced MapReduce as the default tool: in-memory, lazy evaluation, richer API.
- Describe the SparkContext / SparkSession, RDDs and DataFrames.
- Run a first PySpark script on YARN.
Outline
1. Lecture: Spark architecture, driver vs executors, lazy evaluation, DAG.
2. RDD vs DataFrame API.
3. Hands-on: open a pyspark shell on the master, do basic transformations on a small RDD.
4. Submit a first script with spark-submit --master yarn.
Resources TP TP_BigData_English/lab_08_spark_intro/, slides outline slides/session_08.md.
Deliverable First PySpark script counting trips per pickup zone, submitted to YARN.

S9 — Spark DataFrames and Spark SQL (Lab, 2 h)¶

Type Lab.
Objectives
- Use the Spark DataFrame API and Spark SQL fluently.
- Process multi-year taxi data (3–5 years, ~10 GB).
- Build a side-by-side comparison: pandas vs MapReduce vs Spark for the same five questions.
Outline
1. DataFrames: read.parquet, filter, groupBy, agg, join, withColumn.
2. Same as Spark SQL via spark.sql("SELECT ... GROUP BY ...").
3. Re-implement the 5 questions of S2/S7 on 3 years of data.
4. Performance discussion: caching, partitioning, broadcast joins.
5. Final comparison table: lines of code, runtime, ease of debugging.
Resources TP TP_BigData_English/lab_09_spark_sql/, slides outline slides/session_09.md.
Deliverable Repo with the Spark scripts + comparison table (pandas/MapReduce/Spark).

Module D — Group project¶

S10 — Project kickoff (Supervised, 2 h)¶

Type Supervised group work.
Objectives
- Form groups of 3 students (10 groups for a class of 30).
- Choose a research question (from the catalogue or a free topic, subject to teacher validation).
- Scope the question, design the Spark pipeline, identify required data subsets.
- Set up the project repo (Git, README, environment).
Outline
1. Group formation (15 min).
2. Topic selection — multiple groups may pick the same topic (5 min).
3. Scoping workshop with the teacher (30 min).
4. Pipeline design and repo bootstrap (70 min).
Deliverable One-page project proposal validated by the teacher at end of session.

S11 — Project work (Supervised, 2 h)¶

Type Supervised group work.
Objectives
- Implement the Spark pipeline.
- Produce preliminary results.
- Prepare the report skeleton and the defense slides.
Outline Free, with the teacher rotating between groups.
Deliverable Working pipeline + draft report + draft slides at end of session. Final deliverables are due 1 week before the restitution.

Project topic catalogue¶

Twelve topics are proposed. Multiple groups may choose the same topic. A group may also propose its own topic, subject to teacher approval at S10.

Theme A — Descriptive analytics (accessible)¶

#	Topic	Question	Difficulty	Key technical angle
A1	Demand patterns	When and where is demand highest? Hour × day × zone heatmaps.	★	Multi-dim `groupBy`, visualisation
A2	Tip behaviour	What drives tip %? Card vs cash, route, time, distance, zone?	★★	Aggregation + statistical comparison
A3	COVID impact	How did demand collapse in 2020 and recover through 2023?	★	Time-series aggregation, before/after
A4	Yellow vs Green vs FHV	How do the 3 fleets differ in service patterns and zones served?	★★	Multi-dataset join (3 TLC sources)

Theme B — Optimisation and decision support (intermediate)¶

#	Topic	Question	Difficulty	Key technical angle
B1	Driver income optimisation	Where and when should a driver work to maximise hourly revenue?	★★	Multi-dim aggregation + ranking
B2	Origin-destination flows	Top NYC commute flows; OD matrix; evolution over years.	★★	Pairwise aggregation, matrix output

Theme C — Predictive ML at scale (advanced)¶

#	Topic	Question	Difficulty	Key technical angle
C1	Tip prediction	Predict tip % from pickup features.	★★★	Spark MLlib pipeline, feature engineering
C2	Trip duration prediction	Predict ETA given zone-pair, hour, weekday.	★★★	Spark MLlib, regression + evaluation

Theme D — Detection and data quality (intermediate)¶

#	Topic	Question	Difficulty	Key technical angle
D1	Anomaly detection	Suspicious trips — impossible distances, weird fares, route detours.	★★	Statistical outlier rules, zone distance matrix

Theme E — External joins (advanced, bonus prestige)¶

#	Topic	Question	Difficulty	Key technical angle
E1	Weather impact	Does weather (rain, snow, heat) affect demand, fare, tips?	★★★	Join with NOAA daily weather data
E2	Geospatial hotspots	Hotspot evolution over time, with NYC taxi zone shapefiles.	★★★	Spatial join, geographic visualisation
E3	Special events impact	Quantify impact of NYE, parades, sports finals on taxi flows.	★★★	Manually-curated event list, temporal join

Project deliverables¶

Written report (PDF) — ~10 pages, English. Structure: research question, data, methodology, Spark pipeline, results, discussion, conclusion. Due 1 week before the restitution.
Code repository (Git) — Spark scripts, README explaining how to reproduce, requirements file. Same deadline.
Oral defense — 10 minutes presentation + 5 minutes Q/A, during the final restitution session.

Project grading rubric (40 % of the chapter grade)¶

Criterion	Weight
Relevance and depth of the analysis	30 %
Quality of the Spark code (readability, idioms, performance)	25 %
Quality of the report (structure, visualisations, conclusions)	25 %
Quality of the defense (clarity, answers to questions)	20 %

Mid-term exam¶

When week 9, 1 hour.
Format on paper, no machine, no notes.
Coverage sessions S1 to S7 (Big Data context, HDFS, MapReduce, MrJob). Spark is not examined.
Weight 20 % of the chapter grade.

Sample question types: design a MapReduce decomposition for a given analytical question; explain HDFS replication on a diagram; reason about a memory bottleneck given a pandas workflow.

Final restitution¶

When week 12, 2 hours, two parallel sessions (5 groups per jury).
Format 10 min presentation + 5 min Q/A per group.
Audience teaching team + the other 4 groups in the same session.

Total chapter assessment:

Component	Weight
Mid-term exam	20 %
Lab report (Module B/C)	20 %
Group project (report + code + defense)	40 %
Other 20 % is provided by Chapter 2 (NoSQL lab report)	—

Infrastructure and dataset¶

Docker cluster¶

A 1-master + 2-slaves Hadoop 3.5 + Spark 4.1 cluster, packaged as a Docker image, is provided for the chapter.
It runs identically on macOS, Linux and Windows (Docker Desktop or Docker Engine + WSL2).
Memory profile is auto-detected (8 GB / 16 GB) at startup.
Setup instructions, web UIs and troubleshooting are documented in the Docker repo’s README.
Students must install Docker Desktop and pull the image before session S4.

NYC Yellow Taxi dataset¶

Source NYC Taxi & Limousine Commission (TLC) — monthly Parquet files, free to use.
Coverage used in this chapter January 2019 — December 2023 (5 years, ~150 M trips, ~15 GB).
Subsets per session
- S2 (pandas refresher): 1 month, ~150 MB.
- S3 (hitting the wall): 1 year, ~2 GB.
- S4–S7 (HDFS, MrJob): 1 year, in HDFS.
- S8–S9 (Spark): 3 years, ~6 GB.
- S10–S11 (project): 5 years, ~15 GB.
A helper script download_taxi_data.sh is provided in the Docker repo.

References¶

Books¶

Hadoop: The Definitive Guide, Tom White, 4th edition, O’Reilly, 2015. Available at the Centrale Lyon library.
Learning Spark, Jules Damji et al., 2nd edition, O’Reilly, 2020. Free PDF on the Databricks website.
Spark: The Definitive Guide, Bill Chambers and Matei Zaharia, O’Reilly, 2018.

Documentation¶

Videos¶

Big Data in 5 minutes.
What is Hadoop? — 30 min beginner introduction.
HDFS Tutorial for Beginners — 43 min walk-through.
Apache Spark in 5 minutes.