Aller au contenu

Content for Chapter 1, Big Data Technologies

This chapter takes students from familiar single-machine data analysis (Python + SQL) to distributed processing of multi-gigabyte datasets, using a single fil rouge: NYC Yellow Taxi trip records (2019–2023). Each new technology is introduced when the previous one hits its limits — students experience the wall before they get the tool.

Table of contents

Learning objectives

By the end of this chapter, students should be able to:

  1. Explain the technical reasons (memory, I/O, CPU) why single-machine data analysis breaks down on large datasets, and describe the core ideas of distributed storage and computation.
  2. Operate a 3-node Hadoop + Spark cluster (HDFS, YARN, Spark on YARN) provided as a Docker image.
  3. Design a MapReduce job from a high-level analytical question, and implement it in Python with the MrJob library.
  4. Process multi-gigabyte datasets with Spark — using both the DataFrame API and Spark SQL — and reason about partitioning, lazy evaluation and shuffling.
  5. Compare the three paradigms (pandas, MapReduce, Spark) and choose the appropriate tool for a given problem and scale.
  6. Conduct a small data analysis project end-to-end: pick a research question, design and implement a Spark pipeline on real-world taxi data, write a report and defend the work orally.

Prerequisites

Topic Expected level
Python Comfortable with pandas, numpy, list/dict manipulations, virtual environments.
SQL Comfortable writing SELECT … GROUP BY … JOIN queries.
Command line Comfortable in a Unix shell (cd, ls, pipes, redirection).
Docker Desktop Installed on the personal machine before session S4 (free download, ~1 GB).
Disk space ≈ 20 GB free for the Docker image + NYC Taxi data + HDFS volumes.

Architecture of the chapter

Module Sessions Duration Theme
A — From single-machine to scale S1–S3 6 h Big Data context, pandas refresher, hitting the wall
B — Distributed storage and MapReduce S4–S7 8 h HDFS, MapReduce paradigm, MrJob in practice
C — Spark S8–S9 4 h RDDs, DataFrames, Spark SQL
D — Group project S10–S11 4 h Supervised project work
Mid-term exam week 9 1 h Covers S1–S7 (pre-Spark)
Final restitution week 12 2 h Project defenses, 2 parallel jurys
Total 25 h

Sessions are 2 hours each. The fil rouge — NYC Yellow Taxi data — appears from S1 to S11.

Module A — From single-machine to scale

S1 — What is “Big Data”? (Lecture, 2 h)

  • Type Lecture, no lab.
  • Objectives
    • Define Big Data through the 5Vs (Volume, Velocity, Variety, Veracity, Value).
    • Map the modern Big Data ecosystem (storage, processing, NoSQL, cloud).
    • Identify when Big Data tools are the right answer — and when they are overkill.
    • Discover the NYC Yellow Taxi dataset that will serve as fil rouge.
  • Outline
    1. Brief history: from BI/data warehouses to distributed systems.
    2. The 5Vs, with examples.
    3. Ecosystem map: HDFS, MapReduce, Spark, NoSQL, cloud data lakes.
    4. “Do I really need Big Data?” — decision flowchart.
    5. Tour of the NYC Taxi & Limousine Commission (TLC) trip-record dataset.
  • Resources slides outline slides/session_01.md.
  • No deliverable.

S2 — Pandas refresher on NYC Taxi (Lab, 2 h)

  • Type Lab, students work on their own machine.
  • Objectives
    • Reactivate pandas reflexes on a real-world dataset.
    • Read parquet files, type-check columns, handle missing values.
    • Answer five analytical questions with idiomatic pandas.
  • Outline
    1. Download a single month of yellow taxi data (~150 MB, ~3 M rows).
    2. Inspect schema and data quality.
    3. Five guided questions:
      • Average fare amount per hour of the day.
      • Top 10 pickup zones by trip count.
      • Tip percentage by payment type.
      • Trip duration distribution; flag obvious outliers.
      • Daily revenue trend.
    4. Free exploration (10 minutes).
  • Resources TP TP_BigData_English/lab_02_pandas/, slides outline slides/session_02.md.
  • Deliverable Jupyter notebook submitted on the LMS by end of session.

S3 — Hitting the wall (Lecture + Lab, 2 h)

  • Type Mixed (45 min lecture, 75 min lab).
  • Objectives
    • Experience first-hand the failure modes of single-machine analytics.
    • Profile memory consumption and identify bottlenecks.
    • Understand sampling, chunking and partitioning as first-step mitigations.
    • Build the intuition for distributed computation.
  • Outline
    1. Lecture: vertical vs horizontal scaling; memory hierarchy; I/O is the bottleneck.
    2. Lab: try to load one full year of yellow taxi data (~36 M rows, ~2 GB) into pandas — observe the crash on most laptops.
    3. Mitigations: chunksize, sampling, column pruning, dtype downcasting.
    4. Bridge to next module: “what if storage and compute were spread across machines?”
  • Resources TP TP_BigData_English/lab_03_limits/, slides outline slides/session_03.md.
  • Deliverable Short markdown note (½ page) describing the bottleneck observed and one mitigation tried.

Module B — Distributed storage and MapReduce

S4 — HDFS and the Docker cluster (Lecture + Lab, 2 h)

  • Type Mixed (45 min lecture, 75 min hands-on cluster setup).
  • Objectives
    • Describe the HDFS architecture (NameNode, DataNodes, blocks, replication).
    • Explain “schema-on-read” and why it matters at scale.
    • Bring up the 3-node Hadoop/Spark cluster on the personal laptop.
    • Push and inspect data in HDFS via the CLI and the NameNode UI.
  • Outline
    1. Lecture: distributed file systems, GFS heritage, HDFS design choices, fault tolerance.
    2. Hands-on: docker compose up -d, docker compose exec hadoop-master bash, check-cluster.sh.
    3. HDFS tour: hadoop fs -mkdir, -put, -ls, -cat; navigate the NameNode UI on http://localhost:9870.
    4. Push one month of yellow taxi data into HDFS, observe block distribution across the 2 DataNodes.
  • Resources TP TP_BigData_English/lab_04_hdfs/, Docker repo, slides outline slides/session_04.md.
  • Deliverable Screenshot of the NameNode UI showing taxi data correctly distributed across both DataNodes.

S5 — The MapReduce paradigm (Lecture, 2 h)

  • Type Lecture + pen-and-paper exercises (no machine).
  • Objectives
    • Explain the MapReduce computational model: map, shuffle, reduce.
    • Decompose an analytical question into map and reduce functions.
    • Recognise problems that fit MapReduce naturally and those that don’t.
  • Outline
    1. The original Google MapReduce paper — key ideas.
    2. WordCount as the canonical example.
    3. Decomposition exercises (pen-and-paper): “average fare per pickup zone”, “top 10 zones by tip percentage”, “trip count by hour-of-day”.
    4. Limits: iterative algorithms, joins, the cost of shuffling.
    5. Bridge to Spark.
  • Resources slides outline slides/session_05.md, exercises TP_BigData_English/exercises_05_mapreduce/.
  • Deliverable Decomposition sheet (provided as a worksheet) handed in at end of session.

S6 — MrJob in practice (Lab, 2 h)

  • Type Lab.
  • Objectives
    • Write and run a first MrJob job (WordCount).
    • Understand the MrJob workflow: local run, Hadoop streaming, debugging.
    • Implement a simple aggregation MrJob on a small taxi sample.
  • Outline
    1. MrJob installation (already in the Docker image).
    2. WordCount on Bram Stoker’s Dracula, run locally then on Hadoop.
    3. First taxi job: count trips per pickup zone, on a 1-month sample.
    4. Multi-step MrJob: average fare per pickup zone (two reducers chained).
    5. Debugging tour: counters, logs, YARN UI.
  • Resources TP TP_BigData_English/lab_06_mrjob_basics/, slides outline slides/session_06.md.
  • Deliverable Two MrJob scripts (single-step and multi-step) committed to the student’s repo.

S7 — MrJob at scale on NYC Taxi (Lab, 2 h)

  • Type Lab.
  • Objectives
    • Run MrJob on one full year of taxi data via HDFS.
    • Reproduce in MapReduce the questions answered in pandas in S2.
    • Compare the two approaches: code length, runtime, mental model.
  • Outline
    1. Push 1 year of data into HDFS (~2 GB).
    2. Implement the five S2 questions in MrJob.
    3. Watch jobs in YARN UI; identify slow stages.
    4. Comparison table (pandas vs MapReduce) filled in by each student.
  • Resources TP TP_BigData_English/lab_07_mrjob_taxi/, slides outline slides/session_07.md.
  • Deliverable Comparison table + repo containing the 5 MrJob scripts.

Module C — Spark

S8 — Spark introduction (Lecture + Lab, 2 h)

  • Type Mixed (45 min lecture, 75 min lab).
  • Objectives
    • Explain why Spark replaced MapReduce as the default tool: in-memory, lazy evaluation, richer API.
    • Describe the SparkContext / SparkSession, RDDs and DataFrames.
    • Run a first PySpark script on YARN.
  • Outline
    1. Lecture: Spark architecture, driver vs executors, lazy evaluation, DAG.
    2. RDD vs DataFrame API.
    3. Hands-on: open a pyspark shell on the master, do basic transformations on a small RDD.
    4. Submit a first script with spark-submit --master yarn.
  • Resources TP TP_BigData_English/lab_08_spark_intro/, slides outline slides/session_08.md.
  • Deliverable First PySpark script counting trips per pickup zone, submitted to YARN.

S9 — Spark DataFrames and Spark SQL (Lab, 2 h)

  • Type Lab.
  • Objectives
    • Use the Spark DataFrame API and Spark SQL fluently.
    • Process multi-year taxi data (3–5 years, ~10 GB).
    • Build a side-by-side comparison: pandas vs MapReduce vs Spark for the same five questions.
  • Outline
    1. DataFrames: read.parquet, filter, groupBy, agg, join, withColumn.
    2. Same as Spark SQL via spark.sql("SELECT ... GROUP BY ...").
    3. Re-implement the 5 questions of S2/S7 on 3 years of data.
    4. Performance discussion: caching, partitioning, broadcast joins.
    5. Final comparison table: lines of code, runtime, ease of debugging.
  • Resources TP TP_BigData_English/lab_09_spark_sql/, slides outline slides/session_09.md.
  • Deliverable Repo with the Spark scripts + comparison table (pandas/MapReduce/Spark).

Module D — Group project

S10 — Project kickoff (Supervised, 2 h)

  • Type Supervised group work.
  • Objectives
    • Form groups of 3 students (10 groups for a class of 30).
    • Choose a research question (from the catalogue or a free topic, subject to teacher validation).
    • Scope the question, design the Spark pipeline, identify required data subsets.
    • Set up the project repo (Git, README, environment).
  • Outline
    1. Group formation (15 min).
    2. Topic selection — multiple groups may pick the same topic (5 min).
    3. Scoping workshop with the teacher (30 min).
    4. Pipeline design and repo bootstrap (70 min).
  • Deliverable One-page project proposal validated by the teacher at end of session.

S11 — Project work (Supervised, 2 h)

  • Type Supervised group work.
  • Objectives
    • Implement the Spark pipeline.
    • Produce preliminary results.
    • Prepare the report skeleton and the defense slides.
  • Outline Free, with the teacher rotating between groups.
  • Deliverable Working pipeline + draft report + draft slides at end of session. Final deliverables are due 1 week before the restitution.

Project topic catalogue

Twelve topics are proposed. Multiple groups may choose the same topic. A group may also propose its own topic, subject to teacher approval at S10.

Theme A — Descriptive analytics (accessible)
# Topic Question Difficulty Key technical angle
A1 Demand patterns When and where is demand highest? Hour × day × zone heatmaps. Multi-dim groupBy, visualisation
A2 Tip behaviour What drives tip %? Card vs cash, route, time, distance, zone? ★★ Aggregation + statistical comparison
A3 COVID impact How did demand collapse in 2020 and recover through 2023? Time-series aggregation, before/after
A4 Yellow vs Green vs FHV How do the 3 fleets differ in service patterns and zones served? ★★ Multi-dataset join (3 TLC sources)
Theme B — Optimisation and decision support (intermediate)
# Topic Question Difficulty Key technical angle
B1 Driver income optimisation Where and when should a driver work to maximise hourly revenue? ★★ Multi-dim aggregation + ranking
B2 Origin-destination flows Top NYC commute flows; OD matrix; evolution over years. ★★ Pairwise aggregation, matrix output
Theme C — Predictive ML at scale (advanced)
# Topic Question Difficulty Key technical angle
C1 Tip prediction Predict tip % from pickup features. ★★★ Spark MLlib pipeline, feature engineering
C2 Trip duration prediction Predict ETA given zone-pair, hour, weekday. ★★★ Spark MLlib, regression + evaluation
Theme D — Detection and data quality (intermediate)
# Topic Question Difficulty Key technical angle
D1 Anomaly detection Suspicious trips — impossible distances, weird fares, route detours. ★★ Statistical outlier rules, zone distance matrix
Theme E — External joins (advanced, bonus prestige)
# Topic Question Difficulty Key technical angle
E1 Weather impact Does weather (rain, snow, heat) affect demand, fare, tips? ★★★ Join with NOAA daily weather data
E2 Geospatial hotspots Hotspot evolution over time, with NYC taxi zone shapefiles. ★★★ Spatial join, geographic visualisation
E3 Special events impact Quantify impact of NYE, parades, sports finals on taxi flows. ★★★ Manually-curated event list, temporal join

Project deliverables

  1. Written report (PDF) — ~10 pages, English. Structure: research question, data, methodology, Spark pipeline, results, discussion, conclusion. Due 1 week before the restitution.
  2. Code repository (Git) — Spark scripts, README explaining how to reproduce, requirements file. Same deadline.
  3. Oral defense — 10 minutes presentation + 5 minutes Q/A, during the final restitution session.

Project grading rubric (40 % of the chapter grade)

Criterion Weight
Relevance and depth of the analysis 30 %
Quality of the Spark code (readability, idioms, performance) 25 %
Quality of the report (structure, visualisations, conclusions) 25 %
Quality of the defense (clarity, answers to questions) 20 %

Mid-term exam

  • When week 9, 1 hour.
  • Format on paper, no machine, no notes.
  • Coverage sessions S1 to S7 (Big Data context, HDFS, MapReduce, MrJob). Spark is not examined.
  • Weight 20 % of the chapter grade.

Sample question types: design a MapReduce decomposition for a given analytical question; explain HDFS replication on a diagram; reason about a memory bottleneck given a pandas workflow.

Final restitution

  • When week 12, 2 hours, two parallel sessions (5 groups per jury).
  • Format 10 min presentation + 5 min Q/A per group.
  • Audience teaching team + the other 4 groups in the same session.

Total chapter assessment:

Component Weight
Mid-term exam 20 %
Lab report (Module B/C) 20 %
Group project (report + code + defense) 40 %
Other 20 % is provided by Chapter 2 (NoSQL lab report)

Infrastructure and dataset

Docker cluster

  • A 1-master + 2-slaves Hadoop 3.5 + Spark 4.1 cluster, packaged as a Docker image, is provided for the chapter.
  • It runs identically on macOS, Linux and Windows (Docker Desktop or Docker Engine + WSL2).
  • Memory profile is auto-detected (8 GB / 16 GB) at startup.
  • Setup instructions, web UIs and troubleshooting are documented in the Docker repo’s README.
  • Students must install Docker Desktop and pull the image before session S4.

NYC Yellow Taxi dataset

  • Source NYC Taxi & Limousine Commission (TLC) — monthly Parquet files, free to use.
  • Coverage used in this chapter January 2019 — December 2023 (5 years, ~150 M trips, ~15 GB).
  • Subsets per session
    • S2 (pandas refresher): 1 month, ~150 MB.
    • S3 (hitting the wall): 1 year, ~2 GB.
    • S4–S7 (HDFS, MrJob): 1 year, in HDFS.
    • S8–S9 (Spark): 3 years, ~6 GB.
    • S10–S11 (project): 5 years, ~15 GB.
  • A helper script download_taxi_data.sh is provided in the Docker repo.

References

Books

  • Hadoop: The Definitive Guide, Tom White, 4th edition, O’Reilly, 2015. Available at the Centrale Lyon library.
  • Learning Spark, Jules Damji et al., 2nd edition, O’Reilly, 2020. Free PDF on the Databricks website.
  • Spark: The Definitive Guide, Bill Chambers and Matei Zaharia, O’Reilly, 2018.

Documentation

Videos