
Session 1 — From Raw Data to Clean Data

Instructor: Stéphane Derrode, Centrale Lyon
Program: Centrale Digital Lab @ Ecole Centrale Lyon


📦 Download all session files — notebook, datasets, requirements

⬇ session1.zip

Contents: session1_titanic.ipynb · titanic.csv · port_info.csv · requirements.txt


Overview

Dataset: Titanic passenger records
Duration: 3 hours
Format: Jupyter notebook + paper quiz (15 min)
Prior Pandas knowledge required: No (introduced in this session)

In this first session, you will learn Pandas through practice on the Titanic dataset. Rather than following a systematic lecture on the library, you will meet each concept when you need it to answer a real question about the data.


Learning objectives

By the end of this session, you will be able to:

  • Load a CSV file and perform a first inspection of a DataFrame
  • Detect and quantify missing values, and choose an appropriate imputation strategy
  • Drop irrelevant columns and encode categorical variables
  • Join two DataFrames with pd.merge()
  • Aggregate data with groupby() and pivot_table()
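
To give a feel for the first two objectives, here is a minimal sketch of loading, inspecting and imputing. It is not taken from the notebook: the rows are invented, the column names (Name, Sex, Age, Fare, Embarked) are assumptions about titanic.csv, and the CSV is read from an inline string so the example runs without any file.

```python
import io
import pandas as pd

# Tiny inline stand-in for titanic.csv (hypothetical rows and column names).
csv_text = """Name,Sex,Age,Fare,Embarked
Alice,female,29,71.3,C
Bob,male,,8.05,S
Carol,female,35,53.1,S
Dan,male,54,51.9,
"""

raw = pd.read_csv(io.StringIO(csv_text))

# First inspection: size, column types, first rows.
print(raw.shape)
print(raw.dtypes)
print(raw.head())

# Quantify missing values: number of NaN entries per column.
missing_counts = raw.isnull().sum()
print(missing_counts)

# Impute on a copy so the raw data stays untouched.
# Median for a numeric column, most frequent value for a categorical one.
clean = raw.copy()
clean["Age"] = clean["Age"].fillna(clean["Age"].median())
clean["Embarked"] = clean["Embarked"].fillna(clean["Embarked"].mode()[0])
print(clean)
```

Median and mode are only two possible choices here; which strategy is appropriate depends on why the values are missing, which is what the session explores.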

Before the session — what you need to do

1. Set up your environment

Make sure Python 3.9+ is installed. Then install the required packages:

pip install pandas numpy matplotlib seaborn jupyter

2. Download the session files

Download ⬇ session1.zip and extract its contents into a folder called session1/.

3. Launch Jupyter

jupyter notebook

Then open session1_titanic.ipynb.

4. Check your setup

Run the first code cell (the import cell). If no error appears, you are ready.
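
If you want to check your environment before opening the notebook, a sketch like the following may help. It is a hypothetical helper, not the notebook's import cell: it only tests whether the libraries from the install command above can be found, and whether the Python version is at least 3.9.

```python
import importlib.util
import sys

# Libraries from the pip command above (the list is an assumption
# about what the notebook actually imports).
required = ["pandas", "numpy", "matplotlib", "seaborn"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

python_ok = sys.version_info >= (3, 9)

print("Python version OK:", python_ok)
print("Missing packages:", missing if missing else "none")
```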


Session content

The notebook is divided into 5 blocks:

Block  Topic                       Key functions
1      First contact with Pandas   read_csv, head, shape, dtypes, describe
2      Exploring missingness       isnull, info, bar chart of missing values
3      Cleaning & Encoding         fillna, drop, map, get_dummies
4      Merging DataFrames          pd.merge, join types
5      Aggregation & Grouping      groupby, agg, pivot_table, heatmap

Each block contains exercises (🏋️) with solutions hidden in collapsible cells.
Try to solve them before expanding the solution.
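
As a preview of the kind of question Block 5 deals with, here is a minimal sketch of groupby, agg and pivot_table. The data frame is invented for illustration and its column names (Pclass, Sex, Survived, Fare) are assumptions, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical mini-dataset; rows and column names are assumptions.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Survived": [1, 0, 1, 0, 1, 0],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.5, 8.5],
})

# groupby + agg: one row per class, several summaries at once.
by_class = df.groupby("Pclass").agg(
    survival_rate=("Survived", "mean"),  # mean of a 0/1 column is a rate
    mean_fare=("Fare", "mean"),
    passengers=("Survived", "count"),    # counts non-missing values
)
print(by_class)

# pivot_table: survival rate with classes as rows and sexes as columns.
rates = df.pivot_table(
    values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)
print(rates)
```

Note that a class/sex combination with no passengers (here, men in class 2) shows up as NaN in the pivot table rather than 0. A two-way table like rates is the sort of object a heatmap displays.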


Quiz

A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers the concepts introduced in the notebook:

  • True/False questions on Pandas behaviour and join semantics
  • Multiple choice on missing value strategies and encoding choices
  • Short interpretation questions on groupby output and pivot tables

💡 Tip: Focus on understanding why each operation is done, not on memorising syntax. The quiz tests reasoning, not code recall.


Key concepts to remember

  • Always work on a copy — never modify the raw DataFrame in place
  • Understand missingness before imputing — the right strategy depends on the context
  • Binary map vs one-hot encoding — use map for 2 categories, get_dummies for 3+
  • Left join preserves all rows from the left — always verify shape after a merge
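
The points above can be sketched together on a toy example. Everything here is illustrative: the rows are invented, and the ports table is a hypothetical stand-in for port_info.csv, whose real columns may differ.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name":     ["Alice", "Bob", "Carol", "Dan"],
    "Sex":      ["female", "male", "female", "male"],
    "Embarked": ["C", "S", "Q", "S"],
})

# Hypothetical lookup table standing in for port_info.csv.
ports = pd.DataFrame({
    "Embarked": ["C", "S"],
    "Port":     ["Cherbourg", "Southampton"],
})

# 1. Work on a copy: raw stays as loaded.
df = raw.copy()

# 2. Two categories: a simple map to 0/1 is enough.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# 3. Left join: every row of df is kept. validate raises an error if
#    the right table has duplicate keys, which would multiply rows.
rows_before = len(df)
merged = pd.merge(df, ports, on="Embarked", how="left", validate="many_to_one")
print(merged.shape)
assert len(merged) == rows_before

# Keys with no match in ports ("Q" here) get NaN in the new column.
print(merged["Port"].isnull().sum())

# 4. Three or more categories: one indicator column per category.
encoded = pd.get_dummies(merged, columns=["Embarked"])
print(encoded.columns.tolist())
```

Checking the shape after the merge catches both silent row duplication and unexpected unmatched keys early, before they distort later aggregations.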

Coming up in Session 2

Session 2 — From Clean Data to Insight

Spotify Tracks dataset · Distributions · Correlations · PCA