Session 1 — From Raw Data to Clean Data

Instructor: Stéphane Derrode, Centrale Lyon
Formation: Centrale Digital Lab @ Ecole Centrale Lyon

📦 Download all session files — notebook, datasets, requirements

⬇ session1.zip

Contents: session1_titanic.ipynb · titanic.csv · port_info.csv · requirements.txt

Overview


Dataset	Titanic passenger records
Duration	3 hours
Format	Jupyter notebook + paper quiz (15 min)
Prior Pandas knowledge required?	No, Pandas is introduced in this session

In this first session, you will learn Pandas through hands-on practice with the Titanic dataset. Instead of a systematic lecture on the library, each concept is introduced when you need it to answer a real question about the data.

Learning objectives

By the end of this session, you will be able to:

Load a CSV file and perform a first inspection of a DataFrame
Detect and quantify missing values, and choose an appropriate imputation strategy
Drop irrelevant columns and encode categorical variables
Join two DataFrames with pd.merge()
Aggregate data with groupby() and pivot_table()
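
As a rough illustration of the first two objectives, the sketch below loads a tiny inline CSV in place of titanic.csv, inspects it, counts missing values, and fills the gaps. The rows and the column names (Age, Fare, Embarked) are made up for illustration, and the real file may differ. Median and mode imputation are only one possible choice here, not necessarily the strategy used in the notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for titanic.csv: a few made-up rows with gaps.
CSV_TEXT = """PassengerId,Sex,Age,Fare,Embarked
1,male,22,7.25,S
2,female,,71.28,C
3,female,26,7.92,
4,male,35,8.05,S
5,male,,8.46,Q
"""

raw = pd.read_csv(io.StringIO(CSV_TEXT))

# First inspection: size, column types, first rows.
print(raw.shape)
print(raw.dtypes)
print(raw.head())

# Quantify missingness per column, as a count and as a share of rows.
missing_count = raw.isnull().sum()
missing_share = raw.isnull().mean()
print(pd.DataFrame({"count": missing_count, "share": missing_share}))

# Impute on a copy: median for a numeric column, mode for a categorical one.
clean = raw.copy()
clean["Age"] = clean["Age"].fillna(clean["Age"].median())
clean["Embarked"] = clean["Embarked"].fillna(clean["Embarked"].mode()[0])
print(clean)
```

With the real file, the io.StringIO(...) argument would simply be replaced by the path to titanic.csv.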

Before the session: what you need to do

1. Set up your environment

Make sure Python 3.9+ is installed. Then install the required packages:

pip install pandas numpy matplotlib seaborn jupyter

2. Download the session files

Download ⬇ session1.zip and extract its contents into a folder called session1/.

3. Launch Jupyter

jupyter notebook

Then open session1_titanic.ipynb.

4. Check your setup

Run the first code cell (the import cell). If no error appears, you are ready.
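
If you want to check the installation outside the notebook, a small script along the following lines can help. It is only a sketch and not the notebook's import cell: the helper name check_environment is hypothetical, and the package list is copied from the pip command above (jupyter is left out because it is a launcher and is not imported by the analysis code).

```python
import importlib.util
import sys

# Packages taken from the pip install command above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def check_environment(packages=REQUIRED):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


# The session asks for Python 3.9 or later.
python_ok = sys.version_info >= (3, 9)
print("Python", sys.version.split()[0], "ok" if python_ok else "TOO OLD")

status = check_environment()
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any package reported as MISSING can be installed with pip before the session.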

Session content

The notebook is divided into 5 blocks:

Block	Topic	Key functions
1	First contact with Pandas	`read_csv`, `head`, `shape`, `dtypes`, `describe`
2	Exploring missingness	`isnull`, `info`, bar chart of missing values
3	Cleaning & Encoding	`fillna`, `drop`, `map`, `get_dummies`
4	Merging DataFrames	`pd.merge`, join types
5	Aggregation & Grouping	`groupby`, `agg`, `pivot_table`, heatmap

Each block contains exercises (🏋️) with solutions hidden in collapsible cells.
Try to solve them before expanding the solution.
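
To give a feel for Block 5, here is a minimal sketch of groupby, agg and pivot_table on a handful of made-up rows. The column names (Sex, Pclass, Survived, Fare) follow the usual Titanic layout, which is an assumption about the course file, and the values are invented.

```python
import pandas as pd

# Made-up rows; column names mimic the usual Titanic layout (an assumption).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
})

# groupby + agg: one row per class, one column per named aggregation.
by_class = df.groupby("Pclass").agg(
    passengers=("Survived", "count"),
    survival_rate=("Survived", "mean"),
    mean_fare=("Fare", "mean"),
)
print(by_class)

# pivot_table: survival rate with Sex as rows and Pclass as columns.
# Combinations with no passengers (here: male, class 2) show up as NaN.
rates = df.pivot_table(values="Survived", index="Sex", columns="Pclass", aggfunc="mean")
print(rates)
```

A two-way table such as rates is the kind of object that a heatmap function (for example seaborn.heatmap) can display directly.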

Quiz

A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers the concepts introduced in the notebook:

True/False questions on Pandas behaviour and join semantics
Multiple choice on missing value strategies and encoding choices
Short interpretation questions on groupby output and pivot tables

💡 Tip: Focus on understanding why each operation is done, not on memorising syntax. The quiz tests reasoning, not code recall.

Key concepts to remember

Always work on a copy: never modify the raw DataFrame in place
Understand missingness before imputing: the right strategy depends on why the values are missing
Binary map vs one-hot encoding: use map for two categories and get_dummies for three or more
A left join keeps every row of the left DataFrame, but duplicate keys on the right can add rows: always verify the shape after a merge
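
The points above can be sketched together in a few lines. The passenger rows and the port lookup table below are hypothetical (the actual columns of titanic.csv and port_info.csv may differ), and the code "X" is deliberately left without a match to show what a left join does with it.

```python
import pandas as pd

# Made-up passenger rows and a hypothetical port lookup table.
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "X"],  # "X" has no match in the lookup
})
ports = pd.DataFrame({
    "Embarked": ["S", "C", "Q"],
    "PortName": ["Southampton", "Cherbourg", "Queenstown"],
})

# Work on a copy so the raw DataFrame stays untouched.
df = raw.copy()

# Two categories: a binary map yields a single 0/1 column.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Left join: every row of df is kept; unmatched keys get NaN in PortName.
merged = pd.merge(df, ports, on="Embarked", how="left")
assert len(merged) == len(df), "row count changed: duplicate keys on the right?"

# Three or more categories: one-hot encoding yields one column per category.
encoded = pd.get_dummies(merged, columns=["Embarked"])
print(encoded)
```

The length check after the merge is the "verify the shape" habit in practice: it fails loudly if the right-hand table contains duplicate keys.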

Coming up in Session 2

→ Session 2 — From Clean Data to Insight

Spotify Tracks dataset · Distributions · Correlations · PCA