Session 1 — From Raw Data to Clean Data
Instructor: Stéphane Derrode, Centrale Lyon
Program: Centrale Digital Lab @ Ecole Centrale Lyon
📦 Download all session files — notebook, datasets, requirements
Contents: session1_titanic.ipynb · titanic.csv · port_info.csv · requirements.txt
Overview
| Dataset | Titanic passenger records |
| Duration | 3 hours |
| Format | Jupyter notebook + paper quiz (15 min) |
| Pandas required? | No — introduced in this session |
In this first session, you will learn Pandas through practice on the Titanic dataset. Rather than following a systematic lecture on the library, you will meet each concept when you need it to answer a real question about the data.
Learning objectives
By the end of this session, you will be able to:
- Load a CSV file and perform a first inspection of a DataFrame
- Detect and quantify missing values, and choose an appropriate imputation strategy
- Drop irrelevant columns and encode categorical variables
- Join two DataFrames with pd.merge()
- Aggregate data with groupby() and pivot_table()
Before the session: what you need to do
1. Set up your environment
Make sure Python 3.9+ is installed. Then install the required packages listed in requirements.txt, for example with pip install -r requirements.txt.
2. Download the session files
Download ⬇ session1.zip and extract its contents into a folder called session1/.
3. Launch Jupyter
Start Jupyter from the session1/ folder (for example with jupyter notebook or jupyter lab), then open session1_titanic.ipynb.
4. Check your setup
Run the first code cell (the import cell). If no error appears, you are ready.
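The notebook's import cell is not reproduced on this page. As a rough stand-in, a check along the following lines confirms that the interpreter and the core library are usable. The choice of packages is an assumption: requirements.txt may list others, such as a plotting library.

```python
# Hypothetical setup check: the actual import cell in the notebook may differ.
import sys

import pandas as pd

# The session asks for Python 3.9 or newer.
python_ok = sys.version_info >= (3, 9)

print("Python:", sys.version.split()[0], "(OK)" if python_ok else "(too old)")
print("pandas:", pd.__version__)
```

If either line raises an ImportError or reports an old Python version, revisit step 1 before the session.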
Session content
The notebook is divided into 5 blocks:
| Block | Topic | Key functions |
|---|---|---|
| 1 | First contact with Pandas | read_csv, head, shape, dtypes, describe |
| 2 | Exploring missingness | isnull, info, bar chart of missing values |
| 3 | Cleaning & Encoding | fillna, drop, map, get_dummies |
| 4 | Merging DataFrames | pd.merge, join types |
| 5 | Aggregation & Grouping | groupby, agg, pivot_table, heatmap |
Each block contains exercises (🏋️) with solutions hidden in collapsible cells.
Try to solve them before expanding the solution.
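To give a feel for Blocks 1 and 2, here is a minimal sketch of loading a CSV and quantifying missing values. It reads a tiny made-up sample from a string so that it runs without the session files; the column names follow common Titanic conventions and are an assumption about what titanic.csv actually contains.

```python
import io

import pandas as pd

# Tiny made-up sample in the spirit of titanic.csv; the real file has more
# rows and columns, and its exact column names may differ.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Embarked
1,0,3,male,22,S
2,1,1,female,38,C
3,1,3,female,,S
4,0,2,male,35,
5,0,3,male,,Q
"""

# Block 1: load the data and take a first look.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
print(df.shape)   # (rows, columns)
print(df.dtypes)

# Block 2: missing values per column, as a count and as a share of rows.
missing_count = df.isnull().sum()
missing_share = df.isnull().mean()
print(missing_count[missing_count > 0])
```

In the notebook, the same calls would be applied to pd.read_csv("titanic.csv") instead of the in-memory sample.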
Quiz
A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers the concepts introduced in the notebook:
- True/False questions on Pandas behaviour and join semantics
- Multiple choice on missing value strategies and encoding choices
- Short interpretation questions on groupby output and pivot tables
💡 Tip: Focus on understanding why each operation is done, not on memorising syntax. The quiz tests reasoning, not code recall.
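As an illustration of the kind of output the interpretation questions refer to, the sketch below aggregates a small made-up table with groupby and pivot_table. The data and column names are assumptions chosen for the example, not an extract of the quiz or of titanic.csv.

```python
import pandas as pd

# Made-up records; column names follow common Titanic conventions and are
# an assumption about titanic.csv.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Survived": [0, 1, 1, 1, 0, 1],
})

# groupby + mean on a 0/1 column gives the survival rate of each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# pivot_table lays out the same kind of aggregate on two axes:
# one row per Sex, one column per Pclass.
rate_table = df.pivot_table(
    values="Survived", index="Sex", columns="Pclass", aggfunc="mean"
)
print(rate_table)
```

When reading such output, each value is the mean of Survived within its group, that is, the share of survivors among the passengers who fall in that row (and column).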
Key concepts to remember
- Always work on a copy — never modify the raw DataFrame in place
- Understand missingness before imputing — the right strategy depends on the context
- Binary map vs one-hot encoding: use map for 2 categories, get_dummies for 3+
- Left join preserves all rows from the left: always verify shape after a merge
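The four points above can be sketched together on a small made-up example. The passenger columns and the ports lookup table are hypothetical; the real titanic.csv and port_info.csv may be organised differently, and median imputation is only one possible strategy.

```python
import pandas as pd

# Made-up passenger data; column names are assumptions about titanic.csv.
raw = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, 30.0],
    "Embarked": ["S", "C", "Q", "S"],
})
# Made-up lookup table; the real port_info.csv may use other columns.
ports = pd.DataFrame({
    "Embarked": ["S", "C"],
    "PortName": ["Southampton", "Cherbourg"],
})

# Work on a copy so the raw DataFrame stays untouched.
df = raw.copy()

# One possible imputation: fill missing ages with the median age.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Two categories: a binary map is enough.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Left join: every row of df is kept; codes with no match get NaN.
rows_before = len(df)
merged = pd.merge(df, ports, on="Embarked", how="left")
assert len(merged) == rows_before  # would grow if ports had duplicate keys

# Three or more categories: one-hot encode with get_dummies.
encoded = pd.get_dummies(merged, columns=["Embarked"])
print(encoded)
```

The row-count check after the merge is the important habit: a left join never drops rows from the left table, but it can silently add rows when the right table contains duplicate keys.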
Coming up in Session 2
→ Session 2 — From Clean Data to Insight
Spotify Tracks dataset · Distributions · Correlations · PCA