Session 2 — From Clean Data to Insight

Instructor: Stéphane Derrode, Centrale Lyon
Program: Centrale Digital Lab @ Ecole Centrale Lyon
Back to course index


📦 Download all session files — notebook and dataset

⬇ session2.zip

Contents: session2_spotify.ipynb · spotify_tracks.csv


Overview

Dataset Spotify Tracks (~3,000 tracks, 8 genres, 19 audio features)
Duration 3 hours
Format Jupyter notebook + paper quiz (15 min)
New libraries Seaborn, scikit-learn (StandardScaler, PCA)

In this session, you move from a clean dataset to meaningful insights. You will learn to choose the right visualisation for each question, measure relationships between variables, and reduce dimensionality to reveal hidden structure in the data.


Learning objectives

By the end of this session, you will be able to:

  • Describe the distribution of a variable and identify skewness, outliers, and modes
  • Compare distributions across groups using boxplots
  • Compute and interpret a correlation matrix
  • Read and interpret a pair plot
  • Standardise features and explain why it is necessary before PCA
  • Apply PCA, interpret the scree plot and the loading matrix

Before the session — what you need to do

1. Verify your environment

The packages from Session 1 must already be installed. Additionally:

pip install scikit-learn
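To confirm the install worked, you can try importing the package and printing its version:

```shell
python -c "import sklearn; print(sklearn.__version__)"
```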

2. Download the session files

⬇ session2.zip

3. Launch Jupyter and open the notebook

jupyter notebook session2_spotify.ipynb

4. Run the setup cell

Run the first code cell. You should see: All imports OK.
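If the setup cell fails, check that it imports everything the session needs. A minimal version (a hypothetical reconstruction; the notebook's actual cell may differ) looks like:

```python
# Minimal setup cell: libraries used across Blocks 1-4
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

print("All imports OK")
```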


Session content

The notebook is divided into 4 blocks:

Block Topic Key tools
1 Dataset discovery Inspection, cleaning, genre/decade distributions
2 Univariate analysis hist(), sns.boxplot(), reading distributions
3 Bivariate analysis & Correlations Scatter plots, .corr(), sns.heatmap(), sns.pairplot()
4 PCA StandardScaler, PCA, scree plot, loadings, 2D projection

Each block contains exercises (🏋️) with collapsible solutions.

💡 Block 4 includes the linear algebra behind PCA — the covariance matrix, eigendecomposition, and the projection formula. Read it carefully before the exercises.
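As a preview of Block 3's toolkit, the sketch below builds a small synthetic dataset with one deliberately correlated pair, computes the correlation matrix with `.corr()`, and draws it with `sns.heatmap()`. Column names are illustrative, not those of the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic stand-in for spotify_tracks.csv
df = pd.DataFrame({"energy": rng.uniform(0, 1, 300)})
df["loudness"] = -25 + 20 * df["energy"] + rng.normal(0, 2, 300)  # correlated
df["tempo"] = rng.normal(120, 25, 300)                            # independent

corr = df.corr()  # Pearson correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
print(corr.loc["energy", "loudness"].round(2))
```

On the heatmap you should see one strongly positive off-diagonal cell (energy/loudness) and near-zero cells for tempo.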


Background reading — PCA

PCA is built on three linear algebra concepts you should be comfortable with:

  • Matrix multiplication — the projection \(Z = X \cdot V_k\)
  • Covariance — \(\Sigma_{ij}\) measures how features \(i\) and \(j\) vary together
  • Eigenvectors/eigenvalues — directions and magnitudes of maximum variance

No need to derive them from scratch — the notebook walks through each step.
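The three concepts chain together: standardise, form the covariance matrix \(\Sigma\), take its eigendecomposition, keep the top-\(k\) eigenvectors as \(V_k\), and project with \(Z = X \cdot V_k\). The sketch below does this by hand on random data and cross-checks against scikit-learn (component signs may flip, hence the absolute-value comparison):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

Xs = StandardScaler().fit_transform(X)   # standardise first

# Manual PCA: covariance matrix -> eigendecomposition -> projection
Sigma = np.cov(Xs, rowvar=False)         # Sigma_ij
eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort descending by variance
V_k = eigvecs[:, order[:2]]              # top-2 directions
Z = Xs @ V_k                             # Z = X . V_k

# Cross-check against scikit-learn (columns agree up to sign)
Z_sk = PCA(n_components=2).fit_transform(Xs)
print(np.allclose(np.abs(Z), np.abs(Z_sk), atol=1e-6))
```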


Quiz

A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers:

  • True/False on distribution shapes and PCA properties
  • Multiple choice on visualisation choices, correlation interpretation, and standardisation
  • Short questions: reading a scree plot, interpreting PCA loadings, identifying a correlation error

💡 Tip: Be ready to interpret a loading table and explain what a principal component “represents” musically. The quiz does not ask you to reproduce formulas.


Key concepts to remember

  • Boxplot anatomy — box = IQR (Q1 to Q3), line = median, whiskers = furthest points within 1.5×IQR of the box, dots = outliers beyond the whiskers
  • Correlation ≠ causation — a strong r only means two variables move together linearly
  • Standardisation is mandatory before PCA — features with large ranges dominate otherwise
  • Loadings explain PCA — the weight of each original feature in each component
  • Scree plot + cumulative variance — use both to choose the number of components
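The last point can be made concrete: `explained_variance_ratio_` gives the scree plot heights, and its cumulative sum tells you how many components reach a given variance threshold. A minimal sketch on synthetic data (the 90% threshold is one common choice, not a rule):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))  # 8 correlated features
Xs = StandardScaler().fit_transform(X)

pca = PCA().fit(Xs)
ratios = pca.explained_variance_ratio_   # scree plot bar heights
cum = np.cumsum(ratios)                  # cumulative variance curve
k = int(np.searchsorted(cum, 0.90) + 1)  # smallest k reaching 90% variance
print(k, cum.round(2))
```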

Coming up in Session 3

Session 3 — From Insight to Decision

Heart Disease UCI dataset · Feature engineering · Classification · Model evaluation · Ethics