Session 2 — From Clean Data to Insight
Instructor: Stéphane Derrode, Centrale Lyon
Formation: Centrale Digital Lab @ Ecole Centrale Lyon
📦 Download all session files — notebook and dataset
Contents: session2_spotify.ipynb, spotify_tracks.csv
Overview¶
| Dataset | Spotify Tracks (~3 000 tracks, 8 genres, 19 audio features) |
| Duration | 3 hours |
| Format | Jupyter notebook + paper quiz (15 min) |
| New libraries | Seaborn, scikit-learn (StandardScaler, PCA) |
In this session, you move from a clean dataset to meaningful insights. You will learn to choose the right visualisation for each question, measure relationships between variables, and reduce dimensionality to reveal hidden structure in the data.
Learning objectives¶
By the end of this session, you will be able to:
- Describe the distribution of a variable and identify skewness, outliers, and modes
- Compare distributions across groups using boxplots
- Compute and interpret a correlation matrix
- Read and interpret a pair plot
- Standardise features and explain why it is necessary before PCA
- Apply PCA, interpret the scree plot and the loading matrix
Before the session — what you need to do¶
1. Verify your environment
The packages from Session 1 must already be installed. In addition, install this session's new libraries, Seaborn and scikit-learn (for example with `pip install seaborn scikit-learn`).
2. Download the session files
3. Launch Jupyter and open the notebook
4. Run the setup cell
Run the first code cell. You should see the message `All imports OK`.
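The setup cell looks roughly like this (a sketch, not the notebook's exact cell — it simply imports every library used in the session and confirms success):

```python
# Session 2 setup cell (sketch): import everything the notebook relies on
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

print("All imports OK")
```

If any import fails, go back to step 1 and install the missing package before continuing.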
Session content¶
The notebook is divided into 4 blocks:
| Block | Topic | Key tools |
|---|---|---|
| 1 | Dataset discovery | Inspection, cleaning, genre/decade distributions |
| 2 | Univariate analysis | hist(), sns.boxplot(), reading distributions |
| 3 | Bivariate analysis & Correlations | Scatter plots, .corr(), sns.heatmap(), sns.pairplot() |
| 4 | PCA | StandardScaler, PCA, scree plot, loadings, 2D projection |
Each block contains exercises (🏋️) with collapsible solutions.
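To give a flavour of Blocks 2 and 3, here is a minimal sketch of the univariate and bivariate tools on toy data. The column names (`energy`, `loudness`, `acousticness`) are assumptions about the dataset; the notebook uses the real `spotify_tracks.csv` columns.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for spotify_tracks.csv (column names are assumptions)
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 300)
df = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 1, 300),   # positively correlated
    "acousticness": 1 - energy + rng.normal(0, 0.1, 300),    # negatively correlated
})

# Block 2: distribution of a single variable
df["energy"].hist(bins=30)

# Block 3: correlation matrix and its heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
print(corr.round(2))
```

Reading the heatmap, you would say `energy` and `loudness` are strongly positively correlated, and `energy` and `acousticness` strongly negatively correlated — exactly the kind of interpretation the exercises ask for.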
💡 Block 4 includes the linear algebra behind PCA — the covariance matrix, eigendecomposition, and the projection formula. Read it carefully before the exercises.
Background reading — PCA¶
PCA is built on three linear algebra concepts you should be comfortable with:
- Matrix multiplication — the projection \(Z = X \cdot V_k\)
- Covariance — \(\Sigma_{ij}\) measures how features \(i\) and \(j\) vary together
- Eigenvectors/eigenvalues — directions and magnitudes of maximum variance
No need to derive them from scratch — the notebook walks through each step.
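The three concepts chain together into the full PCA pipeline. A minimal NumPy sketch on synthetic data (the notebook's version may differ in detail):

```python
import numpy as np

# Synthetic data with a built-in correlation between features 0 and 1
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]

# 1) Centre the data (PCA assumes zero-mean features)
Xc = X - X.mean(axis=0)

# 2) Covariance matrix: Sigma_ij measures how features i and j vary together
Sigma = np.cov(Xc, rowvar=False)

# 3) Eigendecomposition: eigenvectors are the directions of maximum variance,
#    eigenvalues the variance captured along each direction
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Projection Z = X . V_k onto the top k components
k = 2
V_k = eigvecs[:, :k]
Z = Xc @ V_k

print("explained variance ratio:", eigvals / eigvals.sum())
```

This hand-rolled version matches what `sklearn.decomposition.PCA` computes internally (up to component signs), which is a good sanity check to try yourself.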
Quiz¶
A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers:
- True/False on distribution shapes and PCA properties
- Multiple choice on visualisation choices, correlation interpretation, and standardisation
- Short questions: reading a scree plot, interpreting PCA loadings, identifying a correlation error
💡 Tip: Be ready to interpret a loading table and explain what a principal component “represents” musically. The quiz does not ask you to reproduce formulas.
Key concepts to remember¶
- Boxplot anatomy — box = IQR (Q1 to Q3), line = median, whiskers = furthest data points within 1.5×IQR of the box, dots = outliers beyond the whiskers
- Correlation ≠ causation — a strong r only means two variables move together linearly
- Standardisation is mandatory before PCA — features with large ranges dominate otherwise
- Loadings explain PCA — the weight of each original feature in each component
- Scree plot + cumulative variance — use both to choose the number of components
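The last three points can be seen in one short sketch. The features below are toy stand-ins on deliberately different scales (tempo in BPM vs. values in [0, 1]), which is exactly the situation where skipping standardisation distorts PCA:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
tempo = rng.normal(120, 30, n)                 # large range (BPM)
energy = rng.uniform(0, 1, n)                  # small range
valence = 0.6 * energy + rng.normal(0, 0.1, n)
X = np.column_stack([tempo, energy, valence])

# Without standardisation, tempo's huge variance swallows the first component
pca_raw = PCA().fit(X)

# With standardisation, every feature contributes on an equal footing
X_std = StandardScaler().fit_transform(X)
pca_std = PCA().fit(X_std)

print("raw PC1 ratio:    ", pca_raw.explained_variance_ratio_[0].round(3))
print("std cumulative:   ", np.cumsum(pca_std.explained_variance_ratio_).round(3))
print("std PC1 loadings: ", pca_std.components_[0].round(2))
```

On the raw data PC1 captures nearly all the variance simply because tempo has the largest numbers; after standardisation the cumulative-variance curve tells the honest story, and the loadings in `components_` show which original features each component weights.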
Coming up in Session 3¶
→ Session 3 — From Insight to Decision
Heart Disease UCI dataset · Feature engineering · Classification · Model evaluation · Ethics