Session 2 — From Clean Data to Insight

Instructor: Stéphane Derrode, Centrale Lyon
Program: Centrale Digital Lab @ Ecole Centrale Lyon
Back to course index


📦 Download all session files — notebook and dataset

⬇ session2.zip

Contents: session2_spotify.ipynb · spotify_tracks.csv


Overview

Dataset Spotify Tracks (~3,000 tracks, 8 genres, 19 audio features)
Duration 3 hours
Format Jupyter notebook + paper quiz (15 min)
New libraries Seaborn, scikit-learn (StandardScaler, PCA)

In this session, you move from a clean dataset to meaningful insights. You will learn to choose the right visualisation for each question, measure relationships between variables, and reduce dimensionality to reveal hidden structure in the data.


Learning objectives

By the end of this session, you will be able to:

  • Describe the distribution of a variable and identify skewness, outliers, and modes
  • Compare distributions across groups using boxplots
  • Compute and interpret a correlation matrix
  • Read and interpret a pair plot
  • Standardise features and explain why it is necessary before PCA
  • Apply PCA, interpret the scree plot and the loading matrix

Before the session — what you need to do

1. Verify your environment

The packages from Session 1 must already be installed. Additionally:

pip install scikit-learn
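To confirm the install worked, you can try importing the package and printing its version:

```shell
python -c "import sklearn; print(sklearn.__version__)"
```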

2. Download the session files

⬇ session2.zip

3. Launch Jupyter and open the notebook

jupyter notebook session2_spotify.ipynb

4. Run the setup cell

Run the first code cell. You should see: All imports OK.
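If the setup cell fails, check that it imports everything the session needs. A minimal version (a hypothetical reconstruction; the notebook's actual cell may differ) looks like:

```python
# Minimal setup cell: libraries used across Blocks 1-4
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

print("All imports OK")
```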


Session content

The notebook is divided into 4 blocks:

Block Topic Key tools
1 Dataset discovery Inspection, cleaning, genre/decade distributions
2 Univariate analysis hist(), sns.boxplot(), reading distributions
3 Bivariate analysis & Correlations Scatter plots, .corr(), sns.heatmap(), sns.pairplot()
4 PCA StandardScaler, PCA, scree plot, loadings, 2D projection

Each block contains exercises (🏋️) with collapsible solutions.

💡 Block 4 includes the linear algebra behind PCA — the covariance matrix, eigendecomposition, and the projection formula. Read it carefully before the exercises.
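As a preview of Block 3's toolkit, the sketch below builds a small synthetic dataset with one deliberately correlated pair, computes the correlation matrix with `.corr()`, and draws it with `sns.heatmap()`. Column names are illustrative, not those of the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic stand-in for spotify_tracks.csv
df = pd.DataFrame({"energy": rng.uniform(0, 1, 300)})
df["loudness"] = -25 + 20 * df["energy"] + rng.normal(0, 2, 300)  # correlated
df["tempo"] = rng.normal(120, 25, 300)                            # independent

corr = df.corr()  # Pearson correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
print(corr.loc["energy", "loudness"].round(2))
```

On the heatmap you should see one strongly positive off-diagonal cell (energy/loudness) and near-zero cells for tempo.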


Background reading — PCA

PCA is built on three linear algebra concepts you should be comfortable with:

  • Matrix multiplication — the projection \(Z = X \cdot V_k\)
  • Covariance — \(\Sigma_{ij}\) measures how features \(i\) and \(j\) vary together
  • Eigenvectors/eigenvalues — directions and magnitudes of maximum variance

No need to derive them from scratch — the notebook walks through each step.
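The three concepts chain together: standardise, form the covariance matrix \(\Sigma\), take its eigendecomposition, keep the top-\(k\) eigenvectors as \(V_k\), and project with \(Z = X \cdot V_k\). The sketch below does this by hand on random data and cross-checks against scikit-learn (component signs may flip, hence the absolute-value comparison):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

Xs = StandardScaler().fit_transform(X)   # standardise first

# Manual PCA: covariance matrix -> eigendecomposition -> projection
Sigma = np.cov(Xs, rowvar=False)         # Sigma_ij
eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort descending by variance
V_k = eigvecs[:, order[:2]]              # top-2 directions
Z = Xs @ V_k                             # Z = X . V_k

# Cross-check against scikit-learn (columns agree up to sign)
Z_sk = PCA(n_components=2).fit_transform(Xs)
print(np.allclose(np.abs(Z), np.abs(Z_sk), atol=1e-6))
```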


Quiz

A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers:

  • True/False on distribution shapes and PCA properties
  • Multiple choice on visualisation choices, correlation interpretation, and standardisation
  • Short questions: reading a scree plot, interpreting PCA loadings, identifying a correlation error

💡 Tip: Be ready to interpret a loading table and explain what a principal component “represents” musically. The quiz does not ask you to reproduce formulas.


Key concepts to remember

  • Boxplot anatomy — box = IQR (Q1 to Q3), line = median, whiskers = furthest points within 1.5×IQR of the box, dots = outliers beyond the whiskers
  • Correlation ≠ causation — a strong r only means two variables move together linearly
  • Standardisation is mandatory before PCA — features with large ranges dominate otherwise
  • Loadings explain PCA — the weight of each original feature in each component
  • Scree plot + cumulative variance — use both to choose the number of components
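The last point can be made concrete: `explained_variance_ratio_` gives the scree plot heights, and its cumulative sum tells you how many components reach a given variance threshold. A minimal sketch on synthetic data (the 90% threshold is one common choice, not a rule):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))  # 8 correlated features
Xs = StandardScaler().fit_transform(X)

pca = PCA().fit(Xs)
ratios = pca.explained_variance_ratio_   # scree plot bar heights
cum = np.cumsum(ratios)                  # cumulative variance curve
k = int(np.searchsorted(cum, 0.90) + 1)  # smallest k reaching 90% variance
print(k, cum.round(2))
```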

Coming up in Session 3

Session 3 — From Insight to Decision

Heart Disease UCI dataset · Feature engineering · Classification · Model evaluation · Ethics