Data Analysis, M1

Module supervisor

This page contains all resources (notebooks, data) for the Data Analysis module, taught as part of the M1 programme at Centrale Digital Lab @ Ecole Centrale Lyon.


General information

Programme: Centrale Digital Lab @ Ecole Centrale Lyon
Level: M1
Total duration: 12 hours (4 sessions × 3 hours)
Language: English
Format: Interactive — Jupyter notebooks + closed-book paper quiz at the end of each session


Prerequisites

Prerequisite         Expected level
Python programming   Comfortable with functions, loops, lists, and basic OOP
NumPy                Basic array manipulation
Statistics           Mean, variance, correlation (no regression or probability theory required)
Pandas               Not required (introduced in Session 1)

Organisation

The module follows the full lifecycle of a data analysis project, from raw data to actionable decisions:

  • Session 1: From Raw Data to Clean Data · Titanic dataset
  • Session 2: From Clean Data to Insight · Spotify Tracks dataset
  • Session 3: From Insight to Decision · Heart Disease UCI dataset
  • Session 4: Beyond Supervised Learning · Spotify + Heart Disease

Each session is structured as follows:

  • Instructor-led introduction (10–15 min) — context, objectives, concepts
  • Guided notebook — blocks of explanation + code + exercises (🏋️)
  • Paper quiz (15 min, closed book) — at the end of each session

Schedule

Session  Title                        Dataset                  Duration
1        From Raw Data to Clean Data  Titanic                  3h
2        From Clean Data to Insight   Spotify Tracks           3h
3        From Insight to Decision     Heart Disease UCI        3h
4        Beyond Supervised Learning   Spotify + Heart Disease  3h

Sessions

Session 1. From Raw Data to Clean Data

Dataset: Titanic passenger records (891 rows, 12 features)

Key topics: Pandas by practice · Missing value detection and imputation · Categorical encoding · DataFrame merging · Aggregation with groupby
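
As a rough preview of these topics, the sketch below runs each step on a tiny hand-made table. The values are invented for illustration, the column names only mimic those of the Titanic dataset, and the `class_labels` lookup table is hypothetical; the session notebook may proceed differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; values are made up for illustration.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass": [3, 1, 3, 1, 2, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Missing value detection: count NaNs per column.
missing_counts = passengers.isna().sum()

# Imputation: fill missing ages with the median of the observed ages.
median_age = passengers["Age"].median()
passengers["Age"] = passengers["Age"].fillna(median_age)

# Categorical encoding: map the two Sex labels to integers.
passengers["Sex_code"] = passengers["Sex"].map({"male": 0, "female": 1})

# Merging: attach a (hypothetical) lookup table of class names.
class_labels = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "ClassName": ["First", "Second", "Third"],
})
merged = passengers.merge(class_labels, on="Pclass", how="left")

# Aggregation: survival rate per class.
survival_by_class = merged.groupby("ClassName")["Survived"].mean()
print(survival_by_class)
```

Median imputation is only one possible choice here; whether it is appropriate depends on why the values are missing.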


Session 2. From Clean Data to Insight

Dataset: Spotify Tracks (~3 000 tracks, 8 genres, 19 audio features)

Key topics: Univariate and bivariate analysis · Histograms and boxplots · Correlation matrix · Pair plot · PCA with linear algebra
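
To give an idea of what "PCA with linear algebra" can look like, here is a minimal NumPy sketch based on the eigendecomposition of the covariance matrix. The function name `pca` and the synthetic data are assumptions made for this example, not the notebook's code, and the features are centred but not standardised.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Centre each feature so the covariance is computed around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ components, explained_ratio

# Synthetic data: two strongly correlated features plus one small noise feature.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2 * base + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

projected, ratio = pca(X, n_components=2)
print(projected.shape, ratio)
```

Because the first two features are nearly collinear, most of the variance should land on the first component.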


Session 3. From Insight to Decision

Dataset: Heart Disease UCI (~920 patients, 13 clinical features)

Key topics: Feature engineering · Targeted EDA · Logistic Regression · Random Forest · Confusion matrix · Precision / Recall / F1 · ROC-AUC · Ethics of model errors
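
The evaluation metrics listed above all derive from the four cells of the confusion matrix. The sketch below computes them in plain Python on invented labels (1 = disease, 0 = no disease); the helper names are hypothetical and stand in for whatever library functions the session actually uses.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented labels for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(counts, precision, recall, f1)
```

In this toy case the single false negative is a patient with the disease whom the model misses, which is the kind of error whose cost the ethics discussion weighs against false alarms.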


Session 4. Beyond Supervised Learning

Datasets: Spotify Tracks (K-Means) · Heart Disease UCI (Naive Bayes, MLP)

Key topics: K-Means clustering · Elbow method · Silhouette score · Naive Bayes and Bayes’ theorem · Multi-Layer Perceptron · Model comparison · Introduction to Deep Learning
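
As an illustration of K-Means and the quantity the elbow method inspects, the following NumPy sketch implements the basic assign-and-update loop (Lloyd's algorithm) and reports the inertia for several values of k. The `kmeans` function, its parameters, and the synthetic blobs are assumptions for this example; the session may rely on a library implementation instead.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids on k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment so labels and inertia match the returned centroids.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

# Elbow method: inspect how inertia changes as k grows.
inertias = {k: kmeans(X, k)[2] for k in (1, 2, 3, 4)}
labels, centroids, _ = kmeans(X, 3)
print(inertias)
```

With a single random initialisation the algorithm can stop in a local minimum, so in practice several restarts are usually compared before reading the elbow.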
