
Session 3 — From Insight to Decision

Instructor: Stéphane Derrode, Centrale Lyon
Formation: Centrale Digital Lab @ Ecole Centrale Lyon


📦 Download all session files — notebook and dataset

⬇ session3.zip

Contents: session3_heartdisease.ipynb · heart_disease.csv


Overview

Dataset: Heart Disease UCI (~920 patients, 13 clinical features)
Duration: 3 hours
Format: Jupyter notebook + paper quiz (15 min)
Context: Medical — high-stakes binary classification

This session introduces supervised classification in a real medical context. You will engineer features, train two classifiers, evaluate them rigorously, and reflect on the ethical implications of model errors in healthcare.


Learning objectives

By the end of this session, you will be able to:

  • Create new features from existing variables using domain knowledge
  • Explore data through the lens of the target variable (class-conditional EDA)
  • Split data into train/test sets correctly, avoiding data leakage
  • Train a Logistic Regression and a Random Forest classifier
  • Read and interpret a confusion matrix, classification report, and ROC curve
  • Adjust the decision threshold and understand the precision/recall trade-off
  • Discuss the ethical implications of false negatives and false positives in medicine

Before the session — what you need to do

1. Verify your environment

All packages from Sessions 1 and 2 must be installed. No new installation required.

2. Download the session files

⬇ session3.zip

3. Launch Jupyter and open the notebook

jupyter notebook session3_heartdisease.ipynb

4. Read the feature table in Block 1

The notebook opens with a table of all 13 clinical variables.
Take 2 minutes to read it — understanding what each feature represents
will help you interpret the model results correctly.


Session content

The notebook is divided into 5 blocks:

Block Topic Key tools
1 Dataset discovery & Feature engineering pd.cut, binary flags, ratio features
2 Targeted EDA Overlaid histograms, grouped bar charts, correlation with target
3 Classification train_test_split, StandardScaler, LogisticRegression, RandomForestClassifier
4 Model evaluation confusion_matrix, classification_report, roc_curve, threshold analysis
5 Ethics & limits Discussion — no code
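The Block 1 techniques listed above (pd.cut, binary flags, ratio features) can be sketched as follows. The column names and clinical thresholds here are illustrative assumptions, not necessarily those used in heart_disease.csv:

```python
import pandas as pd

# Illustrative data only; the real columns live in heart_disease.csv
df = pd.DataFrame({
    "age": [29, 45, 61, 77],
    "chol": [204, 250, 310, 180],       # serum cholesterol (mg/dl), assumed name
    "trestbps": [120, 140, 160, 130],   # resting blood pressure (mm Hg), assumed name
})

# pd.cut: discretize a continuous variable into labeled bins
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 60, 120],
                         labels=["young", "middle", "senior"])

# Binary flag: encode a clinical threshold as 0/1
df["high_chol"] = (df["chol"] > 240).astype(int)

# Ratio feature: combine two variables into one quantity
df["chol_per_bp"] = df["chol"] / df["trestbps"]

print(df)
```

The notebook applies the same three patterns to the actual feature table described in Block 1.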

⚠️ Block 3 introduces an important concept: data leakage.
The StandardScaler must be fitted on the training set only, then applied to the test set.
This is explained in detail — make sure you understand it before the quiz.
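The leakage-free scaling pattern described above looks like this in code (random data stands in for the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the TRAIN set only
X_test_scaled = scaler.transform(X_test)        # reuse the train mean/std
```

Calling `fit` (or `fit_transform`) on the test set would let test statistics leak into preprocessing, which is exactly the mistake the quiz targets.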


Key formulas to know

Logistic Regression — sigmoid: \(P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}\)
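A minimal numeric sketch of the sigmoid; the weights and bias below are made up for illustration, not taken from any trained model:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters only
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 0.4])

p = sigmoid(w @ x + b)  # P(y=1 | x)
print(round(p, 3))
```

Note the two fixed points worth remembering: a score of 0 maps to probability 0.5 (the default decision threshold), and large positive/negative scores saturate toward 1 and 0.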

Confusion matrix derived metrics:

Metric Formula Medical meaning
Precision \(TP / (TP + FP)\) Of all flagged as sick, how many truly are?
Recall \(TP / (TP + FN)\) Of all truly sick, how many did we catch?
F1 \(2 \cdot P \cdot R / (P + R)\) Balance between the two

AUC: area under the ROC curve — 1.0 = perfect, 0.5 = random.
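The formulas in the table above can be checked on a toy example. The labels here are invented for illustration (1 = sick, 0 = healthy); note that scikit-learn's `confusion_matrix` returns rows in the order TN, FP, FN, TP when raveled:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (1 = sick, 0 = healthy)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of all flagged as sick, how many truly are
recall = tp / (tp + fn)     # of all truly sick, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn, precision, recall, f1)
```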


Quiz

A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers:

  • True/False on data leakage, AUC interpretation, recall definition, and threshold effects
  • Multiple choice: computing recall from a confusion matrix, identifying the most dangerous error type, Random Forest vs Logistic Regression
  • Short questions: data leakage scenario, threshold trade-off analysis, sex feature bias debate

💡 Tip: Be ready to compute Recall and Precision from a small confusion matrix by hand. Practice with: TP=40, FP=10, FN=8, TN=42.
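You can verify your hand computation on the practice numbers with a few lines of arithmetic:

```python
# Quiz practice numbers from the tip above
tp, fp, fn, tn = 40, 10, 8, 42

precision = tp / (tp + fp)   # 40 / 50
recall = tp / (tp + fn)      # 40 / 48
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.833
```

On the quiz itself you will have no devices, so make sure you can reproduce these divisions by hand.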


Key concepts to remember

  • Accuracy alone is misleading — always inspect the full confusion matrix
  • Recall is the priority in medical screening — a missed diagnosis is costlier than a false alarm
  • The threshold is a design choice — it should reflect the real-world cost of each error
  • Data leakage — fit the scaler on train only; never let test data influence preprocessing
  • A model is a decision support tool — the final call belongs to the clinician
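The threshold-as-design-choice idea can be demonstrated end to end. This sketch uses synthetic data as a stand-in for the heart-disease features; the 0.3 screening threshold is an arbitrary illustrative value, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data, for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(y=1 | x) per patient

# Default 0.5 threshold vs a lower, recall-favouring screening threshold
pred_default = (proba >= 0.5).astype(int)
pred_screening = (proba >= 0.3).astype(int)

n_sick = (y_test == 1).sum()
recall_default = ((pred_default == 1) & (y_test == 1)).sum() / n_sick
recall_screening = ((pred_screening == 1) & (y_test == 1)).sum() / n_sick
```

Lowering the threshold can only add predicted positives, so recall never decreases; the price is extra false positives, which is often the right trade in medical screening.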

Coming up in Session 4

Session 4 — Beyond Supervised Learning

K-Means · Naive Bayes · Neural Networks · Model comparison · Introduction to Deep Learning