
Session 3 — From Insight to Decision

Instructor: Stéphane Derrode, Centrale Lyon
Formation: Centrale Digital Lab @ Ecole Centrale Lyon


📦 Download all session files — notebook and dataset

⬇ session3.zip

Contents: session3_heartdisease.ipynb · heart_disease.csv


Overview

Dataset: Heart Disease UCI (~920 patients, 13 clinical features)
Duration: 3 hours
Format: Jupyter notebook + paper quiz (15 min)
Context: Medical — high-stakes binary classification

This session introduces supervised classification in a real medical context. You will engineer features, train two classifiers, evaluate them rigorously, and reflect on the ethical implications of model errors in healthcare.


Learning objectives

By the end of this session, you will be able to:

  • Create new features from existing variables using domain knowledge
  • Explore data through the lens of the target variable (class-conditional EDA)
  • Split data into train/test sets correctly, avoiding data leakage
  • Train a Logistic Regression and a Random Forest classifier
  • Read and interpret a confusion matrix, classification report, and ROC curve
  • Adjust the decision threshold and understand the precision/recall trade-off
  • Discuss the ethical implications of false negatives and false positives in medicine

Before the session — what you need to do

1. Verify your environment

All packages from Sessions 1 and 2 must be installed. No new installation required.

2. Download the session files

⬇ session3.zip

3. Launch Jupyter and open the notebook

jupyter notebook session3_heartdisease.ipynb

4. Read the feature table in Block 1

The notebook opens with a table of all 13 clinical variables.
Take 2 minutes to read it — understanding what each feature represents
will help you interpret the model results correctly.


Session content

The notebook is divided into 5 blocks:

Block Topic Key tools
1 Dataset discovery & Feature engineering pd.cut, binary flags, ratio features
2 Targeted EDA Overlaid histograms, grouped bar charts, correlation with target
3 Classification train_test_split, StandardScaler, LogisticRegression, RandomForestClassifier
4 Model evaluation confusion_matrix, classification_report, roc_curve, threshold analysis
5 Ethics & limits Discussion — no code
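The Block 1 techniques listed above (pd.cut, binary flags, ratio features) can be sketched as follows. The column names and clinical thresholds here are illustrative assumptions, not necessarily those used in heart_disease.csv:

```python
import pandas as pd

# Illustrative data only; the real columns live in heart_disease.csv
df = pd.DataFrame({
    "age": [29, 45, 61, 77],
    "chol": [204, 250, 310, 180],       # serum cholesterol (mg/dl), assumed name
    "trestbps": [120, 140, 160, 130],   # resting blood pressure (mm Hg), assumed name
})

# pd.cut: discretize a continuous variable into labeled bins
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 60, 120],
                         labels=["young", "middle", "senior"])

# Binary flag: encode a clinical threshold as 0/1
df["high_chol"] = (df["chol"] > 240).astype(int)

# Ratio feature: combine two variables into one quantity
df["chol_per_bp"] = df["chol"] / df["trestbps"]

print(df)
```

The notebook applies the same three patterns to the actual feature table described in Block 1.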

⚠️ Block 3 introduces an important concept: data leakage.
The StandardScaler must be fitted on the training set only, then applied to the test set.
This is explained in detail — make sure you understand it before the quiz.
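The leakage-free scaling pattern described above looks like this in code (random data stands in for the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the TRAIN set only
X_test_scaled = scaler.transform(X_test)        # reuse the train mean/std
```

Calling `fit` (or `fit_transform`) on the test set would let test statistics leak into preprocessing, which is exactly the mistake the quiz targets.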


Key formulas to know

Logistic Regression — sigmoid: \(P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}\)
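A minimal numeric sketch of the sigmoid; the weights and bias below are made up for illustration, not taken from any trained model:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters only
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 0.4])

p = sigmoid(w @ x + b)  # P(y=1 | x)
print(round(p, 3))
```

Note the two fixed points worth remembering: a score of 0 maps to probability 0.5 (the default decision threshold), and large positive/negative scores saturate toward 1 and 0.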

Confusion matrix derived metrics:

Metric Formula Medical meaning
Precision \(TP / (TP + FP)\) Of all flagged as sick, how many truly are?
Recall \(TP / (TP + FN)\) Of all truly sick, how many did we catch?
F1 \(2 \cdot P \cdot R / (P + R)\) Balance between the two

AUC: area under the ROC curve — 1.0 = perfect, 0.5 = random.
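The formulas in the table above can be checked on a toy example. The labels here are invented for illustration (1 = sick, 0 = healthy); note that scikit-learn's `confusion_matrix` returns rows in the order TN, FP, FN, TP when raveled:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (1 = sick, 0 = healthy)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of all flagged as sick, how many truly are
recall = tp / (tp + fn)     # of all truly sick, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn, precision, recall, f1)
```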


Quiz

A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers:

  • True/False on data leakage, AUC interpretation, recall definition, and threshold effects
  • Multiple choice: computing recall from a confusion matrix, identifying the most dangerous error type, Random Forest vs Logistic Regression
  • Short questions: data leakage scenario, threshold trade-off analysis, sex feature bias debate

💡 Tip: Be ready to compute Recall and Precision from a small confusion matrix by hand. Practice with: TP=40, FP=10, FN=8, TN=42.
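You can verify your hand computation on the practice numbers with a few lines of arithmetic:

```python
# Quiz practice numbers from the tip above
tp, fp, fn, tn = 40, 10, 8, 42

precision = tp / (tp + fp)   # 40 / 50
recall = tp / (tp + fn)      # 40 / 48
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.833
```

On the quiz itself you will have no devices, so make sure you can reproduce these divisions by hand.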


Key concepts to remember

  • Accuracy alone is misleading — always inspect the full confusion matrix
  • Recall is the priority in medical screening — a missed diagnosis is costlier than a false alarm
  • The threshold is a design choice — it should reflect the real-world cost of each error
  • Data leakage — fit the scaler on train only; never let test data influence preprocessing
  • A model is a decision support tool — the final call belongs to the clinician
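The threshold-as-design-choice idea can be demonstrated end to end. This sketch uses synthetic data as a stand-in for the heart-disease features; the 0.3 screening threshold is an arbitrary illustrative value, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data, for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(y=1 | x) per patient

# Default 0.5 threshold vs a lower, recall-favouring screening threshold
pred_default = (proba >= 0.5).astype(int)
pred_screening = (proba >= 0.3).astype(int)

n_sick = (y_test == 1).sum()
recall_default = ((pred_default == 1) & (y_test == 1)).sum() / n_sick
recall_screening = ((pred_screening == 1) & (y_test == 1)).sum() / n_sick
```

Lowering the threshold can only add predicted positives, so recall never decreases; the price is extra false positives, which is often the right trade in medical screening.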

Coming up in Session 4

Session 4 — Beyond Supervised Learning

K-Means · Naive Bayes · Neural Networks · Model comparison · Introduction to Deep Learning