Session 3 — From Insight to Decision
Instructor: Stéphane Derrode, Centrale Lyon
Formation: Centrale Digital Lab @ Ecole Centrale Lyon
← Back to course index
📦 Download all session files — notebook and dataset
Contents:
session3_heartdisease.ipynb · heart_disease.csv
Overview¶
| Dataset | Heart Disease UCI (~920 patients, 13 clinical features) |
| Duration | 3 hours |
| Format | Jupyter notebook + paper quiz (15 min) |
| Context | Medical — high-stakes binary classification |
This session introduces supervised classification in a real medical context. You will engineer features, train two classifiers, evaluate them rigorously, and reflect on the ethical implications of model errors in healthcare.
Learning objectives¶
By the end of this session, you will be able to:
- Create new features from existing variables using domain knowledge
- Explore data through the lens of the target variable (class-conditional EDA)
- Split data into train/test sets correctly, avoiding data leakage
- Train a Logistic Regression and a Random Forest classifier
- Read and interpret a confusion matrix, classification report, and ROC curve
- Adjust the decision threshold and understand the precision/recall trade-off
- Discuss the ethical implications of false negatives and false positives in medicine
Before the session — what you need to do¶
1. Verify your environment
All packages from Sessions 1 and 2 must be installed. No new installation required.
2. Download the session files
⬇ session3.zip (contains heart_disease.csv)
3. Launch Jupyter and open the notebook
4. Read the feature table in Block 1
The notebook opens with a table of all 13 clinical variables.
Take 2 minutes to read it — understanding what each feature represents
will help you interpret the model results correctly.
Session content¶
The notebook is divided into 5 blocks:
| Block | Topic | Key tools |
|---|---|---|
| 1 | Dataset discovery & Feature engineering | pd.cut, binary flags, ratio features |
| 2 | Targeted EDA | Overlaid histograms, grouped bar charts, correlation with target |
| 3 | Classification | train_test_split, StandardScaler, LogisticRegression, RandomForestClassifier |
| 4 | Model evaluation | confusion_matrix, classification_report, roc_curve, threshold analysis |
| 5 | Ethics & limits | Discussion — no code |
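The Block 1 tools (pd.cut, binary flags, ratio features) can be sketched on a tiny illustrative frame. The column names age and chol match the UCI dataset; the 240 mg/dL cholesterol cut-off and the age bins are illustrative choices, not the notebook's exact values:

```python
import pandas as pd

# Tiny stand-in frame — the real notebook loads heart_disease.csv
df = pd.DataFrame({"age": [34, 52, 61, 70], "chol": [180, 240, 310, 200]})

# pd.cut: bin a continuous variable into labelled categories
df["age_group"] = pd.cut(df["age"], bins=[0, 45, 60, 120],
                         labels=["young", "middle", "senior"])

# Binary flag from a clinical threshold (240 mg/dL here, illustrative)
df["high_chol"] = (df["chol"] > 240).astype(int)

# Ratio feature combining two existing variables
df["chol_per_age"] = df["chol"] / df["age"]
print(df)
```

The same three patterns (binning, thresholding, ratios) cover most of the feature engineering in Block 1.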
⚠️ Block 3 introduces an important concept: data leakage.
The StandardScaler must be fitted on the training set only, then applied to the test set.
This is explained in detail — make sure you understand it before the quiz.
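The leakage-safe order of operations can be sketched as follows. The data here is synthetic (a random matrix standing in for the clinical features); only the fit-on-train, transform-both pattern matters:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 13 clinical features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Correct order: fit the scaler on the TRAINING set only...
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
# ...then apply the same (already fitted) transform to the test set.
# Calling fit() on X_test would leak test statistics into preprocessing.
X_test_s = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train_s, y_train)
print(f"test accuracy: {clf.score(X_test_s, y_test):.2f}")
```

Fitting the scaler on the full dataset before splitting would let the test set's mean and variance influence the training features — exactly the leakage the quiz asks about.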
Key formulas to know¶
Logistic Regression — sigmoid: \(P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}\)
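A quick numeric check of the sigmoid helps build intuition: a score of zero (the decision boundary) maps to probability 0.5, and large positive scores saturate toward 1:

```python
import numpy as np

def sigmoid(z):
    """P(y=1 | x) for score z = w·x + b."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5 — exactly on the decision boundary
print(sigmoid(4.0))  # ≈ 0.982 — confidently positive
```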
Confusion matrix derived metrics:
| Metric | Formula | Medical meaning |
|---|---|---|
| Precision | \(TP / (TP + FP)\) | Of all flagged as sick, how many truly are? |
| Recall | \(TP / (TP + FN)\) | Of all truly sick, how many did we catch? |
| F1 | \(2 \cdot P \cdot R / (P + R)\) | Balance between the two |
AUC: area under the ROC curve — 1.0 = perfect, 0.5 = random.
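The metrics above, and the effect of moving the decision threshold, can be sketched on a hand-made set of predicted probabilities (illustrative values, not from the real model). Lowering the threshold flags more patients as positive, which raises recall at the cost of more false positives:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.35, 0.9, 0.1, 0.6, 0.55])

for thr in (0.5, 0.3):  # default threshold, then a lowered one
    y_pred = (y_prob >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = tp / (tp + fn)
    print(f"threshold={thr}: TP={tp} FN={fn} FP={fp} recall={recall:.2f}")

# AUC is threshold-free: it summarizes ranking quality over all thresholds
print("AUC:", roc_auc_score(y_true, y_prob))
```

On these numbers, dropping the threshold from 0.5 to 0.3 catches the sick patient scored at 0.35 (recall rises from 0.75 to 1.0) but adds a false alarm — the medical trade-off Block 4 asks you to reason about.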
Quiz¶
A 15-minute paper quiz (closed book, no devices) will be held at the end of the session.
It covers:
- True/False on data leakage, AUC interpretation, recall definition, and threshold effects
- Multiple choice: computing recall from a confusion matrix, identifying the most dangerous error type, Random Forest vs Logistic Regression
- Short questions: data leakage scenario, threshold trade-off analysis, sex feature bias debate
💡 Tip: Be ready to compute Recall and Precision from a small confusion matrix by hand. Practice with: TP=40, FP=10, FN=8, TN=42.
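You can check your hand computation on the tip's numbers with a few lines (the formulas are those from the table above):

```python
# Counts from the practice tip
TP, FP, FN, TN = 40, 10, 8, 42

precision = TP / (TP + FP)   # 40 / 50
recall    = TP / (TP + FN)   # 40 / 48
f1        = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.833
print(f"f1        = {f1:.3f}")         # 0.816
```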
Key concepts to remember¶
- Accuracy alone is misleading — always inspect the full confusion matrix
- Recall is the priority in medical screening — a missed diagnosis is costlier than a false alarm
- The threshold is a design choice — it should reflect the real-world cost of each error
- Data leakage — fit the scaler on train only; never let test data influence preprocessing
- A model is a decision support tool — the final call belongs to the clinician
Coming up in Session 4¶
→ Session 4 — Beyond Supervised Learning
K-Means · Naive Bayes · Neural Networks · Model comparison · Introduction to Deep Learning