Executive Summary
Baseline (All Columns)
Best Drop
Best Keep
Experiments
Model Comparison: All Columns + 14 Drop-One Experiments
Classification accuracy for the baseline (all columns) and for each run where one feature is dropped. Survival Months is excluded in all runs.
Accuracy When Dropping Each Feature
• Drop-one-column: Train the same model once per feature, each time excluding that feature. Compare accuracy, ROC AUC, etc., to the baseline (all columns).
• Why it matters: If dropping a feature improves performance, that feature may add noise or cause overfitting. If dropping it hurts performance substantially, that feature is likely important.
• Baseline: "All columns" uses every feature except Survival Months (to avoid data leakage). Each row in the table is one experiment where we dropped one additional feature.
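The drop-one procedure above can be sketched in a few lines. This is a minimal illustration using scikit-learn on synthetic stand-in data; the feature names, model settings, and split are hypothetical, not the dashboard's actual pipeline.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
feature_names = ["Tumor Size", "Grade", "Progesterone Status", "Marital Status"]
X = rng.normal(size=(500, len(feature_names)))          # stand-in features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

def accuracy_of(X, y):
    """Train one tree on a fixed split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

baseline = accuracy_of(X, y)                            # "All Columns" run
delta = {}
for i, name in enumerate(feature_names):
    # Retrain with one column removed and record the change vs baseline.
    delta[name] = accuracy_of(np.delete(X, i, axis=1), y) - baseline
# delta[name] < 0: accuracy fell when dropped -> candidate "best keep"
# delta[name] > 0: accuracy rose when dropped -> candidate "best drop"
```

The same loop generalizes to ROC AUC or any other metric by swapping the scoring function.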
Why We Use Progesterone vs Estrogen (ER vs PR)
Estrogen Receptor (ER) and Progesterone Receptor (PR) status are key biomarkers in breast cancer. This tab explains how they relate to outcomes in this dataset and why the model may emphasize one over the other.
ER+ (Estrogen Positive)
ER− (Estrogen Negative)
PR+ (Progesterone Positive)
PR− (Progesterone Negative)
ER/PR Combined: Outcomes in This Dataset
P = Positive, N = Negative. ER−/PR− has the highest risk; ER+/PR+ the best baseline prognosis.
Why the Tree Kept Progesterone but Not Estrogen (In This Run)
- Estrogen status is highly imbalanced (~93% positive), so it offers little splitting power in a shallow tree.
- Progesterone status has more variation (~83% positive / 17% negative), so it can separate risk groups better.
- ER and PR are correlated; when two columns overlap, tree models often keep the one with cleaner incremental signal and treat the other as redundant.
- This ablation result is model-specific (split seed, depth, preprocessing). It does not mean estrogen is biologically unimportant.
Right interpretation: Keep both for clinical completeness. For this tree configuration, PR contributed more unique predictive signal than ER. For robust feature retention, run repeated cross-validation ablation (not a single split).
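The repeated cross-validation ablation recommended above can be sketched as follows. This is an illustrative setup on synthetic data, assuming scikit-learn; column 4 merely plays the role of the ER column.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                     # column 4 stands in for "ER"
y = (X[:, 0] + rng.normal(scale=0.7, size=400) > 0).astype(int)

# 5-fold CV repeated 10 times = 50 scores per configuration.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)

full = cross_val_score(clf, X, y, cv=cv)                           # all features
ablated = cross_val_score(clf, np.delete(X, 4, axis=1), y, cv=cv)  # drop "ER"
delta = full.mean() - ablated.mean()
# If |delta| is small relative to full.std(), a single-split "drop ER"
# result should not be treated as evidence the feature is useless.
```

Comparing the mean delta against the fold-to-fold spread is what distinguishes a robust ablation finding from a split-seed artifact.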
Five areas where progesterone receptor (PR) status shapes biology, risk, and treatment. Use the section titles to jump to what you need.
1 Mechanism of Immune System Evasion
- “Cloaking” effect: Progesterone suppresses “danger signals” on the surface of breast tumor cells, allowing them to bypass immune surveillance.
- STAT1 pathway: PR downregulates STAT1-mediated interferon-alpha signaling, preventing the innate immune system from recognizing and destroying developing tumors.
- Immune-cold tumors: High progesterone activity is a primary reason breast tumors are often “immune-cold,” with weak T-cell response and resistance to standard immunotherapy.
2 PR as a Clinical Biomarker (Subtyping & Risk)
- Luminal A vs B: High PR (≥20%) is the primary differentiator for Luminal A (better prognosis) vs Luminal B (more aggressive, higher proliferation).
- ER+/PR−: Estrogen-positive but progesterone-negative tumors are associated with a non-functional ER pathway, higher recurrence, and worse overall survival.
- Isoform ratio (PR-A:PR-B): The balance between these receptor isoforms affects treatment outcome; high PR-A is linked to tamoxifen resistance and disease progression.
3 Therapeutic Synergy (PIONEER Trial)
- Genomic reprogramming: When bound to a ligand (e.g. Megestrol), PR physically interacts with ER, pulling it away from DNA sites that drive cancer growth.
- Antiproliferative synergy: Combining progestogens with Aromatase Inhibitors (AIs) can yield ~80% reduction in tumor proliferation (Ki67, AURKA), significantly higher than AI alone.
- Dose efficiency: Low-dose megestrol (40 mg) can be as effective as high-dose (160 mg) at suppressing growth while reducing side effects (e.g. hot flashes).
4 Drivers of Resistance & Poor Prognosis
- RANKL pathway: Progesterone increases RANKL expression; PR+ cells can signal PR− neighbors to divide, complicating local control.
- Resistance markers: FGFR1 amplification (intrinsic resistance to AIs); high TMB (≥9 mutations/Mb) with poor response to combined endocrine therapy.
- ER−/PR+ rarity: This rare subgroup (<2%) is clinically distinct, often younger, with worse outcomes than typical ER+/PR+.
5 Key Clinical Correlation Points
- Ki67: Post-treatment Ki67 ≤2.7% indicates complete cell-cycle arrest; >10% signals high recurrence risk.
- PR repression: Successful antiestrogen treatment often leads to loss of PR markers; if PR stays high during treatment, the tumor may be evading therapy.
Binary Classification (Alive vs. Dead)
All runs: Survival Months excluded. Baseline = All Columns (14 features); each other row = drop that one feature (13 features).
What “Best drop” and “Best keep” mean: Best drop = accuracy was highest when that feature was removed, so keeping it may add noise. Best keep = accuracy was lowest when that feature was removed, so the model relies on it and it is important to keep.
Baseline (All Columns)
Best Drop-One
Best Keep
Runs
Accuracy by Experiment
↓ Lower accuracy when dropped = more important to keep. Green = best to keep, red = best to drop.
ROC AUC by Experiment
📉 Predicting Exact Survival Duration (Months)
Drop-one ablation for regression: each row shows MAE, RMSE, and R² when one feature is excluded. Survival Months is not used as a feature (to avoid data leakage); the model predicts it from diagnosis-time features only.
Baseline MAE
Best R² (Drop)
Overall
Better Alternative
Model Performance Comparison
All approaches use diagnosis-time features only (Survival Months excluded). Higher is better. Regression (R² × 100) is on a different scale than accuracy.
MAE by Experiment (Lower is Better)
Δ = change vs baseline (All Columns). For MAE/RMSE, negative Δ is better; for R², positive Δ is better.
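For reference, the three regression metrics in the table can be computed directly from predictions. A pure-Python sketch with made-up numbers (not results from this dataset):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in months."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: like MAE but punishes large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """R-squared: 1 = perfect, 0 = no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy survival-months predictions (illustrative numbers only).
y_true = [10, 30, 50, 70]
y_pred = [12, 28, 55, 66]
# mae -> 3.25, rmse -> 3.5, r2 -> approximately 0.9755
```

Under the table's delta convention, a drop-one run with lower MAE/RMSE (negative Δ) or higher R² (positive Δ) than baseline beats the all-columns model.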
1. The Censored Data Problem
Many patients are still alive when the study ends. We only know they survived at least X months, not their total lifespan. Regression treats "still alive at 60 months" the same as "died at 60 months" - wrong.
2. Wrong Tool for the Job
Survival data needs Cox Proportional Hazards, Kaplan-Meier, or Survival Random Forests. Decision tree regression ignores censoring.
3. Extreme Variability
Survival months range from 0 to 100+ with substantial noise. A regression tree predicts the average of each leaf region - a poor fit for data this noisy and skewed.
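To make the censoring point concrete, here is a minimal Kaplan-Meier product-limit estimator in pure Python with toy numbers. A censored patient reduces the at-risk count without counting as a death, which is exactly the distinction plain regression cannot express.

```python
def kaplan_meier(durations, events):
    """Product-limit estimator. events: 1 = died, 0 = censored (still alive)."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = {}                     # event time -> survival probability after it
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve[t] = surv
        at_risk -= removed          # censored patients leave the risk set too
    return curve

# Two deaths (months 10, 30) and two patients censored alive at month 60.
curve = kaplan_meier([10, 30, 60, 60], [1, 1, 0, 0])
# curve is approximately {10: 0.75, 30: 0.5}; the censored pair never counts as deaths.
```

Production survival analysis would use a dedicated library (e.g. lifelines or scikit-survival) for Cox models and confidence intervals; this sketch only shows the censoring mechanics.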
Survival Horizon Analysis
Predicts whether the patient survives past 24 or 60 months. Drop-one ablation: each row shows performance when one feature is excluded.
Note on sample sizes: 24‑month horizon uses 3,976 patients (with 24mo+ follow‑up); 60‑month horizon uses 3,270 patients (with 60mo+ follow‑up). Fewer patients have 60‑month data.
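The horizon filtering described above (why the 24-month cohort has 3,976 patients but the 60-month cohort only 3,270) can be sketched as a labeling rule. The function and numbers below are illustrative, not the dashboard's actual preprocessing code.

```python
def horizon_labels(durations, events, horizon):
    """Keep only patients whose status at `horizon` months is actually known."""
    keep, labels = [], []
    for i, (months, died) in enumerate(zip(durations, events)):
        if died and months <= horizon:
            keep.append(i)
            labels.append(0)        # died on or before the horizon
        elif months >= horizon:
            keep.append(i)
            labels.append(1)        # followed up past the horizon, alive at it
        # else: censored before the horizon -> outcome unknown -> excluded
    return keep, labels

# Toy cohort: (follow-up months, died?) evaluated at a 24-month horizon.
keep, labels = horizon_labels([12, 40, 70, 20], [1, 0, 0, 0], horizon=24)
# keep == [0, 1, 2], labels == [0, 1, 1]; patient 3 (censored at 20) is excluded.
```

Raising the horizon excludes more censored patients, which is why the 60-month cohort is smaller.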
Best 24-Month Accuracy
Baseline 60-Month
Demographics Impact
Clinical Utility
Horizon Accuracy Visualization
Accuracy vs ROC AUC
All numbers below use only diagnosis-time features (Survival Months excluded).
Baseline ROC AUC
Best ROC AUC (Drop)
Metrics
Accuracy and ROC AUC by Experiment
Green = Accuracy went down when this feature was dropped → this feature is important to keep (best keep).
Red = Accuracy went up when this feature was dropped → this feature may add noise (consider dropping).
Full Experiments Table
Delta = change vs baseline (All Columns).
Green = Metric went down when this feature was dropped → feature is important to keep (best keep).
Red = Metric went up when this feature was dropped → feature may add noise (consider dropping).
All Experiments: Accuracy vs ROC AUC
Each point = one experiment. Blue = baseline (All Columns), Red = best to drop (accuracy went up when removed), Green = best to keep.
Key Insights
The "Best to Keep" and "Best to Drop" below are based on classification accuracy (Alive vs Dead). Other tabs use different metrics:
- Overview / Classification / Key Insights: Best drop = highest classification accuracy when removed → -
- Horizon tab: "Best 24-Month Accuracy" = which drop gives the highest 24‑month survival prediction accuracy → - (may differ from classification)
- Regression tab: "Best R² (Drop)" = which drop improves regression R² → - (yet another metric)
So if you see different features called "best" in different tabs, that's expected - each tab optimizes a different outcome.
🏆 Best to Keep
-: Important for Accuracy
When we removed this feature, accuracy dropped to -. So the model relies on it - we should keep it.
With This Feature (Baseline)
| Accuracy | - |
| ROC AUC | - |
📉 When We Dropped It
| Accuracy | - |
| Change | - |
Why This Feature Is Best to Keep (Easy to Understand)
1️⃣ The model uses it to separate patients
When this feature is in the model, it helps distinguish who is likely alive vs dead. Removing it takes away real information the tree was using to make better splits.
2️⃣ It carries signal, not noise
Accuracy dropped when we removed it - that means the feature was contributing to correct predictions. If it were just noise, dropping it would have left accuracy the same or improved it.
3️⃣ Clinically it makes sense
This aligns with clinical guidance: progesterone receptor (PR) is part of hormone receptor status (ER/PR), and hormone receptors are important for treatment decisions and prognosis in invasive breast cancer. The American Cancer Society notes these receptors are routinely tested, many breast cancers are receptor-positive, and receptor status meaningfully affects management. That supports why dropping progesterone hurts model quality - this feature carries real clinical signal, not noise.
Source: American Cancer Society - Breast Cancer Hormone Receptor Status
4️⃣ Bottom line
Don't drop this feature if you want the best accuracy. The model is better when this information is included.
📌 Recommendation
Keep - in your model. Accuracy drops when it's removed, so it's important for predicting survival status.
🔴 Best to Drop
-: Accuracy Improves When We Remove It
When we removed this feature, accuracy went up to -. So the model may be better off without it.
With This Feature (Baseline)
| Accuracy | - |
When We Dropped It
| Accuracy | - |
| Change | - |
Why This Feature Is Best to Drop (Easy to Understand)
1️⃣ Accuracy went up when we removed it
The model got more accurate without this feature. That suggests the feature wasn't helping - or was hurting - predictions, e.g. by adding noise or overfitting to this dataset.
2️⃣ SENOMAC Trial
The SENOMAC trial found that skipping more extensive axillary lymph node surgery in patients with one or two positive sentinel nodes led to the same 5-year recurrence-free survival rates (~90%). This suggests that counting how many nodes were examined does not always add the decision-making signal you would expect.
Source: Breastcancer.org - Some People With Early-Stage Breast Cancer Don't Need Axillary Lymph Node Surgery
3️⃣ BOOG 2013-08 Phase III Trial (SABCS 2025)
A 2025 BOOG 2013-08 phase III trial (presented at SABCS 2025) found comparable recurrence and regional recurrence-free survival rates at 5 years whether or not a sentinel lymph node biopsy was performed at all - especially in patients over 50 with hormone receptor-positive, T1 grade 1-2 breast cancer. The absolute difference in recurrence rate was only 0.7%.
4️⃣ Bottom line
Consider building the model without this feature. In this ablation, accuracy was higher when we dropped it, so it may not be needed for predicting survival status.
📌 Recommendation
Consider dropping - from your model. Accuracy improved when we removed it, so it may add noise or redundancy rather than useful signal.
All models shown exclude "Survival Months" as a feature. Initial experiments from the V01 homework (Decision Tree 2026), which included Survival Months as a feature, showed 83.1% accuracy - suspiciously high! Analysis revealed this was data leakage: Survival Months is correlated with the outcome, making the model useless for real predictions.
The models here use only features available at diagnosis time, making them clinically valid. See the sections below for detailed evidence of why excluding survival months is essential.
Critical data leakage: Using "Survival Months" as a predictor creates a model that looks excellent in testing but is clinically useless.
- Circular logic: Survival Months is directly tied to the outcome. Dead patients have shorter survival by definition - the model is "cheating" by using the answer to predict the answer.
- Not available at diagnosis: When seeing a new patient, you do not know how many months they will survive. Any feature that encodes that future information must be excluded.
- It dominates other features: When included, Survival Months gets ~75% feature importance; real clinical features (tumor size, grade, stage) shrink to a few percent. The model then ignores actual medical indicators.
- This dashboard proves it: With Survival Months properly excluded, accuracy is ~70% and the model relies on diagnosis-time features only - the only setup valid for real-world use.
Rule: Never include variables that contain information from the future or that reveal the outcome you're predicting. Survival Months is the outcome's timeline - including it gives a model that fails in practice.
DATA LEAKAGE: What It Looks Like
This is what happens when you include "Survival Months" as a feature
WITH Survival Months (BAD)
| Accuracy | 83.1% |
| ROC AUC | 0.810 |
| Features Used | 15 |
Cannot be used in real world
WITHOUT Survival Months (GOOD)
| Accuracy | 69.9% |
| ROC AUC | 0.671 |
| Features Used | 14 |
Can be used in clinical practice
🔍 The Smoking Gun: Feature Importance
When "Survival Months" is included, it dominates all other features:
- "Survival Months" has ~0.75 (75%) importance - completely dominates!
- All clinical features (Tumor Size, Grade, Stage) are tiny - only ~0.05 (5%) each.
- The model ignores actual medical indicators and just uses survival duration.
- This is circular logic: using "how long they lived" to predict "whether they're alive".
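The dominance pattern is easy to reproduce on synthetic data: add one feature built from the label (a stand-in for Survival Months) and it swallows nearly all the importance. Numbers and settings here are illustrative, not the dashboard's actual run.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_clin = rng.normal(size=(600, 3))            # stand-ins for clinical features
y = (X_clin[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)

# A "Survival Months"-like leak: a feature derived from the outcome itself.
leak = y * 60 + rng.normal(scale=5.0, size=600)
X = np.column_stack([X_clin, leak])

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
imp = clf.feature_importances_
# imp[3] (the leaky column) takes almost all the importance; the three
# clinical stand-ins shrink toward zero - the "smoking gun" pattern above.
```

A quick sanity check on any new feature set is to inspect `feature_importances_` for a single column with importance near 1.0; it is often a leak.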
The Bottom Line
~83% accuracy is meaningless because you can't know "Survival Months" until after the patient has already survived that long! (V01 homework run with Survival Months included.)
It's like predicting tomorrow's weather by checking tomorrow's temperature - impossible in practice.
ROC AUC (Area Under the ROC Curve): How well the model separates classes; 1.0 = perfect, 0.5 = random.
Sensitivity (Recall Alive): Of all truly alive patients, how many we correctly predict as alive.
Specificity: Of all truly dead patients, how many we correctly predict as dead.
F1 (Alive): Balance between precision and recall for the "Alive" class.
Deltas show the change when we drop one feature compared to the baseline (all columns).
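The definitions above can be computed from scratch. A pure-Python sketch with toy labels ("Alive" coded as 1; the numbers are illustrative, not this dataset's results):

```python
def confusion(y_true, y_pred):
    """Raw confusion counts, with class 1 ('Alive') as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def roc_auc(y_true, scores):
    """Rank-based AUC: probability a random Alive outranks a random Dead."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
tp, tn, fp, fn = confusion(y_true, y_pred)
sensitivity = tp / (tp + fn)                  # recall for "Alive"
specificity = tn / (tn + fp)                  # recall for "Dead"
precision = tp / (tp + fp)
f1_alive = 2 * precision * sensitivity / (precision + sensitivity)
```

Because AUC is computed from the ranking of scores while accuracy depends on a 0.5 threshold, the two can move in opposite directions between experiments, as noted in the insights below.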
What the model predicts: Alive or Dead (survival status). “Best drop” = removing that feature gave the highest accuracy, so keeping it may add noise. “Best keep” = removing that feature gave the lowest accuracy, so the model relies on it and it is important to keep.
+0.029 Accuracy went up when we dropped that feature.
+0.018 ROC AUC improved.
-0.058 Accuracy dropped when we removed that feature.
Feature is likely important for the model.
What the data suggests
- Dropping Marital Status or Regional Node Examined improves accuracy - the model may rely less on these or they add noise in this setup.
- Dropping Progesterone Status or 6th Stage hurts accuracy the most - these features appear important for predicting survival.
- ROC AUC can sometimes increase when accuracy drops (the model ranks predicted probabilities better but makes more misclassifications at the decision threshold). Always look at both.