Breast cancer decision tree. Which features matter?
What these numbers mean for actual patient predictions
Drop-one ablation: each bar is one run with a single feature removed. Green = best to drop, Red = best to keep.
Best to keep: Progesterone Status (accuracy drops when it is removed). Best to drop: Regional Node Examined (accuracy improves when it is removed, so it may be adding noise).
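The drop-one procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the chart: `drop_one_ablation`, `toy_score`, and the contribution numbers are hypothetical stand-ins, and in a real run `score_fn` would train a decision tree on the kept features and return its test accuracy.

```python
from typing import Callable, Dict, Sequence


def drop_one_ablation(
    features: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """For each feature, return score(without it) minus score(all features)."""
    baseline = score_fn(list(features))
    deltas = {}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        deltas[dropped] = score_fn(kept) - baseline
    return deltas


# Toy stand-in for "train a tree and return test accuracy".
# These contributions are invented for illustration, not the study's numbers.
TOY_CONTRIBUTION = {
    "progesterone": 0.05,
    "node_positive": 0.03,
    "node_examined": -0.02,
}


def toy_score(kept: Sequence[str]) -> float:
    return 0.60 + sum(TOY_CONTRIBUTION[f] for f in kept)


deltas = drop_one_ablation(list(TOY_CONTRIBUTION), toy_score)
best_to_keep = min(deltas, key=deltas.get)  # largest accuracy drop when removed
best_to_drop = max(deltas, key=deltas.get)  # largest accuracy gain when removed
```

A negative delta means the model got worse without the feature (keep it); a positive delta means the model got better without it (candidate to drop).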
Split by Progesterone Status, the ablation study's most important feature. Hormone receptor status reveals tumor biology that node counts alone cannot.
Split by Node Positive ≤ 3. Within hormone-positive patients, how many lymph nodes tested positive further separates outcomes.
Split by Node Positive ≤ 3. For aggressive, hormone-negative tumors, lymph node spread becomes the critical survival factor.
Split by Node Positive ≤ 8. Without Progesterone, the model's best available split is the raw count of cancer-positive lymph nodes. Accuracy drops to 64.1%.
Split by Node Positive ≤ 3. Separates minimal spread from moderate, but misses the hormone context that Progesterone provided.
Split by Grade ≤ 2. Without hormone data, tumor differentiation grade becomes the fallback for separating outcomes in this high-risk group.
In this ablation, Progesterone Status is the single most important predictor for patient survival classification.
Regional Node Examined measures how many lymph nodes the surgeon chose to inspect, not how many had cancer. It reflects surgical procedure, not tumor biology.
Removing Regional Node Examined improves accuracy by 3.4%.
Progesterone status, on the other hand, reveals the tumor's biology and whether it will respond to hormone therapy, making it far more relevant for predicting outcomes.
Sources: Cleveland Clinic · UPMC
Two hormone receptors, one clear winner in the ablation study
Estrogen
93% positive
Progesterone
83% positive
Both receptors strongly predict survival, but Progesterone won the ablation.
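This is due to better data balance: 17% of patients are progesterone-negative versus only 7% estrogen-negative, so a progesterone split puts more patients on the minority side. We keep both for clinical completeness, but Progesterone's greater variation lets the model separate risk better.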
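One rough way to put a number on "data balance" is the Gini impurity of each receptor's positive/negative distribution, using the 93% and 83% figures above. This is an illustrative calculation, not something taken from the original analysis, and `gini_impurity` is a helper introduced here.

```python
def gini_impurity(p_positive: float) -> float:
    """Gini impurity of a binary variable: 2 * p * (1 - p), maximal at p = 0.5."""
    return 2.0 * p_positive * (1.0 - p_positive)


# Share of receptor-positive patients, as reported above.
receptors = {"Estrogen": 0.93, "Progesterone": 0.83}

for name, p in receptors.items():
    print(f"{name}: {1 - p:.0%} negative, Gini impurity {gini_impurity(p):.3f}")
```

Progesterone's impurity comes out at roughly 0.28 against roughly 0.13 for Estrogen, so its two groups are less lopsided and a split on it moves more patients.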
Negative status linked to higher death rates for both receptors
Death rate ranges from 12.3% (best prognosis) to 42.1% (worst prognosis)
Three fundamental problems with predicting exact survival months.
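Patients still alive at study end are censored: we know only that they survived at least X months, not their total lifespan. Ordinary regression treats "alive at 60 months" the same as "died at 60 months", which is wrong.
Survival data calls for methods built to handle censoring, such as Cox proportional hazards, Kaplan-Meier estimation, or survival forests. Decision tree regression ignores censoring.
Survival ranges from 0 to 100+ months with high noise. A regression tree predicts the average of each leaf region, which is far too coarse for this data.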
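To show what "handling censoring" means in practice, here is a minimal pure-Python sketch of the Kaplan-Meier estimator. The function name and the five-patient example are made up for illustration; a real analysis would normally use an established survival library rather than this.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], died: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct death time."""
    records = sorted(zip(durations, died))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(records) and records[i][0] == t:
            deaths += int(records[i][1])
            leaving += 1
            i += 1
        if deaths:
            # Only observed deaths lower the curve.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without counting as deaths.
        at_risk -= leaving
    return curve


# Invented example: months of follow-up and whether death was observed.
months = [10, 20, 20, 30, 60]
death_observed = [True, True, False, True, False]
curve = kaplan_meier(months, death_observed)
```

The patient alive at 60 months stays in the risk set the whole time and never pulls the curve down, which is the distinction plain regression cannot make.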
Regression fails (R² ≈ 3%). Horizon classification achieves 75–77% accuracy by asking "Will the patient survive past X years?"
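The horizon question can be turned into training labels as sketched below. How the original models treated patients who were still alive but followed for less than the horizon is not stated here; returning `None` so they can be excluded is an assumption of this sketch, as is the helper name `horizon_label`.

```python
from typing import Optional


def horizon_label(months: float, died: bool, horizon_months: float) -> Optional[int]:
    """1 = survived past the horizon, 0 = died by it, None = cannot tell."""
    if died:
        # Death was observed, so the outcome at the horizon is known either way.
        return int(months > horizon_months)
    if months >= horizon_months:
        return 1  # still alive at or beyond the horizon
    return None  # alive, but follow-up ended before the horizon (censored)


# Invented patients: (months of follow-up, death observed)
patients = [(72, True), (40, True), (60, False), (30, False)]
labels = [horizon_label(m, d, horizon_months=60) for m, d in patients]
```

With a 60-month horizon the four invented patients are labelled survived, died, survived, and unknown, so only the first three would enter a 5-year classifier under this assumption.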
Data leakage happens when a feature contains information that wouldn't be available at prediction time.
Dead patients have shorter survival by definition. Using survival months to predict alive/dead is like using the answer to predict the answer.
When seeing a new patient, you don't know how many months they'll survive—that's what you're trying to predict.
Including it gave 83% accuracy (suspiciously high); excluding it drops accuracy to ~70%. The higher number is misleading: it cannot hold up in real-world use, where survival months are unknown at prediction time.
Rule: Never include variables that reveal the outcome or come from the future. All models here use only diagnosis-time features.
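A simple guard can enforce this rule before any model is trained. The column names below are hypothetical placeholders, not necessarily the dataset's actual headers.

```python
from typing import Iterable, List

# Hypothetical names for the outcome and for columns only known after it.
OUTCOME_COLUMNS = {"survival_months", "status"}


def diagnosis_time_features(columns: Iterable[str]) -> List[str]:
    """Drop columns that are the outcome or are only known after it."""
    return [c for c in columns if c not in OUTCOME_COLUMNS]


columns = [
    "progesterone_status",
    "estrogen_status",
    "node_positive",
    "grade",
    "survival_months",
    "status",
]
features = diagnosis_time_features(columns)
```

Keeping the exclusion list in one place makes it harder for a leaky column to slip back in when features are added later.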
77% accuracy at 5-year survival classification. This model helps clinicians identify high-risk patients who need aggressive treatment from day one.
Progesterone Status reveals how the tumor behaves, not just where it spread. Hormone biology predicts survival better than node counts alone.
Progesterone wins the ablation because of data balance, not clinical superiority. Both receptors matter, but one just splits the data better.