# ML comparison report (2026-01-03)

Scope: compare all ML runs to date, keeping POEMS+MMN unless noted, and document leakage checks.
All runs use 5-fold stratified CV, winsorization (1-99%), median imputation, and center-balanced sample weights. Center is never used as a feature.
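
The shared setup above could be sketched roughly as follows. This is a minimal illustration, not the project's training code: the function names are hypothetical, the weighting scheme (every center gets the same total weight) is one assumed reading of "center-balanced", and fitting the winsorization limits and medians on the training fold only is likewise an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def center_balanced_weights(centers: pd.Series) -> np.ndarray:
    """Assumed scheme: weights are inversely proportional to center size,
    so every center contributes the same total weight."""
    counts = centers.map(centers.value_counts())
    return (len(centers) / (centers.nunique() * counts)).to_numpy()


def cv_balanced_accuracy(X: pd.DataFrame, y: np.ndarray, centers: pd.Series,
                         seed: int = 0) -> float:
    """5-fold stratified CV with 1-99% winsorization and median imputation.
    Center is used only for sample weights, never as a feature."""
    weights = center_balanced_weights(centers)
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        # Clip limits and medians come from the training fold only (assumption).
        lower, upper = X_train.quantile(0.01), X_train.quantile(0.99)
        X_train = X_train.clip(lower=lower, upper=upper, axis=1)
        X_test = X_test.clip(lower=lower, upper=upper, axis=1)
        imputer = SimpleImputer(strategy="median").fit(X_train)
        model = ExtraTreesClassifier(n_estimators=100, random_state=seed)
        model.fit(imputer.transform(X_train), y[train_idx],
                  sample_weight=weights[train_idx])
        pred = model.predict(imputer.transform(X_test))
        scores.append(balanced_accuracy_score(y[test_idx], pred))
    return float(np.mean(scores))
```

Here `X` is expected to hold only the numeric features, with `centers` passed separately so it cannot leak into the feature matrix.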

## Winner (performance + leakage checks)
Winner by balanced accuracy: age + sensory + derived | best model ExtraTrees | bal_acc 0.812 | AUROC 0.890 | dir output/ml_benchmark_2026-01-03_age_sensory_with_derived
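Leakage signals for this family of runs: permutation baseline ~0.52-0.55; center-only baseline ~0.778; group-CV by center remains low (~0.46-0.59 bal_acc). The center-only baseline sits close to the best stratified-CV score (0.812), so part of that score may reflect center membership rather than transferable signal.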
Note: age shows high single-feature AUC (~0.81) and has imbalanced missingness by label; treat as a strong but potentially confounded signal.
Leakage proxy: group-CV HGB balanced accuracy is highest for age_sensory_with_derived (0.590) and age_only_with_derived (0.590); the center-only baseline is 0.778 in every audited run, so no feature set lowers it.

## Calculator deployment (current best model)
- Model: ExtraTrees trained on age + sensory + derived features.
- Training dataset: `output/ml_benchmark_2026-01-03_age_sensory_with_derived/ml_dataset_hereditary_vs_inflammatory.csv`
- Trained model artifact: `output/final_model_2026-01-03/extra_trees_model.joblib`
- Calculator model path: `neuropathy_calculator/model/extra_trees_model.joblib`
- Calculator features: age, sex, sensory_absent, median and ulnar latency, NCV, and CMAP amplitude, plus the derived features (ratio_amp, diff_ncv>10).
- Training script: `analysis/train_calculator_model.py`

## Training label counts by center (winner dataset)
Saved to: `output/ml_benchmark_2026-01-03_age_sensory_with_derived/label_counts_by_center.csv`

| center | hereditary | inflammatory | total |
| --- | --- | --- | --- |
| HCFMRP_USP | 172 | 25 | 197 |
| Humanitas_Milano | 0 | 97 | 97 |
| UFU | 12 | 14 | 26 |
| USP_SP | 4 | 14 | 18 |
| legacy_hcrp | 39 | 31 | 70 |
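
A table like the one above can be produced with a cross-tabulation. The sketch below is illustrative only; the column names `center` and `label` are assumptions about the dataset layout, not confirmed by the training script.

```python
import pandas as pd


def label_counts_by_center(df: pd.DataFrame,
                           center_col: str = "center",
                           label_col: str = "label") -> pd.DataFrame:
    """Count records per (center, label) and append a per-center total."""
    counts = pd.crosstab(df[center_col], df[label_col])
    counts["total"] = counts.sum(axis=1)
    return counts
```

Calling `label_counts_by_center(df).to_csv(path)` would then write the result in the same shape as the saved CSV, assuming one row per record.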

## Run log (all attempts)
Columns: run, records, excluded(no_label/poems/mmn), features, best_model, bal_acc, AUROC, notes, dir

| run | records | excl(no_label/poems/mmn) | features | best_model | bal_acc | AUROC | notes | dir |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| baseline_388_base7 | 388 | 20/0/0 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | RandomForest | 0.763 | 0.849 | 20 excluded (no label) | output/ml_benchmark_2026-01-02_022102 |
| derived_no_poems | 395 | 0/13/0 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp, ratio_amp, median_ulnar_distal_latency_diff_gt_10ms | ExtraTrees | 0.751 | 0.845 | 13 POEMS excluded, 20 recovered by group | output/ml_benchmark_2026-01-03_130633 |
| derived_mmn_included | 408 | 0/0/0 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp, ratio_amp, diff_ncv>10 | ExtraTrees | 0.762 | 0.845 | POEMS+MMN included, 20 recovered by group | output/ml_benchmark_2026-01-03_mmn_included |
| derived_mmn_excluded | 388 | 0/0/20 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp, ratio_amp, diff_ncv>10 | SVC_RBF | 0.706 | 0.781 | MMN excluded, POEMS included | output/ml_benchmark_2026-01-03_mmn_excluded |
| base7_408 | 408 | 0/0/0 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | ExtraTrees | 0.744 | 0.835 | POEMS+MMN included, 20 recovered by group | output/ml_benchmark_2026-01-03_base7_408 |
| base7_no_poems | 395 | 0/13/0 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | ExtraTrees | 0.764 | 0.848 | POEMS excluded | output/ml_benchmark_2026-01-03_base7_no_poems |
| base7_no_mmn | 388 | 0/0/20 | sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | SVC_RBF | 0.711 | 0.782 | MMN excluded | output/ml_benchmark_2026-01-03_base7_no_mmn |
| age_only_with_derived | 408 | 0/0/0 | age, sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp, ratio_amp, diff_ncv>10 | RandomForest | 0.808 | 0.887 | POEMS+MMN included | output/ml_benchmark_2026-01-03_age_only_with_derived |
| age_only_no_derived | 408 | 0/0/0 | age, sex, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | HistGradientBoosting | 0.807 | 0.877 | POEMS+MMN included | output/ml_benchmark_2026-01-03_age_only_no_derived |
| sensory_only_with_derived | 408 | 0/0/0 | sex, sensory_absent, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp, ratio_amp, diff_ncv>10 | ExtraTrees | 0.761 | 0.835 | POEMS+MMN included | output/ml_benchmark_2026-01-03_sensory_only_with_derived |
| sensory_only_no_derived | 408 | 0/0/0 | sex, sensory_absent, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | RandomForest | 0.748 | 0.828 | POEMS+MMN included | output/ml_benchmark_2026-01-03_sensory_only_no_derived |
| age_sensory_with_derived | 408 | 0/0/0 | age, sex, sensory_absent, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp, ratio_amp, diff_ncv>10 | ExtraTrees | 0.812 | 0.890 | POEMS+MMN included | output/ml_benchmark_2026-01-03_age_sensory_with_derived |
| age_sensory_no_derived | 408 | 0/0/0 | age, sex, sensory_absent, med_lat, med_ncv, med_amp, uln_lat, uln_ncv, uln_amp | RandomForest | 0.801 | 0.892 | POEMS+MMN included | output/ml_benchmark_2026-01-03_age_sensory_no_derived |

## Leakage audit summary (POEMS+MMN kept runs)
Columns: run, perm_bal (permutation baseline), center_only_bal (center-only baseline), then group-CV by center for RF, HGB, and XGB. Each group-CV cell reports balanced accuracy / AUROC.

| run | perm_bal | center_only_bal | groupCV_RF | groupCV_HGB | groupCV_XGB |
| --- | --- | --- | --- | --- | --- |
| age_only_with_derived | 0.541 | 0.778 | 0.509/0.605 | 0.590/0.651 | 0.569/0.642 |
| age_only_no_derived | 0.537 | 0.778 | 0.535/0.607 | 0.580/0.621 | 0.573/0.628 |
| sensory_only_with_derived | 0.529 | 0.778 | 0.456/0.493 | 0.485/0.511 | 0.515/0.528 |
| sensory_only_no_derived | 0.539 | 0.778 | 0.481/0.494 | 0.496/0.513 | 0.518/0.530 |
| age_sensory_with_derived | 0.525 | 0.778 | 0.518/0.593 | 0.590/0.646 | 0.559/0.629 |
| age_sensory_no_derived | 0.546 | 0.778 | 0.523/0.591 | 0.572/0.619 | 0.572/0.625 |
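
For readers unfamiliar with the two feature-dependent checks in this table, here is a hedged sketch of how a permutation baseline and a group-CV by center might be computed. It is not the audit script: the function names are hypothetical, RandomForest stands in for all three audited models, `X` is assumed to be already imputed, and scoring pooled out-of-fold predictions is an assumed choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_predict


def permutation_baseline(X, y, seed: int = 0) -> float:
    """Stratified-CV balanced accuracy after shuffling the labels.
    Values well above 0.5 would suggest the pipeline leaks information."""
    rng = np.random.default_rng(seed)
    y_perm = rng.permutation(y)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    pred = cross_val_predict(model, X, y_perm, cv=cv)
    return float(balanced_accuracy_score(y_perm, pred))


def group_cv_by_center(X, y, centers, seed: int = 0) -> float:
    """Hold out one whole center per fold, then score the pooled predictions.
    Pooling avoids undefined per-fold scores for single-label centers."""
    cv = GroupKFold(n_splits=len(np.unique(centers)))
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    pred = cross_val_predict(model, X, y, cv=cv, groups=centers)
    return float(balanced_accuracy_score(y, pred))
```

A large gap between stratified-CV and group-CV scores, as in the table, is the pattern this pair of checks is meant to expose.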

## Age confound checks (dataset: output/ml_benchmark_2026-01-03_age_sensory_no_derived)
Outputs: output/age_checks_2026-01-03/

- KS test (label 0 vs 1): stat 0.490, p=2.76e-16 (age distributions differ overall).
- Significant within-center KS (p<0.05): HCFMRP_USP (stat 0.487), legacy_hcrp (stat 0.651).
- Age missingness by label: label 0 = 0.64%, label 1 = 15.46% (imbalanced).
- Age-only AUC: 0.812.
- Stratified CV (balanced accuracy / AUROC):
  - age_only: RF 0.741 / 0.835; HGB 0.737 / 0.823
  - base7: RF 0.734 / 0.832; HGB 0.695 / 0.791
  - base7+age: RF 0.794 / 0.889; HGB 0.807 / 0.877
- Group-CV by center (balanced accuracy / AUROC):
  - with_age: RF 0.492 / 0.562; HGB 0.601 / 0.637
  - without_age: RF 0.460 / 0.486; HGB 0.554 / 0.567
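
The first three checks in this list can be approximated with a few lines of SciPy and scikit-learn. The sketch below assumes labels coded 0/1 and a hypothetical helper name; reporting the AUC as direction-agnostic (`max(auc, 1 - auc)`) is an assumption, since the report does not say how the age-only AUC was oriented.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def age_confound_summary(age: pd.Series, label: pd.Series) -> dict:
    """KS test of age by label, age missingness by label, and age-only AUC."""
    observed = age.notna()
    age0 = age[observed & (label == 0)]
    age1 = age[observed & (label == 1)]
    ks = ks_2samp(age0, age1)
    # Raw age is used as the score; rows with missing age are dropped.
    auc = roc_auc_score(label[observed], age[observed])
    return {
        "ks_stat": float(ks.statistic),
        "ks_p": float(ks.pvalue),
        "missing_rate_by_label": age.isna().groupby(label).mean().to_dict(),
        "age_only_auc": float(max(auc, 1.0 - auc)),
    }
```

Running the same function within each center would give the per-center KS statistics, provided both labels are present in that center.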

## Hyperparameter search (base7, POEMS/MMN exclusion scenarios)
Best per model (balanced accuracy):

| run | model | bal_acc | AUROC | params | dir |
| --- | --- | --- | --- | --- | --- |
| base7_no_poems | ExtraTrees | 0.778 | 0.848 | {"max_depth": 12, "max_features": 0.7, "min_samples_leaf": 1, "n_estimators": 600} | output/ml_hyperparam_2026-01-03_base7_no_poems |
| base7_no_poems | HistGradientBoosting | 0.705 | 0.802 | {"learning_rate": 0.1, "max_depth": null, "max_leaf_nodes": 15, "min_samples_leaf": 20} | output/ml_hyperparam_2026-01-03_base7_no_poems |
| base7_no_poems | RandomForest | 0.745 | 0.835 | {"max_depth": 12, "max_features": "sqrt", "min_samples_leaf": 2, "n_estimators": 600} | output/ml_hyperparam_2026-01-03_base7_no_poems |
| base7_no_poems | XGBoost | 0.730 | 0.804 | {"colsample_bytree": 0.8, "learning_rate": 0.1, "max_depth": 5, "min_child_weight": 1, "n_estimators": 400, "subsample": 0.8} | output/ml_hyperparam_2026-01-03_base7_no_poems |
| base7_no_mmn | ExtraTrees | 0.724 | 0.797 | {"max_depth": 12, "max_features": "sqrt", "min_samples_leaf": 1, "n_estimators": 300} | output/ml_hyperparam_2026-01-03_base7_no_mmn |
| base7_no_mmn | HistGradientBoosting | 0.684 | 0.749 | {"learning_rate": 0.03, "max_depth": null, "max_leaf_nodes": 15, "min_samples_leaf": 5} | output/ml_hyperparam_2026-01-03_base7_no_mmn |
| base7_no_mmn | RandomForest | 0.713 | 0.794 | {"max_depth": null, "max_features": 0.7, "min_samples_leaf": 2, "n_estimators": 300} | output/ml_hyperparam_2026-01-03_base7_no_mmn |
| base7_no_mmn | XGBoost | 0.704 | 0.763 | {"colsample_bytree": 0.8, "learning_rate": 0.03, "max_depth": 5, "min_child_weight": 1, "n_estimators": 400, "subsample": 0.8} | output/ml_hyperparam_2026-01-03_base7_no_mmn |
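
A search of this kind could be set up with `GridSearchCV` scored on balanced accuracy. The sketch below is an assumption about the approach, not the actual search script: the function name is hypothetical, and the default grid merely contains the ExtraTrees values that appear in the table, not the full grid that was explored.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def search_extra_trees(X, y, param_grid=None, seed: int = 0) -> GridSearchCV:
    """Grid search over ExtraTrees settings, selecting by balanced accuracy."""
    if param_grid is None:
        # Illustrative grid only; the real search space is not documented here.
        param_grid = {
            "n_estimators": [300, 600],
            "max_depth": [None, 12],
            "max_features": ["sqrt", 0.7],
            "min_samples_leaf": [1, 2],
        }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        ExtraTreesClassifier(random_state=seed),
        param_grid,
        scoring="balanced_accuracy",
        cv=cv,
    )
    return search.fit(X, y)
```

After fitting, `search.best_params_` and `search.best_score_` correspond to the `params` and `bal_acc` columns above. `X` is assumed to be imputed already.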

## Notes
- Derived features (ratio_amp, diff_ncv>10) did not show strong single-feature AUC (ratio_amp ~0.48; diff_ncv>10 ~0.60).
- Age is a strong single feature (AUC ~0.81) and shows imbalanced missingness by label; interpret center generalization with care.
- Center-only baseline remains high (~0.778) regardless of features, reflecting center imbalance in labels.
- One earlier run used a latency-difference derived feature (median_ulnar_distal_latency_diff_gt_10ms) before switching to NCV diff.
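
To make the center-only baseline concrete, one simple way to estimate it is to predict each held-out record with the majority label of its center in the training fold. This is an unweighted sketch under assumed conventions (integer labels, hypothetical function name), so it will not necessarily reproduce the 0.778 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def center_only_baseline(centers, y, seed: int = 0) -> float:
    """Balanced accuracy when center membership is the only information used."""
    centers = pd.Series(list(centers))
    y = pd.Series(list(y))
    pred = np.empty(len(y), dtype=int)  # assumes integer-coded labels
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # The feature matrix is a placeholder; only y drives the stratified split.
    for train_idx, test_idx in cv.split(np.zeros(len(y)), y):
        train = pd.DataFrame({"center": centers.iloc[train_idx],
                              "y": y.iloc[train_idx]})
        majority = train.groupby("center")["y"].agg(lambda s: s.mode().iloc[0])
        fallback = train["y"].mode().iloc[0]  # for centers unseen in training
        fold_pred = centers.iloc[test_idx].map(majority).fillna(fallback)
        pred[test_idx] = fold_pred.to_numpy()
    return float(balanced_accuracy_score(y, pred))
```

Because the function never sees the features, its result is the same for every feature set, which is why the baseline is constant across runs.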

## Artifact inventory (paths)
- Benchmarks: `output/ml_benchmark_2026-01-02_022102/`, `output/ml_benchmark_2026-01-03_130633/`, `output/ml_benchmark_2026-01-03_mmn_included/`, `output/ml_benchmark_2026-01-03_mmn_excluded/`, `output/ml_benchmark_2026-01-03_base7_408/`, `output/ml_benchmark_2026-01-03_base7_no_poems/`, `output/ml_benchmark_2026-01-03_base7_no_mmn/`, `output/ml_benchmark_2026-01-03_age_only_with_derived/`, `output/ml_benchmark_2026-01-03_age_only_no_derived/`, `output/ml_benchmark_2026-01-03_sensory_only_with_derived/`, `output/ml_benchmark_2026-01-03_sensory_only_no_derived/`, `output/ml_benchmark_2026-01-03_age_sensory_with_derived/`, `output/ml_benchmark_2026-01-03_age_sensory_no_derived/`.
- Leakage audits: `output/ml_leakage_audit_2026-01-03_*` (see table above), plus `output/ml_leakage_audit_2026-01-03_mmn_excluded/`.
- Hyperparameter search: `output/ml_hyperparam_2026-01-03_base7_no_poems/`, `output/ml_hyperparam_2026-01-03_base7_no_mmn/`.
- Age confound checks: `output/age_checks_2026-01-03/` (KS tests, missingness, CVs, plots).
- Final model: `output/final_model_2026-01-03/` and `neuropathy_calculator/model/extra_trees_model.joblib`.
- Figure index: `knowledge_base/presentations/FIGURE_INDEX_2026-01-03.md`.