"Any new TBI prediction model must demonstrate improvement over IMPACT Core and CRASH Basic -- on calibration, not just discrimination."
| Attribute | Value |
|---|---|
| Status | Incubating |
| Maturity | Design Phase |
| License | Apache-2.0 |
| Part of | Evidence Commons |
| Mission Pillar | Pillar 1 (Clinical AI Evaluation & Benchmarking) |
Traumatic brain injury outcome prediction has two established baselines: IMPACT Core (age, GCS motor, pupils; C-statistic ~0.78-0.82 for mortality) and CRASH Basic. Most published models report only discrimination (AUROC), but calibration superiority -- whether predicted probabilities match observed outcomes -- matters more for clinical deployment. TBI-Benchmarks is designed to provide standardized evaluation protocols, synthetic benchmark datasets, and reference implementations that enforce methodologically rigorous model comparison, including net reclassification improvement (NRI) as the preferred comparison metric.
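As an illustration of the NRI comparison the protocols are meant to enforce, here is a minimal categorical-NRI sketch in pure NumPy. The function name and the two risk thresholds are hypothetical placeholders for illustration, not extracted project code or a published TBI risk-category standard:

```python
import numpy as np

def categorical_nri(p_old, p_new, y, thresholds=(0.1, 0.3)):
    """Net reclassification improvement of a candidate model over a baseline.

    p_old, p_new: predicted event probabilities from baseline and candidate.
    y: binary outcomes (1 = event). thresholds: risk-category cut points
    (illustrative values only).
    """
    p_old, p_new, y = map(np.asarray, (p_old, p_new, y))
    cat_old = np.digitize(p_old, thresholds)   # map probabilities to categories
    cat_new = np.digitize(p_new, thresholds)
    up, down = cat_new > cat_old, cat_new < cat_old
    ev, ne = y == 1, y == 0
    # Events should move up in risk category; non-events should move down.
    nri_events = up[ev].mean() - down[ev].mean()
    nri_nonevents = down[ne].mean() - up[ne].mean()
    return nri_events + nri_nonevents
```

Positive values mean the candidate reclassifies cases in the clinically correct direction more often than the baseline; a real comparison would also need category-free (continuous) NRI and confidence intervals.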
This repository does not yet contain extracted code or datasets. The parent codebase (evidenceos-research/evidenceos-bench) contains benchmarking infrastructure that is intended to be extracted and adapted for public use here. The planned scope includes synthetic TBI case sets (no real patient data), evaluation harnesses with TRIPOD+AI compliance checking (Collins et al. 2024), and baseline model implementations for IMPACT Core and CRASH Basic.
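To make the planned synthetic case sets concrete, here is a minimal sketch of what a fully synthetic generator might look like. Every distribution and coefficient below is an illustrative placeholder: this is neither the published IMPACT model nor the parent SyntheticDatasetFactory.

```python
import numpy as np

def generate_synthetic_tbi_cases(n, seed=0):
    """Sketch of a fully synthetic TBI case generator (no real patient data).

    Predictors mirror IMPACT Core (age, GCS motor score, pupil reactivity);
    all ranges and coefficients are hypothetical placeholders.
    """
    rng = np.random.default_rng(seed)
    age = rng.integers(16, 90, n)       # years
    gcs_motor = rng.integers(1, 7, n)   # GCS motor score, 1-6
    pupils = rng.integers(0, 3, n)      # 0 = both reactive ... 2 = both fixed
    # Hypothetical log-odds model for 6-month mortality, for illustration only.
    lp = -3.0 + 0.03 * age - 0.4 * gcs_motor + 0.8 * pupils
    mortality = rng.random(n) < 1.0 / (1.0 + np.exp(-lp))
    return {"age": age, "gcs_motor": gcs_motor,
            "pupils": pupils, "mortality": mortality.astype(int)}
```

A seeded generator like this keeps benchmark runs reproducible while guaranteeing that no record corresponds to a real patient.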
| Component | Description | Parent Code Exists |
|---|---|---|
| `baselines/` | Reference implementations of IMPACT Core and CRASH Basic | Planned |
| `datasets/` | Synthetic TBI benchmark datasets (fully generated, no real patient data) | Not yet |
| `protocols/` | Evaluation protocol definitions (discrimination, calibration, NRI, DCA) | Partial (in parent) |
| `harness/` | Reproducible evaluation runner with TRIPOD+AI compliance checks | Partial (in parent) |
| `leaderboard/` | Model comparison infrastructure and result formatting | Planned |
What exists in the parent codebase:
- Benchmarking engine in `evidenceos-research/evidenceos-bench`
- TRIPOD+AI compliance checklist implementation (Collins et al. 2024)
- Sample size validation using Riley et al. (2019/2020) criteria: EPV >= 10 is necessary but not sufficient; a formal `pmsampsize` calculation is required
- Multiverse analysis infrastructure producing 927 model configurations across 13 configs
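The EPV heuristic in the sample-size criterion is simple to state in code. A sketch (the function name is illustrative, and this deliberately does not replace a formal `pmsampsize`-style calculation):

```python
def events_per_variable(n_events, n_candidate_params):
    """Events-per-variable screening check for prediction model development.

    EPV >= 10 is treated as necessary but not sufficient (per Riley et al.);
    a formal minimum-sample-size calculation should always follow.
    Returns (epv, passes_screen).
    """
    epv = n_events / n_candidate_params
    return epv, epv >= 10.0
```

Usage: `events_per_variable(150, 10)` screens a model with 10 candidate parameters developed on 150 outcome events.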
What does not exist yet:
- Standalone synthetic TBI benchmark datasets for public distribution
- Reference implementations of IMPACT Core and CRASH Basic as comparison baselines
- Standardized evaluation harness separated from the research pipeline
- Public leaderboard infrastructure
- NRI calculation utilities as a standalone module
Planned next steps:
- Define evaluation protocol specifications covering discrimination (AUROC, C-statistic), calibration (calibration-in-the-large, calibration slope, calibration plots), NRI, and decision curve analysis (DCA)
- Generate synthetic TBI benchmark datasets using the existing SyntheticDatasetFactory (no real patient data)
- Implement reference baselines for IMPACT Core and CRASH Basic with documented expected performance ranges
- Extract and adapt the TRIPOD+AI compliance checker as a standalone validation module
- Build an evaluation harness that accepts any model's predictions and produces standardized comparison reports
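The core discrimination and calibration metrics in the protocol list reduce to short formulas. A pure-NumPy sketch (function names are illustrative; a real harness would add calibration slope, calibration plots, DCA, and confidence intervals):

```python
import numpy as np

def auroc(y, p):
    """C-statistic via the rank-sum (Mann-Whitney) identity, with tied
    predictions assigned average ranks."""
    y, p = np.asarray(y), np.asarray(p)
    order = np.argsort(p)
    ranks = np.empty(len(p), dtype=float)
    ranks[order] = np.arange(1, len(p) + 1)
    for v in np.unique(p):           # average ranks over ties (O(n*u) sketch)
        mask = p == v
        ranks[mask] = ranks[mask].mean()
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def calibration_in_the_large(y, p):
    """Mean observed minus mean predicted risk. (Some protocols instead fit
    an intercept on the logit scale with the linear predictor fixed.)"""
    return float(np.mean(y) - np.mean(p))
```

A harness built on functions like these can accept any model's predicted probabilities and report discrimination and calibration side by side against the baselines.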
```mermaid
graph LR
    A[evidenceos-bench<br/>evaluation engine] --> B[TBI-Benchmarks]
    B --> C[Clinical Arena<br/>model leaderboard]
    B --> D[BRIDGE-TBI<br/>validation baseline]
    style B fill:#2A9D8F,stroke:#1E3A8A,color:#fff
```
TBI-Benchmarks is intended to provide the evaluation layer for the Clinical Arena (model leaderboard) and validation infrastructure for BRIDGE-TBI (clinical decision support). Canonical source: evidenceos-research/evidenceos-bench.
This project is in the design phase. The evaluation protocols and benchmark specifications are being defined; no code has been extracted to this repository yet. Contributions to protocol specification and synthetic dataset design are the most immediately useful. See CONTRIBUTING.md.
Apache-2.0 -- see LICENSE for details.