Score vs Latency
Each dot is one model. Higher is better; farther right is slower.
Acceptance Breakdown
Accepted, soft-accepted, and rejected cases for each model.
Recovery Lift
Each dot is one model. Farther right means a higher raw score. Above zero means deterministic recovery helped overall; below zero means it hurt overall.
Average vs P95 Latency
Yellow point = average run time. Cyan point = P95 latency, the run time that 95% of runs stay at or under. A wider gap between the two points means less stable latency.
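The two points can be reproduced from raw run times. The sketch below is only an illustration: it assumes a nearest-rank P95, the function name `latency_summary` is hypothetical, and the sample run times are made up. The dashboard may compute its percentile differently.

```python
from statistics import mean

def latency_summary(run_times_ms):
    """Return (average, p95) for a non-empty list of run times."""
    ordered = sorted(run_times_ms)
    # Nearest-rank P95: the smallest value with at least 95% of runs at or below it.
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return mean(ordered), ordered[rank - 1]

# Hypothetical run times: four steady runs and one slow outlier.
avg, p95 = latency_summary([100, 110, 120, 130, 900])
gap = p95 - avg  # a large gap signals unstable latency
```

With these sample values a single slow run pulls P95 far above the average, which is what a long gap between the yellow and cyan points indicates.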
Model Metric Heatmap
Greener cells mean better results relative to the other models currently shown. For the latency columns, faster models are greener.
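One way to get this kind of relative coloring is to scale each column across the models in view and flip the scale for latency. This is a sketch under that assumption, not the dashboard's actual coloring code; `greenness`, `lower_is_better`, and the sample values are hypothetical.

```python
def greenness(values, lower_is_better=False):
    """Map one metric column to 0.0 (least green) through 1.0 (greenest)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every model ties on this metric, so use a neutral midpoint.
        return [0.5 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    # For latency columns a smaller number is better, so invert the scale.
    return [1.0 - s for s in scaled] if lower_is_better else scaled

# Hypothetical columns for three models.
score_colors = greenness([0.62, 0.80, 0.71])
latency_colors = greenness([1200, 450, 900], lower_is_better=True)
```

Because the scaling uses only the models in view, the same model can appear more or less green when the selection changes.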
Benchmark Family Heatmap
Average case score by benchmark family for each model.