CtxSift // Benchmark Command Deck

CtxSift Benchmark Viewer

Source: D:\python_projects\opensource-contribs\ctxsift\benchmark\results
Generated: 2026-05-26 20:00:26Z
How scoring works
The headline score is the recovered score. Each case first gets a case score; the scenario then combines those case scores into a quality core (80% mean, 20% 10th percentile) and applies a mild latency factor bounded between 0.85 and 1.00. Validation, visible-thought leakage, and instruction-following penalties are already included in each case score.
case_score =
  validation_factor *
  thought_penalty *
  instruction_penalty *
  blended_quality

quality_core = 0.80 * mean(case_scores) + 0.20 * p10(case_scores)

latency_factor = clamp(0.85, 1.00, (2000ms / observed_ms)^0.15)

final_score = 100 * quality_core * latency_factor
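
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the viewer's actual code: the function names are hypothetical, `clamp` is assumed to take its arguments as (lower bound, upper bound, value), and `p10` is assumed to be a linearly interpolated 10th percentile, which the page does not specify.

```python
import math


def clamp(lo, hi, x):
    """Restrict x to the closed interval [lo, hi] (argument order is an assumption)."""
    return max(lo, min(hi, x))


def percentile(values, q):
    """q-th percentile with linear interpolation between ranks (an assumption)."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def case_score(validation_factor, thought_penalty, instruction_penalty, blended_quality):
    """Per-case score: the three penalty factors multiply the blended quality."""
    return validation_factor * thought_penalty * instruction_penalty * blended_quality


def final_score(case_scores, observed_ms):
    """Scenario score on a 0-100 scale from case scores and observed latency."""
    mean_score = sum(case_scores) / len(case_scores)
    # The p10 term pulls the score down when a few cases do badly.
    quality_core = 0.80 * mean_score + 0.20 * percentile(case_scores, 10)
    # Latency at or below 2000 ms earns the full factor of 1.00;
    # slower runs are penalized, but never below 0.85.
    latency_factor = clamp(0.85, 1.00, (2000.0 / observed_ms) ** 0.15)
    return 100.0 * quality_core * latency_factor


if __name__ == "__main__":
    scores = [case_score(1.0, 1.0, 0.9, 0.8), case_score(1.0, 0.5, 1.0, 0.9)]
    print(round(final_score(scores, observed_ms=3500), 2))
```

Under these assumptions the latency factor reaches its 0.85 floor at roughly 5.9 seconds of observed latency, so a slow model loses at most 15% of its quality core.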

All Top Score

CPU Top Score

GPU Top Score

Remote Top Score

Latency Leaderboard

Score Leaderboard

Success | Error | Exact pass | Preserve | Instruction | Quality

Visual Comparisons

Global charts for quality, latency, reliability, and benchmark-family behavior. Use the track toggle to switch between all models or one track.

Head-to-Head

Compare two models on their quality, latency, reliability, and benchmark-family behavior. Use the track toggle to switch between all models or one track.
Ask | Domain | Latency (s) | Recovered | Raw | Lift (pp) | Preserve | Quality | Instruction | Recovered thought | Raw thought | Status