CtxSift // Benchmark Command Deck

CtxSift Benchmark Viewer

Source: D:\python_projects\opensource-contribs\ctxsift\benchmark\results

Generated: 2026-05-26 20:00:26Z

How scoring works

The headline score is the recovered score. Each case first gets a case score, then the scenario rolls those case scores into a quality core and applies a mild latency factor. Validation, visible-thought leakage, and instruction-following penalties are already baked into each case score.

case_score =
  validation_factor *
  thought_penalty *
  instruction_penalty *
  blended_quality

quality_core = 0.80 * mean(case_scores) + 0.20 * p10(case_scores)

latency_factor = clamp(0.85, 1.00, (2000ms / observed_ms)^0.15)

final_score = 100 * quality_core * latency_factor
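
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the viewer's actual code: the function names are hypothetical, the percentile uses linear interpolation (the page does not say which p10 method is used), and `observed_ms` is assumed to be a single positive latency figure for the scenario.

```python
from statistics import mean


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100].

    Assumption: the viewer's exact percentile method is not stated.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def clamp(low, high, value):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def case_score(validation_factor, thought_penalty, instruction_penalty, blended_quality):
    """Product of the three multiplicative factors and the blended quality."""
    return validation_factor * thought_penalty * instruction_penalty * blended_quality


def final_score(case_scores, observed_ms):
    """Roll case scores into a 0-100 scenario score with a mild latency factor."""
    quality_core = 0.80 * mean(case_scores) + 0.20 * percentile(case_scores, 10)
    # 2000 ms is the reference latency; the factor is bounded to [0.85, 1.00].
    latency_factor = clamp(0.85, 1.00, (2000.0 / observed_ms) ** 0.15)
    return 100.0 * quality_core * latency_factor
```

Because the latency factor is clamped to [0.85, 1.00], a scenario faster than 2000 ms gains nothing extra, and even a very slow scenario loses at most 15% of its quality core.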

All Top Score

CPU Top Score

GPU Top Score

Remote Top Score

Latency Leaderboard

Score Leaderboard

Success, Error, Exact pass, Preserve, Instruction, Quality

Visual Comparisons

Global charts for quality, latency, reliability, and benchmark-family behavior. Use the track toggle to switch between all models or one track.

Latency By Category

Bar = average ask latency in seconds. White marker = p95 latency for that category.

Categories are derived from benchmark case domains across the selected scenarios.
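
As a rough sketch of how the bar and marker values could be derived, the snippet below groups asks by domain and computes the average and p95 latency per group. The function name, the input shape (pairs of domain and latency in seconds), and the linear-interpolation percentile are assumptions for illustration; the viewer's own aggregation is not shown on this page.

```python
from collections import defaultdict
from statistics import mean


def latency_by_category(asks):
    """Summarize latency per domain.

    asks: iterable of (domain, latency_seconds) pairs (hypothetical shape).
    Returns {domain: {"avg_s": ..., "p95_s": ...}}.
    """
    grouped = defaultdict(list)
    for domain, latency_s in asks:
        grouped[domain].append(latency_s)

    summary = {}
    for domain, latencies in grouped.items():
        ordered = sorted(latencies)
        # p95 via linear interpolation between the two nearest ranks (assumed method).
        rank = (len(ordered) - 1) * 0.95
        lower = int(rank)
        upper = min(lower + 1, len(ordered) - 1)
        p95 = ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
        summary[domain] = {"avg_s": mean(ordered), "p95_s": p95}
    return summary
```

With only a handful of asks in a category, the p95 marker sits close to the slowest ask, so small categories should be read with care.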

Scenario Summary

Recovered score is the main score after deterministic cleanup, including safe visible-thought cleanup when possible. Raw score is the same run before that recovery step. Lift shows how much recovery helped.
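
A minimal sketch of the lift calculation, assuming both scores are on the same 0-100 scale and that lift is the plain difference in percentage points (the per-ask table labels the column "Lift (pp)"). The function name is hypothetical.

```python
def recovery_lift_pp(recovered_score, raw_score):
    """Lift in percentage points: how much the recovery step added to the raw score."""
    return recovered_score - raw_score
```

A lift of zero means the run needed no cleanup; a large lift means the raw output depended heavily on the recovery step to score well.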

Scenario	Track	Warmup	Avg s	P95 s	Recovered	Raw	Lift	Avg preserve	Avg quality	Avg instruction	Recovered thought	Max preserve	Success	Exact

Domain Breakdown

Domain	Cases	Errors	Avg s	Median s	P95 s	Avg preserve	Avg quality	Avg instruction

Slowest Asks Across Selection

The highest-latency asks across every selected scenario, useful for spotting worst-case prompts and domains.

Scenario	Ask	Domain	Latency (s)	Preserve	Quality	Instruction	Recovered thought	Status

Ask	Domain	Latency (s)	Recovered	Raw	Lift (pp)	Preserve	Quality	Instruction	Recovered thought	Raw thought	Status

Additional views in the navigation: Score vs Latency, Acceptance Breakdown, Recovery Lift, Average vs P95 Latency, Model Metric Heatmap, Benchmark Family Heatmap, Head-to-Head