Benchmark Guide
CtxSift includes a dedicated benchmark because this project is not doing generic summarization. The actual task is narrower and stricter: take noisy command output and return the part an agent asked for, without dropping anchors, breaking format, or wasting tokens.
The current corpus contains 280 cases built from realistic CLI output, stack traces, build failures, CI logs, linter output, and extraction-heavy prompts. Each case includes a raw input block, an instruction, a target output, and validation rules. That makes the suite much closer to real coding-agent use than a normal summary benchmark.
What it measures
The benchmark is checking whether a model can stay useful under constraints. In practice that means preserving exact tokens when they matter, staying semantically close to the intended answer, obeying the requested output shape, and staying brief enough to be worth storing.
The dataset is split into six families:
| Family | What it tests |
|---|---|
| `summary` | Short, useful summaries of noisy output |
| `recall` | Compression that preserves exact anchors for later search and reuse |
| `explanation` | Brief, grounded explanations of what failed |
| `instruction_following` | Tight adherence to inclusion, exclusion, and length rules |
| `structured` | JSON, YAML, bullet lists, and markdown tables |
| `exact_format` | Bare values or exact lines with no extra prose |
The benchmark also spans multiple output modes such as plain text, bullet lists, JSON, YAML, tables, single-line output, regex-constrained output, and exact extracted lines. A model can sound correct and still fail the task if it ignores the requested shape.
Scoring
Each case produces four component scores:
| Component | Meaning |
|---|---|
| Preserve | Required anchors stayed intact |
| Quality | The answer stayed close to the intended result |
| Format | The output matched the requested shape |
| Brevity | The output stayed near the allowed budget |
Each case is then validated as accepted, soft_accepted, or rejected. Rejections are reserved for hard failures such as empty output, control-token leakage, prompt echoing, visible thought leakage in strict outputs, broken structured output, or returning prose when the task asked for an exact value.
Each benchmark run now stores two views:
- Recovered: the main score. This is after CtxSift does deterministic cleanup, including safe visible-thought cleanup when that can be done without breaking the requested contract.
- Raw: the side score. This is the same model answer before that recovery step.
The difference between them is the recovery lift. That tells you whether the model was already clean, or whether the product had to rescue it. This is especially useful for models that answer with lines like Okay, the user wants..., I should..., or leaked wrappers such as <think>...</think>.
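As a small illustration of the arithmetic, the lift can be computed as recovered minus raw. The function names here are hypothetical and are not taken from the CtxSift code base; the sign labels follow how the viewer's scatter plot is described later in this guide.

```python
def recovery_lift(recovered_score: float, raw_score: float) -> float:
    """Recovery lift in score points: recovered score minus raw score."""
    return recovered_score - raw_score


def describe_lift(lift: float) -> str:
    """Label the sign of the lift (hypothetical helper, for illustration only)."""
    if lift > 0:
        return "recovery helped"
    if lift < 0:
        return "recovery hurt"
    return "no change"


if __name__ == "__main__":
    lift = recovery_lift(recovered_score=72.5, raw_score=60.0)
    print(lift, describe_lift(lift))
```

A lift near zero suggests the model was already clean; a large positive lift suggests the product had to rescue the output.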
The final score is based on per-case scores first, then rolled up into a scenario score:

    case_score = validation_factor * thought_penalty * instruction_penalty * blended_quality
    quality_core = 0.80 * mean(case_scores) + 0.20 * p10(case_scores)
    latency_factor = clamp(0.85, 1.00, (2000ms / observed_ms)^0.15)
    final_score = 100 * quality_core * latency_factor

`blended_quality` is the same intent-weighted mix of anchor, semantic, format, and brevity scores.
instruction_penalty is intent-sensitive:
- plain-text intents: `0.75 + 0.25 * instruction_following_score`
- strict intents: start from `0.40 + 0.60 * instruction_following_score`, then only apply that penalty when `format_score` is weak. If a strict answer already matches the requested format perfectly, there is no extra instruction penalty on top.
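The two rules above can be sketched roughly as follows. This is a minimal sketch, not the project's implementation: the guide does not say where the cut-off for a "weak" `format_score` lies, so the `weak_format_threshold` parameter is an assumption, set here so that only a perfect format score avoids the penalty.

```python
def instruction_penalty(
    strict_intent: bool,
    instruction_following_score: float,
    format_score: float,
    weak_format_threshold: float = 1.0,  # assumed cut-off, not documented
) -> float:
    """Return a multiplier in (0, 1] applied to the case score."""
    if not strict_intent:
        # Plain-text intents: always a mild penalty.
        return 0.75 + 0.25 * instruction_following_score
    if format_score >= weak_format_threshold:
        # Strict answer already matches the requested format: no extra penalty.
        return 1.0
    # Strict intents with a weak format score: harsher penalty.
    return 0.40 + 0.60 * instruction_following_score


if __name__ == "__main__":
    print(instruction_penalty(False, 0.5, 1.0))  # plain-text intent
    print(instruction_penalty(True, 0.5, 1.0))   # strict, perfect format
    print(instruction_penalty(True, 0.5, 0.6))   # strict, weak format
```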
Strict format scoring is graded too. For exact_match, regex, and structured shape checks, CtxSift gives partial credit to close near-miss outputs instead of treating them the same as totally wrong outputs. A command that misses a prefix or punctuation can still score partway. A raw log dump, wrong command type, or incomplete structured payload still scores very poorly.
Cases can also take a visible-thought penalty before the scenario rollup. If the answer still contains model reasoning text, the case score is reduced based on how much of the output is thought leakage. Clean answers stay at full weight.
This is not limited to <think>...</think> wrappers. CtxSift also looks for common visible meta-reasoning lines such as Okay, the user wants..., I should return..., or However, the instruction says.... Recovery only trims the leading preamble when that is safe. If stray reasoning is still left in the answer body, the score still drops.
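To make the idea concrete, here is a rough sketch of leading-preamble cleanup and a simple thought-density measure. Everything in it is an assumption for illustration: the patterns are built only from the example phrases quoted above, the density is measured per line, and the "only when that is safe" check that CtxSift applies is not modeled.

```python
import re

# Hypothetical patterns derived from the example phrases in this guide.
META_LINE = re.compile(
    r"^\s*(okay, the user wants|i should\b|however, the instruction says)",
    re.IGNORECASE,
)
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_leading_preamble(answer: str) -> str:
    """Drop <think> wrappers, then drop meta-reasoning lines at the start only."""
    lines = THINK_BLOCK.sub("", answer).splitlines()
    start = 0
    while start < len(lines) and (
        not lines[start].strip() or META_LINE.match(lines[start])
    ):
        start += 1
    return "\n".join(lines[start:]).strip()


def thought_density(answer: str) -> float:
    """Share of non-empty lines that still look like visible reasoning."""
    lines = [line for line in answer.splitlines() if line.strip()]
    if not lines:
        return 0.0
    flagged = sum(1 for line in lines if META_LINE.match(line))
    return flagged / len(lines)


if __name__ == "__main__":
    raw = (
        "<think>plan</think>\n"
        "Okay, the user wants the error.\n"
        "ERROR: build failed\n"
        "I should stop here."
    )
    cleaned = strip_leading_preamble(raw)
    print(cleaned)
    print(thought_density(cleaned))
```

In this sketch the trailing `I should stop here.` line survives cleanup because it sits in the answer body, so the density stays above zero, which mirrors the rule that stray reasoning left in the body still lowers the score.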
This means the benchmark rewards strong average quality, but it also punishes unstable models with weak tail behavior. Latency matters, but only as a mild penalty. The headline score in the viewer is the recovered score. The raw score uses the same formula on the unrecovered output.
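The rollup can be sketched in a few lines to show why a weak tail costs more than the mean alone suggests. This is a sketch under stated assumptions: the guide does not say how `p10` is interpolated or which latency statistic feeds `observed_ms`, so the linear-interpolation percentile and the single `observed_ms` argument below are illustrative choices.

```python
import math


def p10(values: list[float]) -> float:
    """10th percentile with linear interpolation (an assumed method)."""
    if not values:
        raise ValueError("p10 needs at least one case score")
    ordered = sorted(values)
    rank = 0.10 * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def latency_factor(observed_ms: float) -> float:
    """(2000ms / observed_ms)^0.15, clamped to the range [0.85, 1.00]."""
    return min(1.00, max(0.85, (2000.0 / observed_ms) ** 0.15))


def final_score(case_scores: list[float], observed_ms: float) -> float:
    """Scenario score on a 0-100 scale from per-case scores in [0, 1]."""
    mean = sum(case_scores) / len(case_scores)
    quality_core = 0.80 * mean + 0.20 * p10(case_scores)
    return 100.0 * quality_core * latency_factor(observed_ms)


if __name__ == "__main__":
    stable = [0.8] * 10                # every case scores 0.8
    unstable = [1.0] * 8 + [0.0] * 2   # same mean, two total failures
    print(final_score(stable, observed_ms=2000))
    print(final_score(unstable, observed_ms=2000))
    print(latency_factor(4000))        # twice the reference latency
```

Both example runs have the same mean case score, but the run with two total failures loses the whole 20% tail term. Doubling the latency from the 2000 ms reference costs roughly ten percent, and the clamp caps the loss at fifteen percent.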
The viewer now expects the current result schema. Older benchmark JSON from before the raw/recovered split and per-case score storage should be rerun instead of relying on viewer-side score reconstruction.
The weights now change by explicit intent, not by the older family label. recall rows care more about exact anchors, while exact-format rows care much more about output shape. family is still useful in the dashboard, but it no longer decides the scoring contract.
| Intent | Anchor | Semantic | Format | Brevity |
|---|---|---|---|---|
| `recall` | 0.45 | 0.25 | 0.20 | 0.10 |
| `summary` | 0.25 | 0.40 | 0.20 | 0.15 |
| `exact-lines` | 0.45 | 0.10 | 0.40 | 0.05 |
| `exact-format` | 0.15 | 0.10 | 0.70 | 0.05 |
| `json` / `yaml` / `table` / `bullet-list` | 0.20 | 0.30 | 0.40 | 0.10 |
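Read as a weighted sum, the table above could be applied like this. The weights are copied from the table; the assumption is that the "intent-weighted mix" is a plain linear combination, and the dictionary and function names are hypothetical.

```python
# (anchor, semantic, format, brevity) weights per intent, from the table above.
STRUCTURED = (0.20, 0.30, 0.40, 0.10)
INTENT_WEIGHTS = {
    "recall": (0.45, 0.25, 0.20, 0.10),
    "summary": (0.25, 0.40, 0.20, 0.15),
    "exact-lines": (0.45, 0.10, 0.40, 0.05),
    "exact-format": (0.15, 0.10, 0.70, 0.05),
    "json": STRUCTURED,
    "yaml": STRUCTURED,
    "table": STRUCTURED,
    "bullet-list": STRUCTURED,
}


def blended_quality(
    intent: str, anchor: float, semantic: float, fmt: float, brevity: float
) -> float:
    """Weighted sum of the four component scores (assumed to be linear)."""
    w_anchor, w_semantic, w_format, w_brevity = INTENT_WEIGHTS[intent]
    return (
        w_anchor * anchor
        + w_semantic * semantic
        + w_format * fmt
        + w_brevity * brevity
    )


if __name__ == "__main__":
    # Same component scores, perfect except for a total format miss.
    for intent in ("recall", "exact-format"):
        print(intent, blended_quality(intent, 1.0, 1.0, 0.0, 1.0))
```

With identical component scores, a complete format miss leaves a `recall` case at about 0.80 but drops an `exact-format` case to about 0.30, which is the practical effect of weighting by intent.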
Reading results
When you open the viewer, start with the recovered score, then check the raw score and the recovery lift.
| Metric | How to read it |
|---|---|
| Recovered score | Best single comparison number |
| Raw score | How clean the model answer was before recovery |
| Recovery lift | How much deterministic recovery helped |
| Thought density | How much visible reasoning leaked into the answer |
| Quality core | Output quality before the latency penalty |
| Accepted / Soft / Rejected | Reliability signal |
| Average / P95 latency | Normal speed and tail speed |
| Preserve / Quality / Format / Brevity | Shows why the model scored the way it did |
Latency should be compared carefully. CPU-to-CPU runs on the same machine are fair. GPU-to-GPU runs on the same machine are fair. Remote latency is more volatile because provider load and network path matter. Quality numbers travel better across tracks than raw speed does.
If recovered and raw are close, the model is naturally clean. If recovered is much higher, the product is rescuing messy output. If both are low, the model is simply a weak fit for this task. If raw thought density is high, the model is leaking reasoning into the answer even when some of the core facts are right.
In the detail view, case rows now show recovered thought density and raw thought density separately. That makes it easy to tell whether cleanup removed visible reasoning, or whether the model stayed clean by itself.
The viewer also includes a recovery-lift scatter: raw score on the x-axis, recovery lift on the y-axis, and color by CPU / GPU / remote track. Points above zero are runs where recovery helped overall. Points below zero are runs where recovery hurt a little.
That same deterministic recovery path is enabled in normal product output by default. If you want to compare behavior with and without that recovery step, turn it off with recovery_enabled = false or CTXSIFT_RECOVERY_ENABLED=false.
The scenario matrix lives in benchmark/matrix.json and defines the three main tracks used in the repo:
| Track | Backend |
|---|---|
| `cpu` | Local GGUF models through embedded llama.cpp |
| `gpu` | Local Transformers models on CUDA |
| `remote` | Hosted models through LiteLLM-compatible APIs |
Running it
Write your own matrix.json
The runner does not guess which models you want to compare. It reads them from benchmark/matrix.json by default, or from a custom file passed with --matrix.
The file format is:

    {
      "scenarios": [
        {
          "name": "cpu-my-model",
          "track": "cpu",
          "phase": "cpu-screen",
          "model": "unsloth/Qwen3.5-0.8B-GGUF",
          "gguf_filename": "Qwen3.5-0.8B-Q8_0.gguf",
          "quantization": "none",
          "device": "cpu"
        },
        {
          "name": "gpu-my-model",
          "track": "gpu",
          "phase": "gpu-screen",
          "model": "Qwen/Qwen3.5-0.8B",
          "quantization": "none",
          "device": "cuda"
        },
        {
          "name": "remote-my-model",
          "track": "remote",
          "phase": "remote-screen",
          "model": "gpt-4.1-mini",
          "quantization": "none",
          "device": "remote",
          "concurrency": 8
        }
      ]
    }

Each scenario row becomes one benchmark run target.
Scenario fields
| Field | Required | Meaning |
|---|---|---|
| `name` | Yes | Unique scenario name. This is what you pass to `--scenario`. Keep it short and stable. |
| `track` | Yes | High-level bucket for the viewer: `cpu`, `gpu`, or `remote`. |
| `phase` | Yes | Group name for bulk runs with `--phase`, such as `cpu-screen` or `remote-screen`. |
| `model` | Yes | Model identifier. For CPU/GPU this is the Hugging Face repo or model id. For remote this is the provider model name passed through LiteLLM. |
| `quantization` | Yes | Quantization label stored with the scenario. Use `none` unless you are intentionally benchmarking another quantized GPU setup. |
| `device` | Yes | Runtime path: `cpu`, `cuda`, or `remote`. |
| `gguf_filename` | CPU only | Required for CPU GGUF runs. Omit it for GPU and remote runs. |
| `dtype` | No | GPU precision override. Default: `auto`. |
| `attn_implementation` | No | GPU attention backend override. Default: `auto`. |
| `max_output_tokens` | No | Per-scenario compression budget. Default when omitted: `768`. |
| `concurrency` | No | Number of in-flight cases for that scenario. Default: `1`. Most useful for remote runs. |
| `enabled` | No | Whether this scenario should be considered by the runner. Default: `true`. |
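As an illustration of how the required fields and defaults in this table fit together, the sketch below fills in a scenario row. It is not the runner's actual loader: the function and constant names are hypothetical, and applying the two GPU defaults only when `device` is `cuda` is an assumption based on their descriptions.

```python
REQUIRED_FIELDS = ("name", "track", "phase", "model", "quantization", "device")
COMMON_DEFAULTS = {"max_output_tokens": 768, "concurrency": 1, "enabled": True}
GPU_DEFAULTS = {"dtype": "auto", "attn_implementation": "auto"}


def resolve_scenario(row: dict) -> dict:
    """Check required fields, then layer the row over the documented defaults."""
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise ValueError(f"scenario is missing required fields: {missing}")
    defaults = dict(COMMON_DEFAULTS)
    if row["device"] == "cuda":
        # Assumed: GPU-only overrides are only meaningful on the CUDA path.
        defaults.update(GPU_DEFAULTS)
    # Values present in the row win over defaults.
    return {**defaults, **row}


if __name__ == "__main__":
    remote = resolve_scenario(
        {
            "name": "remote-my-model",
            "track": "remote",
            "phase": "remote-screen",
            "model": "gpt-4.1-mini",
            "quantization": "none",
            "device": "remote",
            "concurrency": 8,
        }
    )
    print(remote["concurrency"], remote["max_output_tokens"], remote["enabled"])
```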
What to change for each track
- CPU: set `track` to `cpu`, `device` to `cpu`, and provide both `model` and `gguf_filename`.
- GPU: set `track` to `gpu`, `device` to `cuda`, and use a normal Transformers model id. Do not set `gguf_filename`.
- Remote: set `track` to `remote`, `device` to `remote`, and use the provider model name in `model`. The actual base URL and API key still come from your CtxSift config or env file, not from `matrix.json`.
Practical rules
- `name` should be unique. If two rows share the same name, your CLI filtering becomes ambiguous.
- `track` and `device` should agree. Do not create rows like `track=cpu` with `device=cuda`.
- `phase` is just a label, but it should match the track you plan to run together. For example, keep CPU rows under `cpu-screen`.
- `quantization` is not used for CPU GGUF runtime behavior. It is mainly a scenario label there, and should normally stay `none`.
- `concurrency` higher than `1` is most useful for remote runs. For local CPU and many small local GPU models, higher concurrency can distort latency or exhaust memory.
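Several of these rules can be checked mechanically before a run. The following is a hypothetical pre-flight check, not something the runner is documented to do. It assumes the track-to-device pairing described for each track above and reports problems as plain strings.

```python
import json
from collections import Counter

# Expected device for each track, per the track descriptions in this guide.
TRACK_DEVICE = {"cpu": "cpu", "gpu": "cuda", "remote": "remote"}


def check_matrix(matrix: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    scenarios = matrix.get("scenarios", [])

    # Rule: names must be unique, or --scenario filtering becomes ambiguous.
    counts = Counter(row.get("name") for row in scenarios)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"duplicate scenario name: {name}")

    for row in scenarios:
        name, track, device = row.get("name"), row.get("track"), row.get("device")
        # Rule: track and device should agree.
        if TRACK_DEVICE.get(track) != device:
            problems.append(f"{name}: track={track} does not match device={device}")
        # Rule: gguf_filename is required for CPU rows and omitted elsewhere.
        has_gguf = "gguf_filename" in row
        if track == "cpu" and not has_gguf:
            problems.append(f"{name}: CPU rows need gguf_filename")
        if track != "cpu" and has_gguf:
            problems.append(f"{name}: gguf_filename should be omitted for {track}")
    return problems


if __name__ == "__main__":
    text = '{"scenarios": [{"name": "a", "track": "cpu", "device": "cuda"}]}'
    for problem in check_matrix(json.loads(text)):
        print(problem)
```

Returning a list instead of raising on the first problem makes it easier to fix a hand-written matrix in one pass.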
Use a custom matrix file
If you do not want to edit the repo’s default matrix, write your own file and pass it explicitly:

    uv run python -m benchmark.runner --matrix my-matrix.json --scenario remote-my-model --remote --env-file .env

That is the easiest way to keep one small matrix per machine, provider, or experiment.
List all scenarios

    uv run python -m benchmark.runner --list-scenarios

Run one scenario

    uv run python -m benchmark.runner --scenario cpu-granite-4.0-350m-no-quant

Run a full screening phase

    uv run python -m benchmark.runner --phase cpu-screen

Run a smaller subset for a quick smoke test

    uv run python -m benchmark.runner --scenario gpu-lfm2.5-1.2b-instruct-no-quant --max-cases 50

Run selected case IDs only

    uv run python -m benchmark.runner --scenario cpu-granite-4.0-350m-no-quant --case-id pytest-01 --case-id kubectl-09

Print model output during the run

    uv run python -m benchmark.runner --scenario gpu-qwen3.5-0.8b-no-quant --show-output

Name the run yourself

    uv run python -m benchmark.runner --scenario gpu-qwen2.5-1.5b-no-quant --name smoke

If you do not provide --name, each scenario gets its own timestamped folder under benchmark/results/.
For remote scenarios, use the same dataset and scoring rules through the configured LiteLLM backend:

    uv run python -m benchmark.runner --remote --env-file .env --scenario remote-gpt-4o-mini

The runner loads the env file first, resolves the standard remote configuration, and executes the selected remote scenario with concurrent requests to keep wall-clock time reasonable.
To inspect results in the local dashboard:

    uv run python -m benchmark.viewer --open ./benchmark/results/

If you want the latest generated snapshot directly, open the static viewer page here:

    uv run python -m benchmark.viewer --open ./benchmark/results/remote-gpt-5.4-mini-20260518T065257Z

You can also point the viewer at one run folder instead of the full tree:

    uv run python -m benchmark.viewer --open ./benchmark/results/remote-gpt-5.4-mini-20260518T065257Z

The viewer is the fastest way to compare models side by side, inspect acceptance counts, and see how quality and latency shift across CPU, GPU, and remote runs.