Output Processing
When CtxSift asks a model to compress noisy output, the model’s first answer is not treated as final truth.
CtxSift runs that answer through an output-processing pipeline before it stores the record, returns the final text, or scores the model in the benchmark. That pipeline exists because many real models do at least one of these:
- wrap otherwise-correct output in code fences
- echo prompt scaffolding like `Instruction:` or `Output form:`
- leak chat control tokens like `<|im_end|>`
- leak visible reasoning such as `Okay, the user wants...`
- return the right facts in a slightly wrong wrapper
Output processing is CtxSift’s answer to that mess. It tries to make the final output usable without pretending the model was cleaner than it really was.
The four stages
At a high level, CtxSift processes model output in four steps:
- The model returns a raw answer
- CtxSift normalizes it into a cleaner candidate
- CtxSift validates that candidate against the requested intent
- If recovery is enabled and the answer looks salvageable, CtxSift tries deterministic recovery
The important split is:
- Normalization is always on
- Recovery is optional and controlled by `recovery_enabled`
Normalization makes the output easier to judge. Recovery tries to rescue outputs that are close enough to be safely repaired.
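The four steps above can be sketched as a small driver function. This is an illustrative outline only: `process_output`, `ProcessedOutput`, `RANK`, and the toy `demo_*` helpers are hypothetical names, not CtxSift's internal API, and the rule "keep the repaired text only if it validates better" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed ordering of validation outcomes, worst to best.
RANK = {"rejected": 0, "soft_accepted": 1, "accepted": 2}


@dataclass
class ProcessedOutput:
    raw: str         # the model's first visible answer, kept for the benchmark
    final: str       # what the product would return
    status: str      # "accepted", "soft_accepted", or "rejected"
    recovered: bool  # True if deterministic recovery changed the result


def process_output(
    raw: str,
    normalize: Callable[[str], str],
    validate: Callable[[str], str],
    recover: Callable[[str], Optional[str]],
    recovery_enabled: bool = True,
) -> ProcessedOutput:
    candidate = normalize(raw)      # normalization: always on
    status = validate(candidate)    # validation against the requested intent
    recovered = False
    if recovery_enabled and status != "accepted":
        repaired = recover(candidate)  # rule-based repair, no model call
        if repaired is not None:
            repaired_status = validate(repaired)
            if RANK[repaired_status] > RANK[status]:
                candidate, status, recovered = repaired, repaired_status, True
    return ProcessedOutput(raw, candidate, status, recovered)


# Toy stand-ins so the sketch runs end to end.
def demo_validate(text: str) -> str:
    if not text or text.startswith("Instruction:"):
        return "rejected"
    return "accepted"


def demo_recover(text: str) -> Optional[str]:
    first, _, rest = text.partition("\n")
    if first.startswith("Instruction:") and rest.strip():
        return rest.strip()
    return None
```

With the toy helpers, an answer that echoes an `Instruction:` line is rejected when recovery is off and accepted when recovery is on, while the raw text is preserved in both cases.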
Stage 1: Raw output
The raw output is the model’s first visible answer before CtxSift cleans it up.
This matters most in the benchmark. The benchmark keeps the raw answer so it can answer a simple question honestly: how clean was the model by itself?
In normal product use, CtxSift does not return the raw answer. It returns the recovered final answer by default.
Stage 2: Normalization
Normalization is the always-on cleanup pass. It is not meant to “fix” the answer semantically. It is meant to remove wrappers and obvious junk so validation can judge the actual payload more fairly.
Depending on intent, normalization can do things like:
- strip leaked reasoning tags such as `<thinking>` or `<analysis>`
- strip known chat control tokens such as `<|im_end|>` and similar wrappers
- collapse outer blank lines
- strip plain-text headings and bullet wrappers
- unwrap a single fenced block when that is safe for the requested contract
- trim leading visible thought preamble such as `Okay, the user wants...`
The exact cleanup is intent-aware.
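A few of the cleanup steps listed above can be sketched with regular expressions. The patterns and the `normalize` function below are assumptions for illustration, not CtxSift's actual rules, and the sketch ignores intent for brevity.

```python
import re

FENCE = "`" * 3  # built at runtime so the example contains no literal fence

# Hypothetical patterns; the real rule set is not documented here.
CONTROL_TOKEN = re.compile(r"<\|[A-Za-z0-9_]+\|>")
REASONING_BLOCK = re.compile(r"<(thinking|think|analysis)>.*?</\1>", re.DOTALL)
WHOLE_FENCE = re.compile(
    r"\A" + FENCE + r"[\w-]*\n(.*?)\n" + FENCE + r"\Z", re.DOTALL
)


def normalize(text: str) -> str:
    text = REASONING_BLOCK.sub("", text)  # strip leaked reasoning tags
    text = CONTROL_TOKEN.sub("", text)    # strip chat control tokens
    text = text.strip()                   # collapse outer blank lines
    match = WHOLE_FENCE.match(text)
    # Unwrap only when the whole answer is exactly one fenced block.
    if match and FENCE not in match.group(1):
        text = match.group(1).strip()
    return text
```

The fence check is deliberately narrow: an answer that contains two fenced blocks is left untouched rather than guessed at.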
Plain-text intents
For `summary` and `recall`, normalization is more forgiving. If the answer contains a safe removable preamble or a fence wrapper around otherwise-normal prose, CtxSift will usually strip it.
That is why a model can answer with:
Okay, the user wants the failing test ids.
- tests/api/test_users.py::test_create_user
- tests/api/test_users.py::test_delete_user

and still end up with a clean final plain-text result.
Strict intents
For `exact-lines`, `exact-format`, `json`, `yaml`, `table`, and `bullet-list`, normalization is stricter.
CtxSift will still remove some clearly extraneous wrappers, but only when doing so does not change the meaning of the payload. For example:
- a whole-answer fence like `` ```json ... ``` `` can be unwrapped
- a visible thought preamble can be trimmed if the remaining payload still matches the requested contract
But if the inner payload still does not satisfy the requested exact or structured shape, normalization does not try to invent a fix.
Stage 3: Validation
After normalization, CtxSift validates the candidate against the requested intent.
This is where output processing becomes contract-aware. A candidate that is fine for `summary` can still be invalid for `json`, and a candidate that is acceptable for `recall` can still fail `exact-format`.
Validation checks for things like:
- empty output
- prompt scaffolding echo
- role token leakage
- leaked control tokens
- leaked visible thought
- missing anchors for exact-line style tasks
- obvious structured contract breakage
- obvious exact-format contract breakage
Validation produces one of three outcomes:
| Status | Meaning |
|---|---|
| `accepted` | The candidate looks valid for the requested intent |
| `soft_accepted` | The candidate is usable, but has quality problems such as sparse leakage or recoverable wrapper mistakes |
| `rejected` | The candidate clearly fails the requested contract |
This is also where CtxSift distinguishes different kinds of output damage.
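A minimal sketch of how the three outcomes could be produced, assuming a tiny marker list and using `json` as the only structured check. `Status`, `LEAK_MARKERS`, and `validate` are hypothetical names, and the real validator checks more than this.

```python
import json
from enum import Enum


class Status(Enum):
    ACCEPTED = "accepted"
    SOFT_ACCEPTED = "soft_accepted"
    REJECTED = "rejected"


STRICT_INTENTS = {"exact-lines", "exact-format", "json", "yaml", "table", "bullet-list"}
# Assumed leak markers, taken from the examples in this page.
LEAK_MARKERS = ("<|im_end|>", "Instruction:", "Output form:")


def validate(candidate: str, intent: str) -> Status:
    if not candidate.strip():
        return Status.REJECTED  # empty output
    leaked = any(marker in candidate for marker in LEAK_MARKERS)
    if intent == "json":
        try:
            json.loads(candidate)  # structured contract: must parse
        except json.JSONDecodeError:
            return Status.REJECTED
        return Status.ACCEPTED
    if intent in STRICT_INTENTS:
        return Status.REJECTED if leaked else Status.ACCEPTED
    # Plain-text intents: leakage is a quality defect, not a hard failure.
    return Status.SOFT_ACCEPTED if leaked else Status.ACCEPTED
```

The same leaked token therefore yields `soft_accepted` for a plain-text intent and `rejected` for a strict one, which is the contract-aware behavior described above.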
Sparse vs dense leakage
For plain-text intents, sparse leaked control tokens or sparse visible thought are usually treated as quality defects, not automatic hard failures.
For strict intents, the standard is much tighter. If leaked wrapper text or leaked reasoning makes the output unusable as exact text or machine-readable structure, the answer is rejected unless safe cleanup can remove the wrapper without changing the payload.
Recovery
Recovery is the optional fourth stage, controlled by `recovery_enabled`.
If recovery is on, CtxSift checks whether the answer looks like a salvageable wrapper failure rather than a fundamentally wrong answer. If so, it generates a small set of deterministic recovery candidates and picks the best one.
This is not another model call. It is a rule-based repair pass over the same text.
Typical recovery cases include:
- prompt scaffolding at the top, with the real answer underneath
- a single fenced block around otherwise-valid JSON, YAML, table text, or plain text
- a leading visible thought preamble before the actual payload
- known control tokens around an otherwise-correct answer
Recovery does not try to repair the content itself. It does not invent missing fields, infer missing objects, or rewrite wrong commands into right ones.
If the answer is fundamentally wrong, recovery leaves it wrong and validation still fails.
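The "generate candidates, pick the best" idea can be sketched as follows. The two rules, the candidate set, and the tie-break (prefer the longest valid candidate, meaning the one that removed the least text) are all assumptions made for illustration. The source does not say how CtxSift ranks candidates.

```python
import json
from typing import Callable, List, Optional

CONTROL_TOKEN = "<|im_end|>"
SCAFFOLD_PREFIXES = ("Instruction:", "Output form:")


def strip_scaffolding(text: str) -> str:
    # Drop leading lines that echo prompt scaffolding.
    lines = text.splitlines()
    while lines and lines[0].startswith(SCAFFOLD_PREFIXES):
        lines.pop(0)
    return "\n".join(lines).strip()


def strip_control_tokens(text: str) -> str:
    return text.replace(CONTROL_TOKEN, "").strip()


RULES: List[Callable[[str], str]] = [strip_scaffolding, strip_control_tokens]


def recovery_candidates(text: str) -> List[str]:
    produced = [rule(text) for rule in RULES]  # each rule on its own
    combined = text
    for rule in RULES:                         # all rules in sequence
        combined = rule(combined)
    produced.append(combined)
    unique: List[str] = []
    for candidate in produced:
        if candidate != text and candidate not in unique:
            unique.append(candidate)
    return unique


def recover(text: str, is_valid: Callable[[str], bool]) -> Optional[str]:
    valid = [c for c in recovery_candidates(text) if is_valid(c)]
    # Assumed heuristic: keep the valid candidate that removed the least.
    return max(valid, key=len) if valid else None


def is_json(text: str) -> bool:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Because every rule only deletes wrapper text, an answer with no valid payload underneath produces no valid candidate and `recover` returns `None`.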
Safe fence unwrap
One common case is fenced output:
```json
{"file":"src/app.py","line":14}
```
For strict and structured intents, CtxSift now treats this as a recoverable wrapper problem when all of the following are true:
- the whole answer is just one fenced block
- the instruction did not explicitly ask for fenced output
- the inner payload parses or matches the requested contract after unwrap
That means:
- raw benchmark view still records the wrapper mistake
- recovered view can accept the unwrapped payload
If the inner payload still does not parse or does not match the requested shape, the answer stays rejected.
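The three conditions can be checked in order, as in this sketch for the `json` intent. `safe_unwrap_json` is a hypothetical helper, and the keyword test for "the instruction asked for fenced output" is a crude stand-in for whatever CtxSift actually does.

```python
import json
from typing import Optional

FENCE = "`" * 3  # built at runtime so the example contains no literal fence


def safe_unwrap_json(answer: str, instruction: str) -> Optional[str]:
    lines = answer.strip().splitlines()
    # Condition 1: the whole answer is exactly one fenced block.
    if len(lines) < 2 or not lines[0].startswith(FENCE) or lines[-1].strip() != FENCE:
        return None
    inner_lines = lines[1:-1]
    if any(line.startswith(FENCE) for line in inner_lines):
        return None
    # Condition 2: the instruction did not ask for fenced output
    # (assumed keyword heuristic).
    lowered = instruction.lower()
    if "fence" in lowered or "code block" in lowered:
        return None
    # Condition 3: the inner payload parses for the requested contract.
    inner = "\n".join(inner_lines).strip()
    try:
        json.loads(inner)
    except json.JSONDecodeError:
        return None
    return inner
```

A `None` result means the answer is left as it was, so a payload that fails to parse stays rejected.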
Visible thought leakage
CtxSift treats visible reasoning as a first-class failure mode.
This includes both tag-style leakage and prose-style meta reasoning:
- `<think>...</think>`
- `<analysis>...</analysis>`
- `Okay, the user wants...`
- `I should return...`
- `However, the instruction says...`
For plain-text intents, CtxSift can safely trim a leading visible-thought preamble in recovered output. That helps keep the product output clean while still allowing the benchmark to notice that the model leaked reasoning in the first place.
For strict intents, visible thought is much more dangerous because even one extra meta line can break exact text or machine-readable structure. In those cases, recovery only helps if the thought text is a removable wrapper around an otherwise valid payload.
The benchmark also tracks visible-thought density directly so semantic similarity alone cannot let a rambly answer score too well.
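One way to picture visible-thought density is as the fraction of non-empty lines that look like leaked reasoning. The definition, the phrase list, and the helper names below are assumptions for illustration. The benchmark's actual metric is not specified here.

```python
import re
from typing import List

# Assumed markers, taken from the examples in this page.
THOUGHT_LINE = re.compile(
    r"\s*(okay, the user wants|i should return|however, the instruction says)",
    re.IGNORECASE,
)
OPEN_TAG = re.compile(r"<(think|thinking|analysis)>", re.IGNORECASE)
CLOSE_TAG = re.compile(r"</(think|thinking|analysis)>", re.IGNORECASE)


def thought_flags(lines: List[str]) -> List[bool]:
    """Mark each line that is prose-style or tag-style visible thought."""
    flags = []
    inside = False  # True while between an opening and a closing tag
    for line in lines:
        opened = bool(OPEN_TAG.search(line))
        closed = bool(CLOSE_TAG.search(line))
        flags.append(inside or opened or closed or bool(THOUGHT_LINE.match(line)))
        if opened and not closed:
            inside = True
        elif closed:
            inside = False
    return flags


def visible_thought_density(text: str) -> float:
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(thought_flags(lines)) / len(lines)


def trim_leading_thought(text: str) -> str:
    """Drop a leading run of thought lines; leave later lines alone."""
    lines = text.splitlines()
    flags = thought_flags(lines)
    start = 0
    while start < len(lines) and (flags[start] or not lines[start].strip()):
        start += 1
    return "\n".join(lines[start:])
```

Applied to the failing-test example from the plain-text section, one of three lines is flagged, and trimming leaves only the two test ids.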
Raw vs recovered in the benchmark
The benchmark keeps two views of every answer:
| View | What it means |
|---|---|
| Raw | The model’s output before deterministic recovery |
| Recovered | The final output after normalization, validation-aware cleanup, and deterministic recovery |
The viewer shows:
- recovered score as the main score
- raw score beside it
- recovery lift so you can see how much recovery helped or hurt
This split matters because it separates two different questions:
- How good is the model on its own?
- How usable is the final product output after CtxSift’s deterministic cleanup?
If raw and recovered are close, the model is naturally clean. If recovered is much higher, the model is messy but salvageable. If both are low, the model is simply not a good fit.
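The reading guide above can be written down as a small helper. Treating recovery lift as `recovered - raw` and the two thresholds (`close` and `low`) are assumptions chosen for illustration. The viewer's exact definitions are not given here.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkScore:
    raw: float        # score of the model's own output
    recovered: float  # score after deterministic cleanup

    @property
    def recovery_lift(self) -> float:
        # Assumed definition: positive means recovery helped.
        return self.recovered - self.raw


def describe(score: BenchmarkScore, close: float = 0.05, low: float = 0.5) -> str:
    if score.raw < low and score.recovered < low:
        return "not a good fit"
    if abs(score.recovery_lift) <= close:
        return "naturally clean"
    if score.recovery_lift > close:
        return "messy but salvageable"
    return "recovery hurt"
```

For example, a model scoring 0.55 raw and 0.88 recovered reads as messy but salvageable, while 0.20 and 0.25 reads as not a good fit.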
Product behavior
In normal product use, CtxSift returns the recovered output by default.
That means product behavior is intentionally closer to “what the agent can actually use” than “what the model literally emitted first.” The benchmark still preserves both views so evaluation stays honest.
You can disable deterministic recovery if you want to compare behavior directly:
`ctxsift config set recovery_enabled false`
or:
`CTXSIFT_RECOVERY_ENABLED=false`
This toggle applies to both local and remote compression.
What output processing does not do
Output processing is deliberately conservative. It does not:
- rewrite wrong facts into right ones
- synthesize missing JSON keys or missing table rows
- guess commands that were never returned
- turn a bad exact-match answer into a good one by semantic similarity
- hide raw benchmark quality by pretending the model was cleaner than it was
That line matters. CtxSift is trying to rescue safe wrapper failures, not launder weak generations into fake correctness.
In practice
The easiest way to think about output processing is:
- normalization removes obvious wrappers and formatting junk
- validation checks whether the candidate matches the requested intent
- recovery rescues safe wrapper failures when possible
- benchmark raw vs recovered tells you whether the model was clean on its own or whether the product had to save it
If you want the full contract for what each intent asks the model to produce, see Compress.