Local | Free | Open source

Save tokens and extend your coding sessions.

Raw command output and state recovery after compaction are among the biggest sources of token waste and context clutter. CtxSift is an agent skill that saves tokens by keeping sessions clutter-free and helping agents recover state faster.


Works with Codex, Antigravity, Copilot, Claude Code, Cursor, OpenCode, Cascade, Replit, Aider, and other agents.


The motivation behind CtxSift

Show only high-signal tokens to the agent after command runs and context compactions.

In agentic workflows, raw command output and state recovery after compaction are major sources of token waste. Agents often pull raw terminal output into context even when they need only a few anchors, then pay the same cost again later when compaction forces them to reread files or rerun commands.

CtxSift was built to cut that loop down to two operations: keep only the signal that matters now, then recover it later without rebuilding the whole state trail.

It was inspired by the original Distill project and extends that idea to local execution, file rereads, and state recovery after context compaction for coding agents.

Problem

Raw output sprawls

Most command output is noise relative to the next step the agent actually needs to take.

Failure mode

Compaction forgets

After long sessions, the agent often has to reconstruct state through expensive rereads and reruns.

Design goal

Recover state faster

Compression is only useful if the agent can later look up the same finding with trust signals attached.

Approach

Local-first workflow

Run locally when possible, support remote providers when needed, and keep the workflow grounded in real command use.

How it works

With CtxSift, agents keep their token footprint minimal in two steps:
1. Extract and cache only what they need from raw command output.
2. Look up that context later instead of re-running commands or dragging raw terminal output back into the session.
That's it. Other token savers can get heavy and confuse the agent with multiple tools; CtxSift stays simple and light, with no extra tools, MCP servers, or sandbox spin-up dependencies.

Compress

The agent uses pipe or command-capture mode with a natural-language instruction to extract exactly what it needs.
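A minimal Python sketch of what command-capture mode does: run a command, collect stdout and stderr, and hand the raw text plus an instruction to a compressor. CtxSift's real compressor is a language model; the hypothetical `compress` function below is a keyword filter used only so the example runs without a model, and `capture_and_compress` is likewise an illustrative name, not CtxSift's API.

```python
import subprocess

# Hypothetical stand-in for CtxSift's LLM compressor: it keeps only lines that
# look like failures. A model-backed compressor would follow `instruction`.
ERROR_MARKERS = ("error", "fatal", "failed", "exited with code")


def compress(raw: str, instruction: str) -> str:
    """Return the lines of `raw` that contain an error marker (case-insensitive).

    `instruction` is accepted for interface parity but ignored by this stand-in.
    """
    kept = [
        line.strip()
        for line in raw.splitlines()
        if any(marker in line.lower() for marker in ERROR_MARKERS)
    ]
    return "\n".join(kept)


def capture_and_compress(cmd: list[str], instruction: str) -> tuple[str, str]:
    """Command-capture mode: run `cmd`, collect stdout and stderr, then compress."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    raw = proc.stdout + proc.stderr
    return raw, compress(raw, instruction)
```

In pipe mode the same `compress` call would receive text read from standard input instead of running the command itself.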

Cache

The compressed output is stored automatically with command metadata so it can be searched later instead of recreated.
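To make "stored with command metadata" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, and helper functions are hypothetical illustrations; CtxSift's actual storage format is not described on this page.

```python
import sqlite3
import time

# Hypothetical schema: one row per compressed result plus the command context.
SCHEMA = """
CREATE TABLE IF NOT EXISTS records (
    id INTEGER PRIMARY KEY,
    command TEXT NOT NULL,
    cwd TEXT NOT NULL,
    exit_code INTEGER,
    created_at REAL NOT NULL,
    summary TEXT NOT NULL
)
"""


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the record store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store(conn: sqlite3.Connection, command: str, cwd: str,
          exit_code: int, summary: str) -> int:
    """Save a compressed summary alongside the command that produced it."""
    cur = conn.execute(
        "INSERT INTO records (command, cwd, exit_code, created_at, summary) "
        "VALUES (?, ?, ?, ?, ?)",
        (command, cwd, exit_code, time.time(), summary),
    )
    conn.commit()
    return cur.lastrowid


def lookup_by_command(conn: sqlite3.Connection, command: str) -> list[str]:
    """Return cached summaries for an exact command, newest first."""
    rows = conn.execute(
        "SELECT summary FROM records WHERE command = ? "
        "ORDER BY created_at DESC, id DESC",
        (command,),
    )
    return [row[0] for row in rows]
```

Looking up a stored row is far cheaper than re-running the command and compressing its output again, which is the point of caching here.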

Recall

The agent queries the stored records, optionally boosting those tied to specific files, and gets back exactly what it stored earlier.
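A sketch of file-boosted recall, under loud assumptions: CtxSift ranks records with local embeddings, while this example substitutes simple word overlap so it runs with the standard library alone. The `Record` type, the `file_boost` value of 0.5, and the `recall` function are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    summary: str
    command: str
    files: set[str] = field(default_factory=set)  # files the result relates to


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def score(query: str, record: Record, boost_files: set[str],
          file_boost: float = 0.5) -> float:
    """Fraction of query words found in the record, plus a bonus for file matches."""
    q = tokens(query)
    if not q:
        return 0.0
    base = len(q & tokens(record.summary + " " + record.command)) / len(q)
    return base + file_boost if boost_files & record.files else base


def recall(query: str, records: list[Record], boost_files=(), top_k: int = 3) -> list[Record]:
    """Return up to `top_k` records with a positive score, best first."""
    boost = set(boost_files)
    scored = [(score(query, r, boost), r) for r in records]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [r for _, r in scored[:top_k]]
```

The boost only reorders matches; it never surfaces a record that shares nothing with the query.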

Freshness

When source files change, results get marked stale so older context gets down-ranked and eventually cleaned up.
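One way to implement this freshness rule, sketched in Python under stated assumptions: the digests of a record's source files are saved when it is cached, a record counts as stale once any digest changes or the file disappears, stale scores are multiplied by a penalty, and long-stale records are evicted. Content hashing, the 0.5 penalty, and every function name here are hypothetical choices, not CtxSift's documented behavior.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Optional

STALE_PENALTY = 0.5  # hypothetical down-ranking factor for stale records


def file_digest(path: str) -> Optional[str]:
    """SHA-256 of the file's current bytes, or None if it no longer exists."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def snapshot(paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Record source-file digests at the moment a result is cached."""
    return {path: file_digest(path) for path in paths}


def is_stale(saved: Dict[str, Optional[str]]) -> bool:
    """Stale if any source file changed or disappeared since caching."""
    return any(file_digest(path) != digest for path, digest in saved.items())


def adjusted_score(base_score: float, saved: Dict[str, Optional[str]]) -> float:
    """Down-rank stale records instead of dropping them outright."""
    return base_score * STALE_PENALTY if is_stale(saved) else base_score


def should_evict(stale_since: Optional[float], now: float, max_stale_seconds: float) -> bool:
    """Clean up records that have stayed stale longer than a hypothetical limit."""
    return stale_since is not None and now - stale_since > max_stale_seconds
```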

Real compression examples

See what CtxSift keeps from raw command output and what it strips away.

These cases come from benchmark fixtures and the latest benchmarked outputs, with token counts shown for the raw input and the final compressed record.
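Each "Smaller" figure follows from its two token counts as 1 − output/raw, expressed as a percentage. This short Python check reproduces the three percentages using only the counts listed on this page; the function name is illustrative.

```python
def percent_smaller(raw_tokens: int, output_tokens: int) -> float:
    """Reduction relative to the raw token count, as a percentage rounded to 0.1."""
    return round(100 * (1 - output_tokens / raw_tokens), 1)


# (raw tokens, output tokens) for the three benchmark fixtures shown here.
EXAMPLES = {
    "systemd-02": (430, 25),
    "git-03": (461, 32),
    "docker-compose-02": (364, 28),
}

for name, (raw, out) in EXAMPLES.items():
    print(f"{name}: {percent_smaller(raw, out)}% smaller")
# systemd-02: 94.2% smaller
# git-03: 93.1% smaller
# docker-compose-02: 92.3% smaller
```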

Service restart-loop summary

systemd summary

Command: systemctl restart api.service

Raw tokens: 430 | Output tokens: 25 | 94.2% smaller

Before

$ systemctl restart api.service
Job for api.service failed because the control process exited with error code.
See "systemctl status api.service" and "journalctl -u api.service -n 50" for details.
$ systemctl status api.service --no-pager
* api.service - HTTP API
May 16 11:27:40 buildbox api-server[21871]: {"level":"error","msg":"parse config","file":"/etc/api/config.yaml","line":17,"error":"yaml: unmarshal errors: line 17: field timeuot not found in type config.Server"}
May 16 11:27:41 buildbox systemd[1]: api.service: Start request repeated too quickly.

After

api.service fails due to a config error at /etc/api/config.yaml line 17: unexpected field "timeuot".

Source: benchmark fixture systemd-02 and latest remote gpt-4.1 benchmark output

Clone failure recall

git recall

Command: git clone --progress https://github.com/example/very-large-repo.git

Raw tokens: 461 | Output tokens: 32 | 93.1% smaller

Before

$ git clone --progress https://github.com/example/very-large-repo.git
Cloning into 'very-large-repo'...
remote: Enumerating objects: 248731, done.
remote: Compressing objects: 100% (84791/84791), done.
Receiving objects:  64% (159188/248731), 125.01 MiB | 210.00 KiB/s
error: RPC failed; curl 56 Recv failure: Connection reset by peer
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

After

git clone --progress https://github.com/example/very-large-repo.git failed with curl 56 Connection reset by peer and fetch-pack: invalid index-pack output

Source: benchmark fixture git-03 and latest remote gpt-4.1 benchmark output

Compose startup recall

docker-compose recall

Command: docker compose up --build

Raw tokens: 364 | Output tokens: 28 | 92.3% smaller

Before

$ docker compose up --build
[+] Running 3/3
[ok] Network demo_default     Created
[ok] Container demo-db-1      Created
[ok] Container demo-app-1     Created
db-1   | database system is ready to accept connections
app-1  | applying migrations
app-1  | ERROR sqlalchemy.exc.ProgrammingError: relation "tenant_settings" does not exist
app-1 exited with code 1
Aborting on container exit...

After

docker compose up --build, demo-app-1, tenant_settings, sqlalchemy.exc.ProgrammingError, app-1 exited with code 1

Source: benchmark fixture docker-compose-02 and latest remote gpt-4.1 benchmark output

Supported Models

Use local models on CPU/GPU or remotely hosted LLMs for compression.

By default, CtxSift starts with a small GGUF model on local CPU. If you have CUDA available, local compression can use normal Hugging Face text-generation models instead. If you prefer hosted inference, remote compression works through LiteLLM-compatible endpoints.

Recall embeddings stay local and separate from compression, so the retrieval path remains the same whether compression is local or remote.

Support at a glance
Component	Default / support	How it works
Local compression	Granite 4.0 350M GGUF by default	CPU uses GGUF through built-in llama.cpp. CUDA local mode supports normal Hugging Face text-generation models.
Remote compression	Any LiteLLM-compatible provider	Enabled when remote base URL and model name are configured. Replaces local compression, not local embeddings.
Recall embeddings	Harrier OSS v1 0.6B	Used for storing and recalling records regardless of whether compression is local or remote.
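The table can be read as a small selection rule: a configured remote endpoint replaces local compression, CUDA enables a Hugging Face model locally, and otherwise the GGUF CPU default applies, while embeddings always stay local. The Python sketch below encodes that reading. The `Config` field names and the backend label strings are hypothetical, not CtxSift's real configuration keys, and it assumes CUDA mode is opted into rather than auto-detected.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_CPU_MODEL = "Granite 4.0 350M GGUF"
EMBEDDING_MODEL = "Harrier OSS v1 0.6B"


@dataclass
class Config:
    # Hypothetical field names for illustration only.
    remote_base_url: Optional[str] = None
    remote_model: Optional[str] = None
    use_cuda: bool = False
    local_gpu_model: Optional[str] = None


def select_backends(cfg: Config) -> dict[str, str]:
    """Pick the compression backend; embeddings are always local."""
    if cfg.remote_base_url and cfg.remote_model:
        # Remote needs both a base URL and a model name, and replaces local compression.
        compression = f"remote:{cfg.remote_model}@{cfg.remote_base_url}"
    elif cfg.use_cuda and cfg.local_gpu_model:
        compression = f"local-cuda:{cfg.local_gpu_model}"
    else:
        compression = f"local-cpu:{DEFAULT_CPU_MODEL}"
    # Recall embeddings never switch to remote, matching the table above.
    return {"compression": compression, "embeddings": f"local:{EMBEDDING_MODEL}"}
```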

CtxSift supports more local model families than the defaults listed here. The benchmarked picks below are the quickest way to start from known-good options.

Built-in default

Granite 4.0 350M GGUF

The built-in local CPU default. Fastest tested CPU model in the latest run at 2.14 s average inference, with a 46.93 benchmark score.

Recommended CPU

Qwen3.5 0.8B GGUF

Best overall CPU model in the latest local run: 56.45 score, 4.54 s average inference, and only 16 rejected cases out of 280.

Recommended GPU

LFM 2.5 1.2B

Fastest practical CUDA option in the latest local run: 0.81 s average inference with a 54.61 score. Best first GPU pick.

Higher-quality GPU

Qwen3.5 2B

Highest-scoring local GPU model in the latest run: 61.07 score at 16.92 s average inference. Good upgrade when quality matters more than latency.
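To turn the four picks into a quick decision, one option is to take the highest-scoring model on your device whose average inference time fits a latency budget. This Python sketch uses exactly the scores and timings quoted above; the latency-budget rule and the `pick_model` name are an illustration, not an official CtxSift recommendation procedure.

```python
from typing import NamedTuple, Optional


class Benchmark(NamedTuple):
    model: str
    device: str          # "cpu" or "gpu"
    score: float         # benchmark score from the latest run
    avg_seconds: float   # average inference time per case


# Figures copied from the benchmarked picks listed on this page.
PICKS = [
    Benchmark("Granite 4.0 350M GGUF", "cpu", 46.93, 2.14),
    Benchmark("Qwen3.5 0.8B GGUF", "cpu", 56.45, 4.54),
    Benchmark("LFM 2.5 1.2B", "gpu", 54.61, 0.81),
    Benchmark("Qwen3.5 2B", "gpu", 61.07, 16.92),
]


def pick_model(device: str, max_seconds: float) -> Optional[str]:
    """Highest-scoring pick on `device` whose average latency fits the budget."""
    fitting = [b for b in PICKS if b.device == device and b.avg_seconds <= max_seconds]
    if not fitting:
        return None
    return max(fitting, key=lambda b: b.score).model
```

For example, a 3-second CPU budget lands on the Granite default, while relaxing it to 5 seconds moves to Qwen3.5 0.8B.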

Choose Your Setup Path

Start with the runtime path that matches your machine and workflow.

Use local CPU for the simplest default path, local GPU when you want faster local inference, and remote provider mode when you want hosted models through a LiteLLM-compatible endpoint.

Benchmarked model comparisons live on their own page. Use the benchmark guide when you want tested CPU and GPU recommendations rather than setup instructions.

Local CPU

Start with the built-in local path and get the lowest-friction setup for everyday use.

Local GPU

Use CUDA-backed local models when you want faster inference and stronger on-device options.

Remote Provider

Connect OpenAI-compatible or LiteLLM-compatible hosted models when local inference is not the right fit.

Common setup questions

CPU vs GPU install? Can I use remote models? What are the configuration options? How do I install the skill? Why is my GPU not detected? Which model is best for me?

Automated Install

Install with standalone scripts or let your agent install it with the install skill.

Standalone install scripts are available for Linux, macOS, and Windows.