Model Selection Guide

If you already know exactly which model you want, go to the local or remote guides and set it directly. If you do not, start here. The purpose of this document is not to list every model that can work. The goal is to narrow the choice to the few models that make sense for CtxSift’s actual job: compress noisy command output into something an agent can reuse later without losing anchors, breaking structure, or wasting tokens.

The recommendations below are based on the latest bundled benchmark snapshots in this repo:

  • CPU: benchmark/results/cpu-models-20260524T014526Z
  • GPU: benchmark/results/gpu-models-20260524T212353Z
  • Remote: benchmark/results/remote-models-20260523T233753Z

Throughout this guide, "Score" means the benchmark’s main recovered score, not the raw unrecovered score.


Start with the runtime table below. Most bad model decisions happen because people start from the model family instead of the runtime they actually have.

| Your setup | Start here |
| --- | --- |
| CPU only, no CUDA | Stay on the local GGUF path |
| CUDA GPU available | Use the local Transformers GPU path |
| You are fine calling external APIs | Use the remote path |
| You are unsure | Start local first, then benchmark before spending money or VRAM |
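The table above can be read as a small decision function. The following is a minimal sketch, not part of CtxSift: the function name `suggest_path` and its parameters are hypothetical, and treating an `nvidia-smi` binary on `PATH` as evidence of a usable CUDA setup is only a rough heuristic.

```python
import shutil
from typing import Optional


def suggest_path(allow_remote: bool = False, has_cuda: Optional[bool] = None) -> str:
    """Map a runtime situation to the starting path from the table above."""
    if allow_remote:
        # You are fine calling external APIs.
        return "remote"
    if has_cuda is None:
        # Assumption: finding nvidia-smi on PATH suggests a CUDA-capable machine.
        # It does not prove that the driver or GPU is actually usable.
        has_cuda = shutil.which("nvidia-smi") is not None
    return "local Transformers GPU" if has_cuda else "local GGUF"


if __name__ == "__main__":
    print(suggest_path())
```

If you are unsure, calling `suggest_path()` with no arguments stays local, which matches the "start local first" row.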

If privacy, offline operation, and local recall matter most, stay local. If you want the highest-quality compression and do not mind provider cost or network dependence, use remote.


These are the shortest honest recommendations right now.

| Situation | Best first choice | Why |
| --- | --- | --- |
| You just installed CtxSift on CPU | ibm-granite/granite-4.0-350m-GGUF | It is the built-in default and the fastest tested CPU model |
| You want the best local CPU upgrade | unsloth/Qwen3.5-0.8B-GGUF | Best CPU score in the current run, without becoming painfully slow |
| You want the best practical CUDA default | LiquidAI/LFM2.5-1.2B-Instruct | Fastest GPU model by far, while still scoring well |
| You want the highest local GPU quality | Qwen/Qwen3.5-2B | Best GPU score in the current run |
| You want the best hosted quality | gpt-4.1 | Best remote result in the current run |
| You want a cheaper hosted default | gpt-4o-mini | Fast, reliable, and much cheaper than flagship remote models |

If you do not want to think about it any further, those are the right starting points.


CPU is where the tradeoff matters most, because one step up in quality can easily cost you 2x or 3x latency.

Stay on the default when

  • you want the quickest path to a working install
  • you care most about low latency on CPU
  • you do not want to think about model tuning yet

The current default, ibm-granite/granite-4.0-350m-GGUF, is still the fastest tested CPU model in the latest run at 2.14 s average inference. That is why it remains the product default. The point of the default is not to win the benchmark. The point is to be safe, small, and quick enough to make first-run local compression feel usable.

Move to Qwen3.5 0.8B when

  • you want the best local CPU quality
  • you can tolerate roughly 4.5 s average inference instead of 2.1 s
  • you want a clear step up without moving to CUDA or remote

unsloth/Qwen3.5-0.8B-GGUF is the strongest CPU model in the current run at 56.45. This is the main CPU recommendation if the default feels too weak.

Consider the small LFM or Qwen2.5 variants when

  • you care about CPU latency almost as much as the default
  • you still want a noticeable quality step up

Two interesting middle-ground CPU options in the current run are:

| Model | Avg. Inference (s) | Score | Why you would choose it |
| --- | --- | --- | --- |
| LiquidAI/LFM2.5-350M-GGUF | 2.38 | 49.92 | Near-default speed with a healthier score |
| Qwen/Qwen2.5-0.5B-Instruct-GGUF | 3.30 | 53.06 | Stronger score than the small LFM, still much faster than the top CPU pick |

These are good if you want something meaningfully better than Granite 350M without paying the full latency cost of the top CPU choice.
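One way to reason about the CPU tradeoff is to fix a latency budget and take the best-scoring model that fits. The sketch below is illustrative only: `best_under_budget` is a hypothetical helper, the figures are copied from this guide, and the 4.5 s value for the top CPU pick is the rounded "roughly 4.5 s" number rather than an exact measurement. The default's score is not quoted in this guide, so it appears only as the fallback.

```python
# (model, average inference in seconds, recovered score), copied from this guide.
CPU_RESULTS = [
    ("LiquidAI/LFM2.5-350M-GGUF", 2.38, 49.92),
    ("Qwen/Qwen2.5-0.5B-Instruct-GGUF", 3.30, 53.06),
    ("unsloth/Qwen3.5-0.8B-GGUF", 4.5, 56.45),  # latency is approximate
]

# The built-in default (2.14 s average inference); used when nothing else fits.
DEFAULT_MODEL = "ibm-granite/granite-4.0-350m-GGUF"


def best_under_budget(results, max_seconds, fallback=DEFAULT_MODEL):
    """Return the highest-scoring model whose latency fits the budget."""
    eligible = [row for row in results if row[1] <= max_seconds]
    if not eligible:
        return fallback
    return max(eligible, key=lambda row: row[2])[0]


if __name__ == "__main__":
    for budget in (2.2, 2.5, 3.5, 5.0):
        print(budget, "->", best_under_budget(CPU_RESULTS, budget))
```

Latencies on your own CPU will differ from the benchmark machine, so treat the budget as relative rather than absolute.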

Avoid these unless you have a specific reason

  • unsloth/gemma-3-270m-it-GGUF
  • unsloth/gemma-3-1b-it-GGUF
  • LiquidAI/LFM2-350M-Extract-GGUF

They are not unusable in an absolute sense, but in the current CtxSift benchmark they are weak enough that there is usually a better option at nearby speed.


GPU changes the tradeoff shape. Once you have CUDA, the question is usually not “can I run local compression at all?” The question becomes “do I want speed or do I want the best local score?”

If you are on CUDA and you want the least risky starting point, use LiquidAI/LFM2.5-1.2B-Instruct.

Why:

  • it was the fastest GPU model in the current run at 0.81 s
  • it still scored 54.61
  • it is the easiest local CUDA model to recommend without caveats

This is the right default for most people with a usable NVIDIA card.

Move to Qwen3.5 2B when quality matters more than latency

If your goal is “best local CUDA score, even if it is slower”, move to Qwen/Qwen3.5-2B.

In the latest run:

  • Qwen/Qwen3.5-2B scored 61.07
  • average inference was 16.92 s

That is a real quality step up, about 6.5 points over LiquidAI/LFM2.5-1.2B-Instruct, but it also costs roughly 21x the latency. Use it when the stronger compression is worth the wait.

These are the two models worth considering between the fast default and the best-quality upgrade:

| Model | Avg. Inference (s) | Score | Comment |
| --- | --- | --- | --- |
| Qwen/Qwen3.5-0.8B | 3.43 | 59.13 | Strong small GPU model and much faster than the 2B tier |
| Qwen/Qwen2.5-1.5B-Instruct | 7.80 | 59.28 | Slightly higher score, but at more than double the latency |

If you want a sharper local model than LFM, but do not want to jump all the way to Qwen3.5 2B, these are the main two to compare.
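To compare the GPU options on equal footing, it can help to express each one as a score gain and a latency multiple relative to the fast default. This is a minimal sketch using only the numbers quoted in this guide. `upgrade_cost` is a hypothetical helper, not a CtxSift function.

```python
# model -> (average inference in seconds, recovered score), copied from this guide.
GPU_RESULTS = {
    "LiquidAI/LFM2.5-1.2B-Instruct": (0.81, 54.61),
    "Qwen/Qwen3.5-0.8B": (3.43, 59.13),
    "Qwen/Qwen2.5-1.5B-Instruct": (7.80, 59.28),
    "Qwen/Qwen3.5-2B": (16.92, 61.07),
}


def upgrade_cost(results, baseline):
    """List (model, score gain, latency multiple) for each model versus the baseline."""
    base_seconds, base_score = results[baseline]
    rows = []
    for name, (seconds, score) in results.items():
        if name == baseline:
            continue
        gain = round(score - base_score, 2)
        multiple = round(seconds / base_seconds, 1)
        rows.append((name, gain, multiple))
    return rows


if __name__ == "__main__":
    for name, gain, multiple in upgrade_cost(GPU_RESULTS, "LiquidAI/LFM2.5-1.2B-Instruct"):
        print(f"{name}: +{gain} score at {multiple}x latency")
```

Reading the output side by side makes it easier to see where an extra point of score starts to cost disproportionately more time.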

The current benchmark does not make a strong case for:

  • unsloth/gemma-3-1b-it
  • ibm-granite/granite-4.0-micro
  • ibm-granite/granite-3.3-2b-instruct

Again, this does not mean they are universally bad models. It means they are not especially strong CtxSift compression picks relative to the better options in the same local run.


Remote is mostly a cost, latency, and quality decision.

gpt-4.1 is the strongest hosted result in the latest run:

  • 88.17 score
  • 1.33 s average inference
  • 1 rejected case

If you are optimizing for quality first, that is the current remote winner.

Use gpt-4o-mini when you want the safest everyday default

gpt-4o-mini is not the best score in the remote set, but it is one of the easiest remote recommendations because it stays:

  • fast
  • reliable
  • cheaper than flagship hosted models

It posted 84.61 with only 1 rejected case, which is strong enough for a default hosted path.

Use gpt-4.1-mini when you want a middle ground

gpt-4.1-mini is the practical middle ground between gpt-4.1 and gpt-4o-mini in the current run. It keeps most of the flagship's quality while staying close to it in speed.

Based on the current benchmark, do not use these for CtxSift compression:

  • gpt-5-nano
  • gpt-5-mini

Both underperformed badly in the latest run. This is not a pricing opinion or a general model judgment. It is a CtxSift compression benchmark result.


If you prefer to think in priorities instead of hardware paths, use this matrix.

| Priority | Recommended choice |
| --- | --- |
| Fastest local CPU path | ibm-granite/granite-4.0-350m-GGUF |
| Best local CPU quality | unsloth/Qwen3.5-0.8B-GGUF |
| Best near-default CPU upgrade | LiquidAI/LFM2.5-350M-GGUF |
| Fastest practical local CUDA path | LiquidAI/LFM2.5-1.2B-Instruct |
| Best local CUDA quality | Qwen/Qwen3.5-2B |
| Best hosted quality | gpt-4.1 |
| Cheapest safe hosted default | gpt-4o-mini |

When the benchmark should overrule this guide

This page is auto-generated from the benchmark snapshots bundled in the repo. That is already far better than generic model advice, but it is still not your exact machine.

You should run the benchmark yourself when:

  • your CPU is much weaker or much stronger than the benchmark machine
  • your GPU has very different VRAM or throughput characteristics
  • you care about one narrow class of outputs more than the full benchmark corpus
  • you want to compare recovered score versus raw score for your own target model

Use the benchmark when the choice is close. Use this guide when you just want the right short list.
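What counts as "close" is a judgment call. As a rough sketch, you could treat two models as a close call when their recovered scores differ by less than some margin. The 1.0-point margin and the `is_close_call` helper below are assumptions for illustration, not thresholds defined by CtxSift or its benchmark.

```python
def is_close_call(score_a: float, score_b: float, min_gap: float = 1.0) -> bool:
    """Return True when two recovered scores are within min_gap points of each other.

    min_gap is a hypothetical margin; choose one that reflects how much
    run-to-run variation you see on your own machine.
    """
    return abs(score_a - score_b) < min_gap


if __name__ == "__main__":
    # Qwen/Qwen3.5-0.8B vs Qwen/Qwen2.5-1.5B-Instruct on GPU (scores from this guide).
    print(is_close_call(59.13, 59.28))
    # LiquidAI/LFM2.5-1.2B-Instruct vs Qwen/Qwen3.5-2B on GPU.
    print(is_close_call(54.61, 61.07))
```

When a pair comes out as a close call, latency and your own benchmark run are better tie-breakers than the bundled score alone.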