Local Models Guide
CtxSift is built to run fully locally, ensuring your command line history, code snippets, and traces never leave your machine. It utilizes two distinct local model paths:
- Local Compression Model (LLM): Extracts signals and summarizes output.
- Local Embedding Model: Converts text to vectors for semantic recall.
Model default recommendations
Depending on your hardware capability, CtxSift recommends specific models optimized for accuracy, latency, and resource constraints:
| Component | Hardware | Recommended Model | Engine / Backend |
|---|---|---|---|
| Compression | CPU Only | ibm-granite/granite-4.0-350m-GGUF | llama.cpp |
| Compression | GPU (CUDA) | LiquidAI/LFM2.5-1.2B-Instruct | Transformers |
| Embeddings | Any | microsoft/harrier-oss-v1-0.6b | Sentence Transformers |
Latest tested local compression models
These tables come from the latest local benchmark snapshots bundled in the repo:
- CPU: benchmark/results/cpu-models-20260524T014526Z
- GPU: benchmark/results/gpu-models-20260524T212353Z
They are sorted by average latency, fastest first, because raw score alone hides the practical tradeoff. Score here means the benchmark’s main recovered score. Treat the timing numbers as relative to the benchmark machine, not as universal promises for your own hardware.
CPU models
| Name | Avg. Inference (s) | Score | Comments |
|---|---|---|---|
| ibm-granite/granite-4.0-350m-GGUF | 2.14 | 46.93 | Current built-in CPU default. Fastest tested CPU path and good enough for everyday use, but no longer the strongest quality option. |
| LiquidAI/LFM2.5-350M-GGUF | 2.38 | 49.92 | Nearly Granite-default latency with better quality. Good low-latency alternative when you want a quick upgrade without jumping to a larger model. |
| Qwen/Qwen2.5-0.5B-Instruct-GGUF | 3.30 | 53.06 | Strong for its size and still quick, but it rejects more cases than the best CPU choices. |
| unsloth/gemma-3-270m-it-GGUF | 3.53 | 38.55 | Very small and fast, but quality is weak. Only worth it when your machine is extremely constrained. |
| unsloth/Qwen2.5-Coder-0.5B-Instruct-128K-GGUF | 3.67 | 51.86 | Solid code-heavy CPU option. Better balanced than the base Qwen2.5-0.5B if you mostly compress developer tooling output. |
| unsloth/Qwen3.5-0.8B-GGUF | 4.54 | 56.45 | Best CPU model overall in the current run. This is the main CPU upgrade we recommend over the default. |
| unsloth/SmolLM2-360M-Instruct-GGUF | 5.48 | 47.64 | Usable lightweight fallback, but not clearly stronger than faster alternatives above it. |
| unsloth/gemma-3-1b-it-GGUF | 5.55 | 40.15 | Middling quality for the latency. Usually skip it. |
| LiquidAI/LFM2.5-1.2B-Instruct-GGUF | 6.48 | 48.25 | Very low hard-reject count, but too many soft accepts keep the score moderate. |
| LiquidAI/LFM2-700M-GGUF | 6.76 | 49.15 | Reasonable middle-ground model, but slower than better-scoring alternatives. |
| LiquidAI/LFM2-350M-Extract-GGUF | 7.81 | 39.74 | Extract-tuned and weak as a general-purpose CtxSift compression model. |
| unsloth/Qwen3-0.6B-GGUF | 13.91 | 53.10 | Good quality, but too slow on CPU relative to Qwen3.5-0.8B unless you specifically want Qwen3. |
GPU models
| Name | Avg. Inference (s) | Score | Comments |
|---|---|---|---|
| LiquidAI/LFM2.5-1.2B-Instruct | 0.81 | 54.61 | Current CUDA recommendation. Fastest GPU model by a wide margin and still strong enough to be the practical default. |
| Qwen/Qwen3.5-0.8B | 3.43 | 59.13 | Strong small GPU model. Much faster than the 2B-tier options while still posting a very good score. |
| Qwen/Qwen2.5-1.5B-Instruct | 7.80 | 59.28 | Slightly higher score than Qwen3.5-0.8B, but at more than double the latency. |
| unsloth/gemma-3-1b-it | 11.81 | 42.69 | Weak for the latency. Not a strong CUDA pick. |
| ibm-granite/granite-4.0-micro | 15.30 | 47.45 | Usable, but outclassed by faster or better-scoring Qwen and LFM models. |
| Qwen/Qwen3.5-2B | 16.92 | 61.07 | Best GPU score in the current run. This is the higher-quality CUDA upgrade when you are willing to pay for more latency. |
| ibm-granite/granite-3.3-2b-instruct | 17.43 | 46.43 | Reliable enough, but the score does not justify the latency against better Qwen options. |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 17.67 | 53.00 | Decent fallback GPU model, but not especially compelling against LFM for speed or Qwen for quality. |
| Qwen/Qwen3-1.7B | 33.80 | 58.05 | Good score, but far slower than the main recommended GPU models. |
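Because the tables are sorted by latency and not by score, one way to read them is to fix a latency budget first and then take the best score that fits. The sketch below does that for a few rows copied from the GPU table. The helper names `gpu_results` and `best_under_budget` are hypothetical and not part of CtxSift, and the timings remain relative to the benchmark machine.

```shell
# Illustrative only: a few rows from the GPU table as "name latency_s score".
gpu_results() {
  cat <<'EOF'
LiquidAI/LFM2.5-1.2B-Instruct 0.81 54.61
Qwen/Qwen3.5-0.8B 3.43 59.13
Qwen/Qwen2.5-1.5B-Instruct 7.80 59.28
Qwen/Qwen3.5-2B 16.92 61.07
EOF
}

# Print the highest-scoring model whose average latency fits the budget (seconds).
# Prints nothing when no model fits.
best_under_budget() {
  gpu_results | awk -v budget="$1" '
    $2 + 0 <= budget + 0 && $3 + 0 > best { best = $3 + 0; name = $1; score = $3 }
    END { if (name != "") print name, score }
  '
}

best_under_budget 5
```

With a 5 second budget this selects Qwen/Qwen3.5-0.8B; tightening the budget to 1 second falls back to LiquidAI/LFM2.5-1.2B-Instruct, which matches the table's framing of LFM as the practical default.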
CPU compression with llama.cpp
When running on the CPU, CtxSift uses an embedded llama.cpp runner. This engine requires models in the GGUF format.
Why GGUF on CPU?
- Optimized for CPU instruction sets (AVX2, AVX-512).
- Highly memory-efficient.
- Loads specific pre-quantized files directly (e.g. Q8_0 or Q4_K_M).
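AVX2 and AVX-512 support varies by CPU. As a minimal sketch, assuming a Linux machine where /proc/cpuinfo is available, the hypothetical helper `simd_flags` below lists which of the two flags the kernel reports. It is not part of CtxSift, and it says nothing about which code paths the embedded llama.cpp runner actually selects.

```shell
# List which of avx2 / avx512f appear in a cpuinfo-style file.
# Defaults to /proc/cpuinfo (Linux only); prints nothing if neither flag is found
# or the file does not exist.
simd_flags() {
  { grep -o -w -E 'avx2|avx512f' "${1:-/proc/cpuinfo}" 2>/dev/null || true; } | sort -u
}

simd_flags
```

An empty result on Linux suggests the CPU reports neither flag; on macOS or Windows the file is absent, so a different check would be needed.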
Recommended GGUF settings
If you want a clear CPU quality upgrade over the default Granite 350M model, switch to Qwen3.5-0.8B-GGUF:
    # Set the Hugging Face GGUF repository id
    ctxsift config set local.model Qwen/Qwen3.5-0.8B-GGUF
    # Set the target GGUF file within the repo's files tab
    ctxsift config set local.gguf_filename Qwen3.5-0.8B-Q8_0.gguf
GPU compression with Transformers
If you have a CUDA-compatible GPU, CtxSift uses the PyTorch/Hugging Face Transformers backend. This supports any standard autoregressive text-generation model.
Recommended: LiquidAI LFM2.5-1.2B-Instruct
For local GPU compression, we still recommend LiquidAI/LFM2.5-1.2B-Instruct, and the setup flow favors it.
- Best speed-to-quality default: It was the fastest GPU model in the latest local run at 0.81 s average inference while still scoring 54.61.
- Easy practical win: It is the safest first CUDA pick before you spend more latency on bigger Qwen models.
- Good recovery behavior: It benefits from CtxSift’s deterministic output recovery without depending on it to become usable.
Configuring GPU mode
Ensure you have the GPU dependencies installed:
    uv tool install "ctxsift[gpu]"
Configure CtxSift to run LFM on CUDA:
    # Select GPU execution
    ctxsift config set local.device cuda
    # Configure the model repository (GGUF filename is ignored on GPU)
    ctxsift config set local.model LiquidAI/LFM2.5-1.2B-Instruct
In-memory GPU quantization
If your GPU has limited VRAM, you can load larger models by enabling in-memory quantization using bitsandbytes.
To use GPU quantization:
- Install the quantization extras:
    uv tool install "ctxsift[gpu,quant]"
- Set your desired quantization level:
    # Use 8-bit precision (trades slight speed for lower VRAM)
    ctxsift config set local.quantization bnb-8bit
    # Use 4-bit NormalFloat precision (aggressive memory savings)
    ctxsift config set local.quantization bnb-4bit-nf4
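To get a feel for what each level saves, a rough weights-only estimate is parameters times bits per weight, divided by 8 to get bytes. The sketch below applies that arithmetic to a nominal 2-billion-parameter model. The helper `weights_gb` is hypothetical, the 16-bit unquantized baseline is an assumption (this guide does not state CtxSift's default dtype), and real usage will be higher because activations, the KV cache, and quantization metadata are not counted.

```shell
# Rough weights-only footprint in GB (10^9 bytes).
# $1 = parameter count in billions (nominal), $2 = bits per weight.
weights_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", p * bits / 8 }'
}

# Assumed baseline of 16 bits, then the 8-bit and 4-bit levels described above.
for bits in 16 8 4; do
  echo "${bits}-bit: $(weights_gb 2 "$bits") GB"
done
```

Under these assumptions each halving of precision halves the weight footprint, which is why the 4-bit setting is the more aggressive option when VRAM is tight.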
Local model cache locations
CtxSift caches models to avoid downloading them multiple times:
- Transformers (GPU & Embeddings): Cached in the standard Hugging Face hub cache directory (typically ~/.cache/huggingface/hub).
- GGUF Models (CPU): Cached under the CtxSift local data folder:
  - Linux/macOS: ~/.local/share/ctxsift/models/
  - Windows: %LOCALAPPDATA%\ctxsift\ctxsift\Cache\models\
To redirect the cache directory to a custom path (e.g. an external drive):
    ctxsift config set local.model_cache_path "D:/model-cache"
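Before moving the cache, it can help to see how much space the current locations use. This is a minimal sketch for Linux/macOS that checks the two default paths listed above. The helper `report_cache_dir` is hypothetical and not a CtxSift command; if you have already set local.model_cache_path or a custom Hugging Face cache location, substitute those paths.

```shell
# Print the disk usage of one cache directory, or note that it does not exist yet.
report_cache_dir() {
  if [ -d "$1" ]; then
    du -sh "$1"
  else
    echo "not found: $1"
  fi
}

report_cache_dir "$HOME/.cache/huggingface/hub"
report_cache_dir "$HOME/.local/share/ctxsift/models"
```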