
Local Models Guide

CtxSift is built to run fully locally, so your command-line history, code snippets, and traces never leave your machine. It uses two separate local models:

  1. Local Compression Model (LLM): Extracts signals and summarizes output.
  2. Local Embedding Model: Converts text to vectors for semantic recall.

Depending on your hardware capability, CtxSift recommends specific models optimized for accuracy, latency, and resource constraints:

| Component | Hardware | Recommended Model | Engine / Backend |
| --- | --- | --- | --- |
| Compression | CPU Only | ibm-granite/granite-4.0-350m-GGUF | llama.cpp |
| Compression | GPU (CUDA) | LiquidAI/LFM2.5-1.2B-Instruct | Transformers |
| Embeddings | Any | microsoft/harrier-oss-v1-0.6b | Sentence Transformers |

The benchmark tables below come from the latest local benchmark snapshots bundled in the repo:

  • CPU: benchmark/results/cpu-models-20260524T014526Z
  • GPU: benchmark/results/gpu-models-20260524T212353Z

They are sorted by average latency, fastest first, because raw score alone hides the practical tradeoff. Score here means the benchmark’s main recovered score. Treat the timing numbers as relative to the benchmark machine, not as universal promises for your own hardware.

CPU models:

| Name | Avg. Inference (s) | Score | Comments |
| --- | --- | --- | --- |
| ibm-granite/granite-4.0-350m-GGUF | 2.14 | 46.93 | Current built-in CPU default. Fastest tested CPU path and good enough for everyday use, but no longer the strongest quality option. |
| LiquidAI/LFM2.5-350M-GGUF | 2.38 | 49.92 | Nearly Granite-default latency with better quality. Good low-latency alternative when you want a quick upgrade without jumping to a larger model. |
| Qwen/Qwen2.5-0.5B-Instruct-GGUF | 3.30 | 53.06 | Strong for its size and still quick, but it rejects more cases than the best CPU choices. |
| unsloth/gemma-3-270m-it-GGUF | 3.53 | 38.55 | Very small and fast, but quality is weak. Only worth it when your machine is extremely constrained. |
| unsloth/Qwen2.5-Coder-0.5B-Instruct-128K-GGUF | 3.67 | 51.86 | Solid code-heavy CPU option. Better balanced than the base Qwen2.5-0.5B if you mostly compress developer tooling output. |
| unsloth/Qwen3.5-0.8B-GGUF | 4.54 | 56.45 | Best CPU model overall in the current run. This is the main CPU upgrade we recommend over the default. |
| unsloth/SmolLM2-360M-Instruct-GGUF | 5.48 | 47.64 | Usable lightweight fallback, but not clearly stronger than faster alternatives above it. |
| unsloth/gemma-3-1b-it-GGUF | 5.55 | 40.15 | Middling quality for the latency. Usually skip it. |
| LiquidAI/LFM2.5-1.2B-Instruct-GGUF | 6.48 | 48.25 | Very low hard-reject count, but too many soft accepts keep the score moderate. |
| LiquidAI/LFM2-700M-GGUF | 6.76 | 49.15 | Reasonable middle-ground model, but slower than better-scoring alternatives. |
| LiquidAI/LFM2-350M-Extract-GGUF | 7.81 | 39.74 | Extract-tuned and weak as a general-purpose CtxSift compression model. |
| unsloth/Qwen3-0.6B-GGUF | 13.91 | 53.10 | Good quality, but too slow on CPU relative to Qwen3.5-0.8B unless you specifically want Qwen3. |
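
To make the CPU tradeoff concrete, the sketch below expresses each model as a latency multiple and a score difference relative to the built-in default. The numbers are copied from the CPU table; the function name `compare_to_baseline` is purely illustrative and is not part of CtxSift.

```python
# (name, avg inference seconds, score) copied from the CPU benchmark table.
CPU_RESULTS = [
    ("ibm-granite/granite-4.0-350m-GGUF", 2.14, 46.93),
    ("LiquidAI/LFM2.5-350M-GGUF", 2.38, 49.92),
    ("Qwen/Qwen2.5-0.5B-Instruct-GGUF", 3.30, 53.06),
    ("unsloth/gemma-3-270m-it-GGUF", 3.53, 38.55),
    ("unsloth/Qwen2.5-Coder-0.5B-Instruct-128K-GGUF", 3.67, 51.86),
    ("unsloth/Qwen3.5-0.8B-GGUF", 4.54, 56.45),
    ("unsloth/SmolLM2-360M-Instruct-GGUF", 5.48, 47.64),
    ("unsloth/gemma-3-1b-it-GGUF", 5.55, 40.15),
    ("LiquidAI/LFM2.5-1.2B-Instruct-GGUF", 6.48, 48.25),
    ("LiquidAI/LFM2-700M-GGUF", 6.76, 49.15),
    ("LiquidAI/LFM2-350M-Extract-GGUF", 7.81, 39.74),
    ("unsloth/Qwen3-0.6B-GGUF", 13.91, 53.10),
]


def compare_to_baseline(results, baseline_name):
    """Return (name, latency multiple, score delta) for every non-baseline model."""
    _, base_latency, base_score = next(r for r in results if r[0] == baseline_name)
    rows = []
    for name, latency, score in results:
        if name == baseline_name:
            continue
        rows.append((name, round(latency / base_latency, 2), round(score - base_score, 2)))
    return rows


# Print each alternative relative to the default Granite 350M model.
for name, multiple, delta in compare_to_baseline(CPU_RESULTS, "ibm-granite/granite-4.0-350m-GGUF"):
    print(f"{name}: {multiple:.2f}x latency, {delta:+.2f} score")
```

In this snapshot, Qwen3.5-0.8B costs roughly 2.1x the default's latency for about 9.5 more score points. Ratios like these should transfer to other machines better than the raw seconds do, though that is an expectation rather than something the benchmark measures.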
GPU models:

| Name | Avg. Inference (s) | Score | Comments |
| --- | --- | --- | --- |
| LiquidAI/LFM2.5-1.2B-Instruct | 0.81 | 54.61 | Current CUDA recommendation. Fastest GPU model by a wide margin and still strong enough to be the practical default. |
| Qwen/Qwen3.5-0.8B | 3.43 | 59.13 | Strong small GPU model. Much faster than the 2B-tier options while still posting a very good score. |
| Qwen/Qwen2.5-1.5B-Instruct | 7.80 | 59.28 | Slightly higher score than Qwen3.5-0.8B, but at more than double the latency. |
| unsloth/gemma-3-1b-it | 11.81 | 42.69 | Weak for the latency. Not a strong CUDA pick. |
| ibm-granite/granite-4.0-micro | 15.30 | 47.45 | Usable, but outclassed by faster or better-scoring Qwen and LFM models. |
| Qwen/Qwen3.5-2B | 16.92 | 61.07 | Best GPU score in the current run. This is the higher-quality CUDA upgrade when you are willing to pay for more latency. |
| ibm-granite/granite-3.3-2b-instruct | 17.43 | 46.43 | Reliable enough, but the score does not justify the latency against better Qwen options. |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 17.67 | 53.00 | Decent fallback GPU model, but not especially compelling against LFM for speed or Qwen for quality. |
| Qwen/Qwen3-1.7B | 33.80 | 58.05 | Good score, but far slower than the main recommended GPU models. |
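
One way to read the GPU table is to keep only the models that no other model beats on both latency and score (a Pareto front). The sketch below is an illustration built from the table's numbers, not a CtxSift feature; `pareto_front` is a hypothetical helper name.

```python
# (name, avg inference seconds, score) copied from the GPU benchmark table.
GPU_RESULTS = [
    ("LiquidAI/LFM2.5-1.2B-Instruct", 0.81, 54.61),
    ("Qwen/Qwen3.5-0.8B", 3.43, 59.13),
    ("Qwen/Qwen2.5-1.5B-Instruct", 7.80, 59.28),
    ("unsloth/gemma-3-1b-it", 11.81, 42.69),
    ("ibm-granite/granite-4.0-micro", 15.30, 47.45),
    ("Qwen/Qwen3.5-2B", 16.92, 61.07),
    ("ibm-granite/granite-3.3-2b-instruct", 17.43, 46.43),
    ("HuggingFaceTB/SmolLM2-1.7B-Instruct", 17.67, 53.00),
    ("Qwen/Qwen3-1.7B", 33.80, 58.05),
]


def pareto_front(results):
    """Return names of models that are not beaten on both latency and score."""
    front = []
    best_score = float("-inf")
    # Walk from fastest to slowest; a slower model only earns its place
    # if it scores higher than everything faster than it.
    for name, _latency, score in sorted(results, key=lambda r: (r[1], -r[2])):
        if score > best_score:
            front.append(name)
            best_score = score
    return front


for name in pareto_front(GPU_RESULTS):
    print(name)
```

On this snapshot the front contains LFM2.5-1.2B-Instruct, Qwen3.5-0.8B, Qwen2.5-1.5B-Instruct, and Qwen3.5-2B. Every other row is both slower and lower-scoring than at least one of those four.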

When running on the CPU, CtxSift uses an embedded llama.cpp runner. This engine requires models in the GGUF format.

  • Optimized for CPU instruction sets (AVX2, AVX-512).
  • Highly memory-efficient.
  • Loads specific pre-quantized files directly (e.g. Q8_0 or Q4_K_M).

If you want a clear CPU quality upgrade over the default Granite 350M model, switch to Qwen3.5-0.8B-GGUF:

# Set the Hugging Face GGUF repository id
ctxsift config set local.model Qwen/Qwen3.5-0.8B-GGUF
# Set the target GGUF file within the repo's files tab
ctxsift config set local.gguf_filename Qwen3.5-0.8B-Q8_0.gguf

If you have a CUDA-compatible GPU, CtxSift uses the PyTorch/Hugging Face Transformers backend. This supports any standard autoregressive text-generation model.

Recommended: LiquidAI LFM2.5-1.2B-Instruct

For local GPU compression, we still recommend LiquidAI/LFM2.5-1.2B-Instruct, and the setup flow steers toward it.

  • Best speed-to-quality default: It was the fastest GPU model in the latest local run at 0.81 s average inference while still scoring 54.61.
  • Easy practical win: It is the safest first CUDA pick before you spend more latency on bigger Qwen models.
  • Good recovery behavior: It benefits from CtxSift’s deterministic output recovery without depending on it to become usable.

Ensure you have the GPU dependencies installed:

uv tool install "ctxsift[gpu]"

Configure CtxSift to run LFM on CUDA:

# Select GPU execution
ctxsift config set local.device cuda
# Configure the model repository (GGUF filename is ignored on GPU)
ctxsift config set local.model LiquidAI/LFM2.5-1.2B-Instruct

If your GPU has limited VRAM, you can load larger models by enabling in-memory quantization using bitsandbytes.

To use GPU quantization:

  1. Install the quantization extras:
    uv tool install "ctxsift[gpu,quant]"
  2. Set your desired quantization level. Run only one of these commands, since both write the same key and the later one overrides the earlier:
    # Use 8-bit precision (trades slight speed for lower VRAM)
    ctxsift config set local.quantization bnb-8bit
    # Or use 4-bit NormalFloat precision (aggressive memory savings)
    ctxsift config set local.quantization bnb-4bit-nf4
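
As a rough guide to how much VRAM each setting saves, the sketch below estimates weight memory as parameter count times bits per weight. It makes several assumptions: the unquantized baseline is 16-bit (the `fp16` key is a hypothetical label, not a CtxSift config value), parameter counts are the nominal figures from the model names, and activations, KV cache, quantization constants, and any layers left unquantized are ignored. Treat the results as a lower bound.

```python
# Bits stored per weight for each setting. "fp16" is an assumed unquantized
# baseline; the other two keys mirror the local.quantization values above.
BITS_PER_WEIGHT = {
    "fp16": 16,
    "bnb-8bit": 8,
    "bnb-4bit-nf4": 4,
}


def estimate_weight_gb(num_params, quantization="fp16"):
    """Estimate weight memory in GB (10^9 bytes) for a parameter count."""
    bits = BITS_PER_WEIGHT[quantization]
    return num_params * bits / 8 / 1e9


# Nominal parameter counts taken from the model names (approximate).
for label, params in [("LFM2.5-1.2B", 1.2e9), ("Qwen3.5-2B", 2.0e9)]:
    for setting in BITS_PER_WEIGHT:
        print(f"{label} @ {setting}: ~{estimate_weight_gb(params, setting):.1f} GB")
```

Under these assumptions a nominal 2B-parameter model needs about 4 GB for weights at 16-bit, 2 GB at 8-bit, and 1 GB at 4-bit. If the backend loads weights in 32-bit instead, double the baseline figure.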

CtxSift caches models to avoid downloading them multiple times:

  • Transformers (GPU & Embeddings): Cached in the standard Hugging Face hub cache directory (typically ~/.cache/huggingface/hub).
  • GGUF Models (CPU): Cached under the CtxSift local data folder:
    • Linux/macOS: ~/.local/share/ctxsift/models/
    • Windows: %LOCALAPPDATA%\ctxsift\ctxsift\Cache\models\

To redirect the cache directory to a custom path (e.g. an external drive):

ctxsift config set local.model_cache_path "D:/model-cache"
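
Before moving the cache, it can help to see how much space the GGUF files take. The sketch below lists every `.gguf` file under a directory, largest first. It assumes nothing about how CtxSift lays out files inside the cache folder, so it searches recursively; `list_cached_gguf` is a hypothetical helper, and the path in the usage line is the Linux/macOS default listed above.

```python
from pathlib import Path


def list_cached_gguf(cache_dir):
    """Return (relative path, size in MiB) for each .gguf file, largest first."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        # Nothing cached yet, or the path points somewhere else.
        return []
    entries = []
    for path in root.rglob("*.gguf"):
        if path.is_file():
            size_mib = path.stat().st_size / 2**20
            entries.append((path.relative_to(root).as_posix(), size_mib))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    # Default GGUF cache location on Linux/macOS per this guide.
    for rel_path, size_mib in list_cached_gguf("~/.local/share/ctxsift/models"):
        print(f"{size_mib:10.1f} MiB  {rel_path}")
```

Models loaded through Transformers or Sentence Transformers live in the Hugging Face hub cache instead, so they will not appear in this listing.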