Longer context. Less memory. Retrieval intact.
At 128K context, standard Q8 KV lost the needle. fraQtl used less memory than Q8 — and retrieved it.
| KV CACHE FORMAT | MEMORY | NEEDLE RETRIEVAL |
|---|---|---|
| fp16 KV (baseline) | 22.7 GB | 5 / 5 retrieval |
| Q8 KV | 15.4 GB | 1 / 5 retrieval |
| fraQtl D1 | 13.3 GB | 5 / 5 retrieval |
Below Q8 memory, fp16-level retrieval, in a real long-context run. Standard low-bit KV saves memory but loses the needle; fraQtl saves memory and keeps it.
What is the KV cache?
Every transformer-based language model uses an attention mechanism to decide which previous tokens matter for predicting the next one. To avoid recomputing the attention inputs from scratch at every step, the model stores a key (K) vector and a value (V) vector for every token it has seen, in every layer and attention head. This is the KV cache.
The KV cache is what gives a model its "memory" during a conversation. Without it, every new token would require recomputing keys and values for the entire preceding sequence. With it, generation is fast, but the cache grows linearly with every token.
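To make the mechanism concrete, here is a minimal sketch of one attention head decoding with a cache. The toy vectors and the `KVCache` class are illustrative assumptions, not part of any real model or of fraQtl; the point is only that appending one key/value pair per token gives the same result as recomputing over the whole sequence.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    """Hypothetical cache for one layer and one head: grows by one entry per token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy (query, key, value) triples; a real model projects these from hidden states.
tokens = [([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),
          ([0.0, 1.0], [1.0, 0.0], [3.0, 4.0]),
          ([1.0, 1.0], [0.0, 1.0], [5.0, 6.0])]

cache = KVCache()
for query, key, value in tokens:
    cached_out = cache.step(query, key, value)

# Recomputing every key and value from scratch yields the same output for the last token.
all_keys = [k for _, k, _ in tokens]
all_values = [v for _, _, v in tokens]
recomputed = attend(tokens[-1][0], all_keys, all_values)
```

The cache trades compute for memory: each step touches only the new token's projections, but `keys` and `values` never shrink.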
Why KV cache is the bottleneck
For small contexts (a few hundred tokens), the KV cache is negligible. But long-context workloads run at 32K and 128K+. At those lengths, the cache dominates GPU memory:
| CONTEXT LENGTH | KV CACHE SIZE (7B MODEL) | REALITY |
|---|---|---|
| 4K tokens | 0.5 GB | Fits easily |
| 32K tokens | 4.3 GB | Tight on one GPU |
| 128K tokens | 17 GB | Exceeds model weights |
| 512K tokens | 68 GB | Needs multiple GPUs just for cache |
At 128K context, the cache is larger than the model itself. You're buying GPUs to store memory, not to compute.
For every dollar you spend on GPU memory for a 7B model at 32K context, roughly 40 cents goes to the KV cache. At 128K, it's 70 cents.
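The table's figures can be approximated with simple arithmetic. The sketch below assumes an fp16 cache and a grouped-query 7B configuration (32 layers, 8 KV heads, head dimension 128); these parameters are our assumption, chosen because they land close to the table, and are not stated in the source. The 40-cent and 70-cent shares come out close if the weights occupy roughly 7 GB, which is also an assumption.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """K and V each store one head_dim vector per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def cache_share(tokens, weight_bytes):
    """Fraction of (weights + cache) memory taken by the KV cache."""
    cache = kv_cache_bytes(tokens)
    return cache / (cache + weight_bytes)

for label, tokens in [("4K", 4096), ("32K", 32768), ("128K", 131072), ("512K", 524288)]:
    print(f"{label}: {kv_cache_bytes(tokens) / 1e9:.1f} GB")

# Assumed weight footprint of about 7 GB (not stated in the source).
assumed_weight_bytes = 7e9
print(f"share at 32K:  {cache_share(32768, assumed_weight_bytes):.0%}")
print(f"share at 128K: {cache_share(131072, assumed_weight_bytes):.0%}")
```

Under these assumptions the cache costs 128 KiB per token. A model without grouped-query attention, or one with more layers, would need proportionally more.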
Why this matters for inference cost
GPU memory is the most expensive resource in LLM deployment. Every gigabyte of KV cache is a gigabyte that can't be used for batching more users, running longer conversations, or deploying larger models.
- Memory = hardware cost. When the KV cache outgrows the GPU at long context, you reach for more hardware. Less KV memory means long-context workloads that needed extra GPUs for cache can fit on fewer.
- Memory = throughput. A smaller cache means more requests can fit in the same memory budget.
- Memory = context length. The longest context you can serve is gated by KV cache size — compress it and the same GPU holds more.
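The throughput and context-length points above reduce to division. A rough sketch, assuming the same 128 KiB-per-token fp16 cache as a grouped-query 7B model, a hypothetical 24 GB GPU with about 7 GB of weights, and a hypothetical 2× compression factor used purely for illustration (it is not a fraQtl figure):

```python
# Assumed fp16 cache: 2 tensors x 32 layers x 8 KV heads x head dim 128 x 2 bytes
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2

def requests_that_fit(free_bytes, context_tokens, compression=1.0):
    """Whole requests whose KV cache fits in free_bytes at a given context length."""
    per_request = BYTES_PER_TOKEN * context_tokens / compression
    return int(free_bytes // per_request)

def longest_context(free_bytes, compression=1.0):
    """Longest single-request context, in tokens, that fits in free_bytes."""
    return int(free_bytes * compression // BYTES_PER_TOKEN)

# Hypothetical budget: 24 GB GPU minus roughly 7 GB of weights.
free_bytes = 24_000_000_000 - 7_000_000_000

baseline_requests = requests_that_fit(free_bytes, 32_768)
compressed_requests = requests_that_fit(free_bytes, 32_768, compression=2.0)
print(baseline_requests, compressed_requests)
print(longest_context(free_bytes), longest_context(free_bytes, compression=2.0))
```

The estimate ignores activations, allocator fragmentation, and framework overhead, so real capacity will be somewhat lower.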
For production inference at scale — chatbots, coding assistants, document analysis, RAG pipelines — KV cache compression is one of the highest-leverage memory optimizations for long-context inference.
Why existing approaches fall short
The field has tried several strategies to shrink the KV cache. Each has a fundamental tradeoff:
Token eviction (SnapKV, TOVA, StreamingLLM)
Drop tokens the model "probably" doesn't need. The problem: you don't know what the model will need next. Eviction is irreversible. If a dropped token turns out to be critical later, the model hallucinates or loses coherence. In the bake-off reported below, needle-in-a-haystack retrieval for these methods ranges from 37.8% (StreamingLLM) to 97.8% (TOVA).
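The failure mode is easy to see in a toy position-based policy. The sketch below keeps a few initial tokens plus a recent window, in the spirit of the attention-sink approach associated with StreamingLLM; the function name and sizes are illustrative assumptions, and SnapKV and TOVA select tokens by attention scores rather than by position.

```python
def evict_sink_and_window(cache, num_sink=4, window=8):
    """Keep the first num_sink entries and the most recent `window` entries."""
    if len(cache) <= num_sink + window:
        return list(cache)
    return list(cache[:num_sink]) + list(cache[-window:])

# A 32-token cache with the critical fact sitting in the middle.
cache = [f"tok{i}" for i in range(32)]
cache[15] = "NEEDLE"

kept = evict_sink_and_window(cache)
needle_survived = "NEEDLE" in kept
print(len(kept), needle_survived)
```

Once the needle's entry is gone, no later query can attend to it, whatever the question turns out to be.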
Rank reduction (SVD, low-rank projection)
Delete entire dimensions from the cache. This destroys information permanently. Even removing the "least important" dimensions causes catastrophic attention routing errors: the model reads the wrong tokens. Perplexity (PPL) degrades by 4 to 364 points depending on aggressiveness.
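A constructed toy shows how a routing error can arise. Here the low-variance key dimension is the one the query relies on, so dropping it flips which token wins. The vectors are invented for illustration, and real rank reduction uses SVD or a learned projection rather than dropping raw coordinates.

```python
from statistics import pvariance

# Three cached keys in two dimensions; dimension 1 has very little variance.
keys = {"a": [2.9, 0.1], "b": [3.0, -0.1], "c": [-3.0, 0.0]}
query = [0.1, 10.0]  # this query depends mostly on dimension 1

def best_match(query, keys, dims):
    """Name of the key with the highest dot-product score over the given dims."""
    def score(key):
        return sum(query[d] * key[d] for d in dims)
    return max(keys, key=lambda name: score(keys[name]))

# "Least important" here means lowest variance across the cached keys.
variances = [pvariance([key[d] for key in keys.values()]) for d in range(2)]
kept_dims = [d for d in range(2) if d != variances.index(min(variances))]

full_choice = best_match(query, keys, dims=[0, 1])
reduced_choice = best_match(query, keys, dims=kept_dims)
print(full_choice, reduced_choice)
```

With both dimensions the query selects key `a`; after the low-variance dimension is deleted it selects key `b`.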
Naive quantization (KIVI, round-to-nearest)
Reduce precision uniformly across all dimensions. Better than deletion, but without understanding which dimensions matter more, you waste precision budget on directions that don't contribute to prediction. Perplexity rises by 0.27 (KVQuant) to 1.00 (KIVI).
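For reference, uniform round-to-nearest quantization looks roughly like the sketch below. It is the generic baseline with one symmetric scale per vector, not the exact scheme of KIVI or any other published method, and the sample vector is made up.

```python
def quantize_rtn(vector, bits):
    """Symmetric uniform round-to-nearest quantization with one scale per vector."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed codes
    peak = max(abs(x) for x in vector)
    scale = peak / levels if peak > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

vector = [0.9, -0.35, 0.02, 0.6]
codes, scale = quantize_rtn(vector, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, round(max_error, 4))
```

Every dimension gets the same step size, so the worst-case error of half a step applies equally to directions that matter and directions that do not.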
fraQtl: a different approach
fraQtl compresses the KV cache using a mathematically derived importance metric that estimates which directions carry downstream signal and which can be aggressively compressed.
The key insight: not all dimensions are equal. A small fraction of the cache carries most of the information the model actually uses for prediction. fraQtl calibrates an importance-aware basis and allocates precision toward the directions most likely to affect downstream behavior.
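The general idea of spending precision where it matters can be sketched with a textbook greedy bit allocation. This is not fraQtl's algorithm: the importance scores are hypothetical inputs, and the assumption that each extra bit cuts a direction's weighted error by about 4× is the standard uniform-quantizer rule of thumb.

```python
def allocate_bits(importance, total_bits, min_bits=1, max_bits=8):
    """Greedily hand out bits to the direction with the largest remaining weighted error."""
    bits = [min_bits] * len(importance)
    remaining = total_bits - sum(bits)
    if remaining < 0:
        raise ValueError("total_bits is too small for min_bits per direction")
    while remaining > 0:
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        # Assumed error model: weighted error is proportional to importance * 4**(-bits).
        best = max(candidates, key=lambda i: importance[i] * 4.0 ** (-bits[i]))
        bits[best] += 1
        remaining -= 1
    return bits

# Hypothetical importance scores for four directions, with an average budget of 4 bits.
importance = [100.0, 10.0, 1.0, 0.1]
bits = allocate_bits(importance, total_bits=16)
print(bits)
```

The total budget matches uniform 4-bit quantization, but the bits are skewed toward the directions with the highest scores.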
For pre-compressed artifacts, the model loads through standard Hugging Face workflows — no code changes. Runtime KV compression is available separately in early access.
Research backing
The headline above is the customer claim. Underneath it is the research that makes it work — V-cache compression results across architectures, and a retrieval bake-off against published methods. These are lab/research numbers, reported for transparency, not the front-page promise.
| MODEL | V-CACHE COMPRESSION | PPL DELTA | NIAH (375 TRIALS) |
|---|---|---|---|
| Mistral 7B | 3.5× | +0.012 | 100% |
| Llama 3.2 3B | 3.5× | +0.014 ± 0.004 | 100% |
| Qwen 2.5 3B | 3.5× | +0.015 | 100% |
V-cache compression across 3 models with 3-seed error bars. Early-run NIAH (375 trials, 2 models). The bake-off below shows the full C44 1080-trial retrieval comparison against competing methods.
vs. published methods — NIAH bake-off
Needle-in-a-haystack retrieval on Mistral-7B-Instruct, 8K–31K context, 3 needle types × 360 trials per method (1080 trials total per row, except H2O which OOMs at 16K+).
| METHOD | NIAH RETRIEVAL | PPL DELTA (V-ONLY) |
|---|---|---|
| fraQtl V-only | 98.5% | +0.012 |
| fraQtl V+K | 97.8% | — |
| TOVA | 97.8% | +0.259 |
| FP16 baseline | 97.0% | 0 |
| SnapKV | 94.1% | +0.214 |
| StreamingLLM | 37.8% | +0.548 |
| H2O | 0.0% (OOM at 16K+) | — |
| KVQuant | — | +0.27 |
| KIVI | — | +1.00 |
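For readers unfamiliar with the benchmark, a needle-in-a-haystack trial has roughly the shape sketched below. The needle text, filler, depths, and substring scoring rule are illustrative assumptions, since the source does not describe its harness. `generate` stands in for a call to a real model, and the two stand-ins here simply echo whatever part of the prompt they can "see".

```python
def build_haystack(needle, filler_sentence, num_sentences, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    sentences = [filler_sentence] * num_sentences
    sentences.insert(round(depth * num_sentences), needle)
    return " ".join(sentences)

def retrieval_rate(generate, trials):
    """Fraction of trials whose output contains the expected answer."""
    hits = sum(1 for prompt, expected in trials
               if expected.lower() in generate(prompt).lower())
    return hits / len(trials)

needle = "The secret code is 4417."
filler = "The sky was grey that day."
trials = [(build_haystack(needle, filler, 100, depth), "4417")
          for depth in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Stand-in "models": one sees the whole prompt, one only the last 200 characters.
full_context_rate = retrieval_rate(lambda prompt: prompt, trials)
truncated_rate = retrieval_rate(lambda prompt: prompt[-200:], trials)
print(full_context_rate, truncated_rate)
```

Varying needle depth and context length is what separates methods that merely keep recent tokens from methods that preserve the whole context.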
How to use it
Compressed artifacts. Published fraQtl-compressed models load through the standard Hugging Face / Transformers workflow — no custom install. See the current public artifacts on Hugging Face.
Runtime KV compression. Available in early access. We calibrate on your workload and benchmark against your own FP16 baseline before anything ships — bring your model and we'll confirm support.
Long-context memory without losing the needle.
Below Q8 memory, fp16-level retrieval — built for the failure mode where uniform low-bit KV loses recall. Bring your model; we benchmark against your stack.
Research
fraQtl is backed by two preprints (peer review in progress), validated across multiple transformer families with public receipts reported per model. The mathematical foundations are derived from first principles — not heuristics.
- Paper I: Quantization Dominates Rank Reduction for KV-Cache Compression (arXiv:2604.11501) — 5 models, 124M–14B parameters
- Paper II: Variance Is Not Importance — 46 experiments, spectral analysis of compressibility