THE PROBLEM

KV cache and VRAM are the bottleneck.

At long context, KV cache size determines how many users you can serve concurrently and what models fit on your hardware. Most teams are forced to choose: shorter context, smaller model, or more GPUs. fraQtl removes that choice.
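To make the sizing pressure concrete, here is a back-of-envelope sketch of how an uncompressed KV cache grows with context and how that caps concurrency. The formula is the usual one for a transformer with grouped-query attention (keys and values stored per layer, per KV head). The configuration values and the 40 GiB budget are illustrative assumptions, not the measured configuration of any model on this page, and the function names are hypothetical.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V cache for one sequence (2 tensors per layer)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def max_concurrent_users(kv_budget_bytes, tokens_per_user, **cfg):
    """How many full-length sequences fit in a fixed KV memory budget."""
    return kv_budget_bytes // kv_cache_bytes(tokens_per_user, **cfg)

# Hypothetical 7B-class configuration, chosen only for illustration.
cfg = dict(layers=32, kv_heads=8, head_dim=128)

for ctx in (16_384, 65_536, 131_072):
    size_gib = kv_cache_bytes(ctx, **cfg) / GIB
    print(f"{ctx:>7} tokens -> {size_gib:.0f} GiB of fp16 KV per user")

# Assumed budget: 40 GiB left for KV after weights and activations.
budget = 40 * GIB
print("fp16 users @128K:", max_concurrent_users(budget, 131_072, **cfg))
print("1-byte KV users @128K:", max_concurrent_users(budget, 131_072, bytes_per_elem=1, **cfg))
```

Under these assumptions the cache grows linearly with context, so halving the bytes per cached element roughly doubles the number of users a fixed budget can hold.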

QWEN 3.6 35B-A3B · 1× A100-80GB

VRAM at increasing context length

FP16 hits the GPU ceiling at 64K. fraQtl runs 128K with 28 GB headroom.

CONTEXT	FP16 BASELINE	FRAQTL COMPRESSED	NOTE
16K	71 GB	25.6 GB	Both fit
64K	82.9 GB (OOM)	36.8 GB	FP16 needs 2 GPUs
128K	85+ GB (OOM)	51.7 GB	fraQtl: 28 GB free

Chart reference line: A100-80GB ceiling.

FP16 64K = 82.86 GB measured (OOM on 80 GB ceiling). 16K and 128K FP16 figures are conservative model + KV-cache extrapolations. fraQtl numbers measured on a single A100-80GB with the public artifact.

30-DAY PILOT

Bring your model. Keep the savings.

Send us your model and a workload sample. We calibrate, then deliver a compressed artifact, a benchmark report, and an integration path for your deployment stack. If the numbers don't move, no commitment.

You send

Model (HF org/name), a small workload sample, target context length, current GPU setup. The Tally form takes ~2 min.

We calibrate + benchmark

Per-model calibration on your workload. Compressed artifact + benchmark report against your FP16 baseline. Pilot turnaround: 1–2 weeks.

You deploy

A compressed artifact that loads through standard Transformers, or runtime KV integration on your stack (vLLM runtime measured — see the lead receipt). Integration support included.

DELIVERY 1 · ARTIFACT

Compressed model on HuggingFace

Loads through standard Transformers; integration support included.

DELIVERY 2 · RUNTIME KV

One-line cache compression

Composes with the artifact for additional savings at long context. Early-access integration.

Request Pilot

Free technical pilot for the first 5 qualified design partners.

RELEASES · SHIP LOG · LATEST FIRST

Compressed models, ready to load.

Public artifacts on HuggingFace. Reproducible numbers. Lower-memory benchmarked artifacts that load through the standard Transformers loader.
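As a rough illustration of what "loads through the standard Transformers loader" could look like in practice, here is a minimal sketch. It assumes the artifact at huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed loads with the stock `AutoModelForCausalLM.from_pretrained` call, that `transformers` and `accelerate` are installed, and that the installed Transformers version supports the architecture. The helper names `load_model` and `complete` are hypothetical; check the model card for the exact, supported loading instructions.

```python
REPO_ID = "fraQtl/Qwen3.6-35B-A3B-compressed"  # repo id as cited on this page

def load_model(repo_id: str = REPO_ID):
    """Load tokenizer and model with the stock Transformers loader (downloads weights)."""
    # Imported lazily so this sketch can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # device_map="auto" lets accelerate place the weights on the available GPU(s).
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return tokenizer, model

def complete(tokenizer, model, prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy-decode a short continuation for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Typical use would be `tokenizer, model = load_model()` followed by `complete(tokenizer, model, "Hello")`; nothing is downloaded until `load_model` is called.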

LIVE TODAY · 2026 · APR 27

Qwen 3.6 35B-A3B compressed

MoE · 35B params (3B active) · weight-compressed artifact · 23.8 GB on disk

HuggingFace →

MMLU

82.24%

FP16: 82.40% · −0.16pp

∞Bench Passkey

30 / 30

at 125,315 tokens

VRAM @ 128K

51.7 GB

FP16 OOMs · 1× A100

    Next: Llama 3.1 70B compressed · Gemma family · subscribe to release updates →
  

WHY IT MATTERS

Fewer GPUs. Longer context. Lower cost.

At long context, KV cache becomes a major driver of GPU memory. fraQtl reduces memory pressure while preserving retrieval — allowing larger contexts or more concurrent workloads on the same hardware. On Mistral-7B at 128K in llama.cpp, fraQtl D1 holds full 5/5 needle retrieval at 41.5% lower live VRAM than fp16 KV and 14.1% lower than Q8 KV — while standard Q8 KV drops to 1/5 retrieval and Q4 KV to 0/5 at the same context.

Longer context

32K · 64K · 128K

Long-context inference becomes more practical when cache memory is compressed.

Fewer GPUs

35B AT 128K · 1× A100

Memory compression can turn multi-GPU deployments into single-GPU for selected workloads.

More concurrency

SAME MEMORY BUDGET

Smaller KV cache means more requests can fit into the same memory budget.

Quantization in the right basis replaces rank reduction as the default for KV compression.

BENCHMARKS

The numbers behind the pilot.

All numbers traceable to our public Hugging Face model card and the benchmark receipts.

HOW WE FRAME COMPARISONS

BF16 / FP16 is the quality reference. Public Q4 is the practical size baseline teams already deploy. fraQtl targets FP16-level behavior plus long-context KV memory savings the existing Q4 stack doesn't address.

Matched Q4_K_M bakeoff in progress — results published when locked.

LEAD RECEIPT · NEW · JULY 2026

Compressed KV at full speed — and it compounds under load.

The fraQtl runtime reads its packed KV pages straight into the tensor cores, with no reconstruction step. In real vLLM serving on one A100-80GB (CUDA graphs on), single-user decode runs at 95–99% of fp16 speed at 8K–32K (Mistral-7B) and at 1.32× fp16 speed at 128K (Qwen3-4B-2507). Under concurrent load, the memory advantage becomes a throughput advantage: 2.1× fp16 aggregate tokens/sec at 128K context per user. Two models, one kernel, zero per-model changes. Every number is retrieval-verified; we never say “lossless.”

95–99%

of fp16 decode speed, single user

Mistral-7B · 95% @8K · 98.5% @32K · Qwen3-4B-2507 · 1.32× fp16 @128K

2.1×

fp16 aggregate throughput @128K/user

140.9 vs 66.6 tok/s · 1.34× fp8 · 6 vs 2 users

2.42×

more KV tokens per GPU

997,200 vs 412,544 pool tokens · Mistral-7B

315/315

needle cells retrieved, exact match

7 depths × 3 keys per context · all arms

REAL VLLM SERVING · ONE A100-80GB

CUDA GRAPHS ON · TEMPERATURE 0 · FP8 ALWAYS SHOWN

Alone, fraQtl matches fp16 — and beats it at 128K. Under load, the advantage grows.

At one user, weights dominate and everyone is within a few percent. Add users and fp16 runs out of KV memory first, fp8 second. fraQtl keeps serving: near-parity with fp8 at 1 user grows to 1.34× fp8 aggregate throughput at each configuration's clean max (6 users vs 5).

CELL	FRAQTL	FP16	FP8 KV	NEEDLES
Mistral-7B · 8K · batch 1 · decode tok/s	85.7	90.39	90.05	✓
Mistral-7B · 32K · batch 1 · decode tok/s	76.84	77.98	82.81	✓
Qwen3-4B-2507 · 128K · batch 1 · decode tok/s	69.5	52.59	72.7	✓
Mistral-7B · KV pool tokens · one A100	997,200	412,544	825,104	—
Qwen3-4B-2507 · users @128K ctx each · no preemption	6	2	5	✓ every user
Qwen3-4B-2507 · aggregate tok/s @128K · clean max	140.9	66.6	105.0	✓
Qwen3-4B-2507 · 32K · 24 concurrent users · aggregate tok/s	429.7	not reported	not reported	✓ all (80/80 needles retrieved)

AGGREGATE THROUGHPUT · 128K CONTEXT PER USER · EACH CONFIG AT ITS CLEAN MAX

fraQtl

140.9 tok/s · 6 users

fp8 KV

105.0 tok/s · 5 users

fp16

66.6 tok/s · 2 users

Prefill at 128K: ~3,950 tok/s — at parity with fp16, faster than fp8. Every user needle-tested at their own depth, every rung of the ladder.
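The headline ratios can be rechecked directly from the table above. The sketch below only copies the published tok/s figures and divides them; the dictionary layout and function name are ours, chosen for illustration.

```python
# Single-user decode throughput (tok/s), copied from the table above.
single_user = {
    "Mistral-7B @8K": {"fraqtl": 85.7, "fp16": 90.39, "fp8": 90.05},
    "Mistral-7B @32K": {"fraqtl": 76.84, "fp16": 77.98, "fp8": 82.81},
    "Qwen3-4B-2507 @128K": {"fraqtl": 69.5, "fp16": 52.59, "fp8": 72.7},
}

# Aggregate throughput (tok/s) at 128K per user, each config at its clean max.
aggregate_128k = {"fraqtl": 140.9, "fp16": 66.6, "fp8": 105.0}

def relative_to(row, baseline):
    """Express every arm in a row as a multiple of the chosen baseline arm."""
    return {arm: value / row[baseline] for arm, value in row.items()}

for cell, row in single_user.items():
    print(f"{cell}: fraQtl = {relative_to(row, 'fp16')['fraqtl']:.3f}x fp16")

agg = relative_to(aggregate_128k, "fp16")
print(f"aggregate @128K: fraQtl = {agg['fraqtl']:.2f}x fp16")
print(f"aggregate @128K: fraQtl = {aggregate_128k['fraqtl'] / aggregate_128k['fp8']:.2f}x fp8")
```

Dividing the same rows by the `fp8` column instead gives the comparison against the stronger baseline.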

RETRIEVAL VERIFICATION · THE HONESTY LAYER

We never say “lossless.” We say retrieval-verified: needle-in-a-haystack passkey grids — 7 depths × 3 keys per context, exact-match gated — run on every arm of every receipt. Qwen3-4B-2507: 189/189 cells across 8K/32K/128K. Mistral-7B-v0.3: 126/126 cells across 8K/32K. Same kernel for both models, zero per-model kernel changes — a new model's sidecar builds in under an hour.
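For readers who want to see what an exact-match passkey grid looks like mechanically, here is a minimal sketch. It is not our receipt harness: the filler text, prompt template, depth spacing, and the `generate` callable (a stand-in for a real model call) are all assumptions made for illustration. The cell accounting at the end assumes three arms per receipt (fraQtl, fp16, fp8), which matches the table columns above.

```python
import random
import re

FILLER = "The grass is green. The sky is blue. "

def build_prompt(passkey, depth, n_filler):
    """Place the needle at a fractional depth (0.0 = start, 1.0 = end) of the haystack."""
    chunks = [FILLER] * n_filler
    needle = f"The passkey is {passkey}. Remember it. "
    chunks.insert(round(depth * n_filler), needle)
    return "".join(chunks) + "What is the passkey? Answer:"

def run_grid(generate, n_filler, depths=7, keys=3, seed=0):
    """Run depths x keys cells for one context length; return (passed, total)."""
    rng = random.Random(seed)
    passed = total = 0
    for i in range(depths):
        depth = i / (depths - 1)
        for _ in range(keys):
            passkey = str(rng.randint(10_000, 99_999))
            answer = generate(build_prompt(passkey, depth, n_filler))
            passed += int(answer.strip() == passkey)  # exact-match gate, no partial credit
            total += 1
    return passed, total

def oracle(prompt):
    """Stand-in 'model' that reads the passkey back out of the prompt."""
    return re.search(r"The passkey is (\d+)\.", prompt).group(1)

print(run_grid(oracle, n_filler=200))           # a perfect retriever passes every cell
print(run_grid(lambda p: "0", n_filler=200))    # a broken retriever passes none

cells_per_context = 7 * 3
print(cells_per_context * 3 * 3)  # 3 contexts x 3 arms (assumed)
print(cells_per_context * 2 * 3)  # 2 contexts x 3 arms (assumed)
```

The gate is deliberately binary: a cell counts only if the returned string equals the passkey exactly.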

Method: vLLM (real paged serving), CUDA graphs on, prefix caching off, two-pass subtractive decode timing, temperature 0, exact-match passkey retrieval per arm.

fp8 = fp16 weights + fp8-E4M3 KV (FlashInfer), the strong baseline; weights identical across all arms. fraQtl = rank-protected eigenbasis KV read fragment-native into the tensor cores, zero reconstruction.

128K receipts run with chunked prefill disabled (known perf bug under chunking, fix in progress; stock arms unaffected). A batch-routing bug that dropped needles at batch ≥2 was found 07-03, root-caused, fixed, and fully re-receipted; pre-fix numbers were never published. No 256K claims yet: capacity verified, quality gate open.

Qwen3-4B KV pool: 845,136 tokens vs fp16 361,776 (fp8 709,488), a 2.3× capacity gain over fp16.

All cells reproducible from the public Hugging Face sidecar repos (huggingface.co/fraQtl) with fraqtl_repro_receipts.py; run IDs available on request.
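The link between KV pool size and concurrent users can be sanity-checked with simple division. The sketch below takes the published pool sizes and asks how many full 128K contexts fit in each; treating "128K" as 131,072 tokens is an assumption, and a real scheduler also accounts for block granularity and generated tokens, so read this as a plausibility check rather than a description of vLLM's allocator.

```python
CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" means 131,072 tokens

# KV pool tokens on one A100-80GB, copied from the receipts above.
pools = {
    "Qwen3-4B-2507": {"fraqtl": 845_136, "fp16": 361_776, "fp8": 709_488},
    "Mistral-7B": {"fraqtl": 997_200, "fp16": 412_544, "fp8": 825_104},
}

def full_context_users(pool_tokens, context=CONTEXT_TOKENS):
    """Number of users that each get a full context out of the pool."""
    return pool_tokens // context

def capacity_vs_fp16(pool):
    """fraQtl pool size as a multiple of the fp16 pool size."""
    return pool["fraqtl"] / pool["fp16"]

qwen = pools["Qwen3-4B-2507"]
for arm, tokens in qwen.items():
    print(f"Qwen3-4B-2507 {arm}: {full_context_users(tokens)} users @128K")

for model, pool in pools.items():
    print(f"{model}: {capacity_vs_fp16(pool):.2f}x fp16 KV capacity")
```

Under this assumption the floor division lands on the same 6 / 2 / 5 user counts reported in the serving table.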

RECEIPT · LLAMA.CPP 128K

Standard low-bit KV loses retrieval at long context. fraQtl doesn't.

Mistral-7B at 128K context in llama.cpp. Same model, same prompts — only the KV-cache representation changes. Q8 and Q4 reduce memory but give up needle retrieval; fraQtl D1 reaches memory below Q8 with retrieval intact.

KV CONFIG @ 128K	LIVE VRAM	NIAH	VS FP16
fp16 KV (baseline)	22,657 MiB	5 / 5	—
Q8 KV	15,437 MiB	1 / 5	−31.9%
Q4 KV	11,287 MiB	0 / 5	−50.2%
fraQtl D1	13,261 MiB	5 / 5	−41.5%

      Mistral-7B-Instruct · 128K context · llama.cpp · NIAH = 5-needle retrieval. Live VRAM is measured resident KV-cache footprint.

      fraQtl D1 uses 14.1% less live VRAM than Q8 KV (13,261 vs 15,437 MiB) and keeps 5/5 vs Q8's 1/5 — lower memory and intact retrieval at the same time.
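      The percentages in the table above follow directly from the measured MiB figures. A short sketch of the arithmetic, with a helper name of our own choosing:

```python
# Live VRAM (MiB) at 128K, copied from the llama.cpp table above.
live_vram_mib = {
    "fp16 KV": 22_657,
    "Q8 KV": 15_437,
    "Q4 KV": 11_287,
    "fraQtl D1": 13_261,
}

def pct_lower(value, baseline):
    """Percent reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

fp16 = live_vram_mib["fp16 KV"]
for name, mib in live_vram_mib.items():
    if name != "fp16 KV":
        print(f"{name}: {pct_lower(mib, fp16):.1f}% lower than fp16 KV")

q8 = live_vram_mib["Q8 KV"]
print(f"fraQtl D1: {pct_lower(live_vram_mib['fraQtl D1'], q8):.1f}% lower than Q8 KV")
```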

      Retrieval gating has since been strengthened: the vLLM runtime receipts above use 7-depth × 3-key exact-match passkey grids, with 189/189 cells on Qwen3-4B-2507 and 126/126 on Mistral-7B. See the lead receipt ↑

      RESEARCH BACKING — partial-stack bake-off vs published methods (NIAH, PPL Δ, KIVI / TOVA / SnapKV)

Four numbers vs the nearest published method.

Partial-stack research bake-off (Mistral / Llama, sub-4-bit, C44). The headline customer claim is the D1 llama.cpp receipt above; these are a separate research comparison, reproducible on the benchmark page.

METRIC	FRAQTL	NEAREST PUBLISHED	DIFFERENCE
NIAH retention (1080 trials)	98.5%	97.8% · TOVA	+0.7 pp
PPL Δ at 3.5× compression	+0.012	+0.214 · SnapKV	~18× tighter
NIAH at 128K (Llama 3.1 8B)	100%	0% · KIVI-2	+100 pp
GPUs needed at 128K (35B MoE)	1× A100-80GB	2× A100-80GB · FP16	1× vs 2×

      Sources: Mistral-7B-Instruct C44 bake-off (1080 NIAH trials, 8K–31K, 3 needle types). Llama 3.1 8B C44d/e (128K, 3 needles × 5 trials per depth). Qwen 3.6 35B-A3B VRAM measured on a single A100-80GB.

      Honest note: at matched 4-bit, fraQtl ties KIVI-4 / KVQuant-4 within sampling noise. The wins above are at sub-4-bit, at 128K, vs eviction methods, and on hardware footprint.

Full method breakdown · 7 baselines, 7 axes

METHOD	MEMORY	QUALITY	CALIBRATION	OVERHEAD	NIAH (1080 TRIALS)	PPL Δ (V-ONLY)
fraQtl (V-only)	3.5× ✓	✓ within noise	✓ 0.3s	none measured (C44)	98.5%	+0.012
fraQtl (V+K)	~3× ✓	✓ within noise	✓ 0.3s	benchmark-dependent	97.8%	—
FP16 baseline	1×	—	—	—	97.0%	0
TOVA	✓	⚠ degrades	✓ none	per-token	97.8%	+0.259
SnapKV	✓	⚠ degrades	✓ none	per-token	94.1%	+0.214
StreamingLLM	✓	✗ poor	✓ none	per-token	37.8%	+0.548
H2O†	✓	✗ OOM	✓ none	eager attn	0.0%	—
KVQuant-2	✓ 2-bit	⚠ moderate	✗ 5–15 min	custom kernels	—	+0.27
KIVI-2	✓ 2-bit	✗ collapses at 128K	✓ none	per-token	37.8% → 0%@128K	+1.00

NIAH: needle-in-a-haystack retrieval. C44 bake-off, Mistral-7B-Instruct, 8K–31K context, 3 needle types × 360 trials per method.
† H2O: OOM at 16K+ context (eager attention required).

RECEIPT 2

Quality matches FP16 within sampling noise.

Qwen 3.6 35B-A3B compressed (single-seed) · vs FP16 baseline

BENCHMARK	FP16	FRAQTL	Δ
MMLU 0-shot	82.40%	82.24%	−0.16 pp
∞Bench Passkey @ 125,315 tokens	30 / 30	30 / 30	parity
HumanEval pass@1	reference	100% retention	within sampling noise
Wikitext-2 PPL Δ	—	+0.033	tight

      Single-seed numbers from the public Qwen 3.6 artifact. 3-seed bit-identical verification on PPL: σ = 0 across seeds 7 / 123 / 2024. Architecture-level 3-seed validation (V/K theorem) on Mistral 7B GQA-4 + Qwen 2.5 3B GQA-2 + Llama 3.2 3B + Phi-3.
    

RECEIPT 3

Memory savings break the multi-GPU requirement.

VRAM · Single A100-80GB

QWEN 3.6 35B-A3B

FP16 OOMs at 64K. fraQtl runs 128K with 28 GB headroom.

Same model. Same hardware. 8× the context. The difference between needing 1 GPU and needing 2.

CONTEXT	FP16 BASELINE	FRAQTL	HARDWARE
16K	~71 GB	25.6 GB	Both fit on 1× A100-80GB
64K	82.9 GB → OOM	36.8 GB	FP16 needs 2 GPUs; fraQtl 1
128K	85+ GB	51.7 GB	FP16 needs 2 GPUs; fraQtl 1 · 28 GB free

FP16 64K = 82.86 GB measured (OOMs on 80 GB ceiling). 16K and 128K FP16 figures are model + KV-cache extrapolations — conservative lower bounds.
fraQtl numbers measured on a single A100-80GB with the public artifact at huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed.
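The fit and headroom statements above reduce to a comparison against the card's nominal 80 GB. A small sketch of that check, using the figures from the table (the 16K and 128K FP16 entries are the extrapolated lower bounds noted above, and the helper names are ours):

```python
GPU_GB = 80.0  # nominal capacity of one A100-80GB

# VRAM in GB per context length, copied from the table above.
vram_gb = {
    "16K": {"fp16": 71.0, "fraqtl": 25.6},    # fp16 extrapolated
    "64K": {"fp16": 82.86, "fraqtl": 36.8},   # fp16 measured
    "128K": {"fp16": 85.0, "fraqtl": 51.7},   # fp16 extrapolated lower bound ("85+")
}

def fits(used_gb, capacity_gb=GPU_GB):
    """True when the footprint is within a single GPU's capacity."""
    return used_gb <= capacity_gb

def headroom(used_gb, capacity_gb=GPU_GB):
    """Free memory left on the GPU; negative means out of memory."""
    return capacity_gb - used_gb

for ctx, row in vram_gb.items():
    for arm, used in row.items():
        status = "fits" if fits(used) else "OOM"
        print(f"{ctx} {arm}: {used} GB -> {status}, headroom {headroom(used):+.1f} GB")
```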

RESEARCH ROADMAP

What's shipping vs what's in the lab.

Two lanes. Public means measured, reproducible, and safe to deploy today. Research means active work we're not making customer claims on yet — listed so you can see where the substrate is heading.

● Public · website-ready

Measured

D1 long-context KV

Mistral-7B 128K in llama.cpp: below Q8 live VRAM, 5/5 retrieval. See the llama.cpp receipt above.

Measured · new

vLLM full-speed KV

95–99% of fp16 decode at 2.42× KV capacity, needle intact. See the lead receipt above.

Public artifact

Qwen 3.6 35B-A3B

Compressed weight artifact on Hugging Face. 128K on 1× A100, MMLU 82.24% vs 82.40% FP16.

Quality anchor

Hi-Fi calibration

FP16-reference calibration recipe behind the public artifact — quality and size, no speed claim.

◐ Research · in progress · no customer claims yet

MoE expert compression
Task-aware sidecars
vLLM batch>1 · concurrency speed matrix
Multi-depth NIAH sweep

Active research, not part of the pilot deliverable and carrying no performance or accuracy claim until published with receipts. See the research →

NEW · CAIRN BY FRAQTL

Your agents repeat themselves. Most caches would get it wrong.

fraQtl compresses the memory inference reads — CAIRN recycles the work agents repeat. It audits your agent’s tool-call traces and certifies which repeated work is provably safe to reuse, priced honestly, net of provider prompt caching. Read-only, local, open source.

MEASURED
CAIRN AUDIT · PUBLIC CODING-AGENT CORPORA
METRIC	VALUE
tool commands audited	4,152,027
re-reads of prior work	437,013
“safe-looking” hits that were stale	64.4%
reuse certified by output identity	50,632
tokens avoidable (point / carried)	88.9M / 2.15B
Kwai SWE-smith 66k + NVIDIA SWE-Hero OpenHands · raw JSON receipts in the repo. Carried figure is an upper-bound model, labeled as such.

CERTIFIED

Provably identical — reuse it. Provenance matched, output hash identical.

DELTA

Changed — serve the diff. A compact delta instead of re-ingesting everything.

BLOCKED

Unprovable — re-run live. The refusal is the product.
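The three verdicts can be pictured as a small decision rule over a pair of trace entries. The sketch below is purely illustrative and is not CAIRN's code or API: the `ToolCall` record, the `provenance` field (imagined here as a fingerprint of whatever inputs the command read), and the `classify` function are all hypothetical names.

```python
import difflib
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ToolCall:
    command: str
    provenance: Optional[str]  # hypothetical input fingerprint; None if it cannot be established
    output: str

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def classify(prior: ToolCall, repeat: ToolCall) -> Tuple[str, Optional[str]]:
    """Return (verdict, delta) for a repeated call found in a trace."""
    # Not the same command, or provenance unknown on either side: nothing can be proven.
    if prior.command != repeat.command or None in (prior.provenance, repeat.provenance):
        return "BLOCKED", None
    # Same inputs and byte-identical output: the earlier result was safe to reuse.
    if prior.provenance == repeat.provenance and sha256(prior.output) == sha256(repeat.output):
        return "CERTIFIED", None
    # Otherwise something changed: a compact diff stands in for the full output.
    diff = "\n".join(
        difflib.unified_diff(prior.output.splitlines(), repeat.output.splitlines(), lineterm="")
    )
    return "DELTA", diff

first = ToolCall("cat app.py", "rev-a", "print('hi')")
print(classify(first, ToolCall("cat app.py", "rev-a", "print('hi')"))[0])
print(classify(first, ToolCall("cat app.py", "rev-b", "print('hello')"))[0])
print(classify(first, ToolCall("cat app.py", None, "print('hi')"))[0])
```

In this sketch the default is refusal: a call is only marked reusable when both the inputs and the output hash line up.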

EXPLORE CAIRN → RUN THE AUDIT — OPEN SOURCE

Task-aware inference compression for long-context LLMs.