BENCHMARKS · PUBLIC ARTIFACTS · WEIGHT MATRIX

Receipts.

Lead receipt: Mistral-7B at 128K in llama.cpp — fraQtl D1 KV below Q8 memory with retrieval preserved. Plus the public Qwen 3.6 35B-A3B compressed artifact (MMLU + ∞Bench + VRAM) and the matched-4-bit weight matrix across MHA, GQA-2, GQA-3, and GQA-4. Every number traces to a committed script and raw JSON, reproducible under golden_eval v1.

Headline receipt: Mistral-7B at 128K context in llama.cpp — fraQtl D1 runs at 13,261 MiB live VRAM while keeping 5/5 needle retrieval. That is 41.5% below the fp16 KV baseline and 14.1% below Q8 KV — and at the same long context Q8 KV drops to 1/5 retrieval and Q4 KV to 0/5.
Public artifact: Qwen 3.6 35B-A3B compressed runs 128K context on 1× A100-80GB at 51.7 GB VRAM, MMLU 82.24% (FP16: 82.40%), ∞Bench passkey 30/30 at 125,315 tokens.
Weight matrix: matched-4-bit results across 4 architecture families using the same core recipe (K_GATE_UP = K_DOWN = 256, INT3 + sign correction), with validation reported per model.
Protocol: golden_eval v1 · WikiText-2 64×512 · 256+256 prefix+continuation · PPL + KL(FP16 ‖ compressed)

Measured and extrapolated values are labeled separately. Tags: Public artifact · Runtime KV · Research-grade

D1 · Long-context KV · Mistral-7B 128K · llama.cpp

Live VRAM and needle-in-a-haystack retrieval at 128K context, measured in llama.cpp. Same model, same prompt set — only the KV-cache representation changes. Standard low-bit KV (Q8, Q4) reduces memory but loses retrieval at this context length; fraQtl D1 reduces memory below Q8 while retrieval stays intact.

KV-cache config	Live VRAM @ 128K	NIAH retrieval	VRAM vs fp16	Retrieval at long context
fp16 KV (baseline)	22,657 MiB	5 / 5	—	full retrieval, full memory
Q8 KV	15,437 MiB	1 / 5	−31.9%	retrieval breaks
Q4 KV	11,287 MiB	0 / 5	−50.2%	retrieval gone
fraQtl D1	13,261 MiB	5 / 5	−41.5%	below Q8 memory, retrieval preserved

NIAH = needle-in-a-haystack retrieval at 128K context (5 needles). Live VRAM is the measured resident footprint reported by llama.cpp for the KV-cache configuration shown.
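
For orientation, the size of the KV cache alone can be estimated from the attention shape. The sketch below is a back-of-envelope estimate, not the measurement behind the table: it assumes Mistral-7B's public configuration (32 layers, 8 KV heads, head dimension 128), takes 128K to mean 131,072 tokens, and uses llama.cpp's Q8_0 and Q4_0 block layouts (32 values plus one fp16 scale per block). Live VRAM in the table also includes weights and runtime buffers, so these figures are not expected to match it.

```python
# Mistral-7B attention shape (public model config): 32 layers, 8 KV heads, head dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

# Storage cost per cached value, in bits. Q8_0 and Q4_0 blocks hold 32 values
# plus one fp16 scale: 34 bytes and 18 bytes per block respectively.
BITS_PER_VALUE = {
    "fp16": 16.0,
    "q8_0": 34 * 8 / 32,  # 8.5 bits per value
    "q4_0": 18 * 8 / 32,  # 4.5 bits per value
}


def kv_cache_mib(n_tokens: int, bits_per_value: float) -> float:
    """Estimated KV-cache size in MiB for one sequence of n_tokens."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V tensors
    return n_tokens * values_per_token * bits_per_value / 8 / 2**20


N_TOKENS = 128 * 1024  # assumption: "128K" read as 131,072 tokens

estimates = {fmt: kv_cache_mib(N_TOKENS, bits) for fmt, bits in BITS_PER_VALUE.items()}
for fmt, mib in estimates.items():
    print(f"{fmt:>5}: {mib:,.0f} MiB")
```

Under these assumptions the fp16 cache alone is 16 GiB, which is why the KV representation dominates the memory budget at this context length.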

fraQtl D1 vs Q8 KV: 14.1% lower live VRAM (13,261 vs 15,437 MiB) and 5/5 vs 1/5 retrieval — lower memory and intact retrieval at the same time.
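
The percentages quoted here follow directly from the live-VRAM column. A minimal check, using only the MiB values from the table above (the helper name `pct_below` is illustrative):

```python
# Live VRAM at 128K in MiB, copied from the table above.
LIVE_VRAM_MIB = {
    "fp16 KV": 22_657,
    "Q8 KV": 15_437,
    "Q4 KV": 11_287,
    "fraQtl D1": 13_261,
}


def pct_below(value: float, reference: float) -> float:
    """Percent by which `value` sits below `reference`."""
    return (reference - value) / reference * 100.0


baseline = LIVE_VRAM_MIB["fp16 KV"]
vs_fp16 = {
    name: round(pct_below(mib, baseline), 1)
    for name, mib in LIVE_VRAM_MIB.items()
    if name != "fp16 KV"
}
d1_vs_q8 = round(pct_below(LIVE_VRAM_MIB["fraQtl D1"], LIVE_VRAM_MIB["Q8 KV"]), 1)

print(vs_fp16)   # reductions relative to the fp16 KV baseline
print(d1_vs_q8)  # fraQtl D1 relative to Q8 KV
```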

The clean claim: at 128K, fraQtl D1 holds full 5/5 retrieval at 13,261 MiB — 41.5% under fp16 KV and 14.1% under Q8 KV. Standard low-bit KV gives up retrieval to reach lower memory; fraQtl reaches lower memory and keeps retrieval. Importance-aware compression, not uniform low-bit quantization.

00 · Public artifact · Qwen 3.6 35B-A3B

Live on Hugging Face and loads through the standard Transformers workflow. 23.8 GB on disk. MoE: 35B total parameters, 3B active.

Metric	fraQtl compressed	FP16 reference	Δ	Notes
MMLU (5-shot)	82.24%	82.40%	−0.16 pp	14,042 questions, 57 subjects
∞Bench passkey	30 / 30	30 / 30	tied	at 125,315 tokens
VRAM @ 16K context	25.6 GB	~71 GB	−45 GB	both fit on 1× A100-80GB
VRAM @ 64K context	36.8 GB	82.86 GB → OOM	FP16 needs 2× A100	measured OOM on 80 GB ceiling
VRAM @ 128K context	51.7 GB	85+ GB	FP16 needs 2× A100	28 GB headroom on 1× A100
Disk size	23.8 GB	~70 GB	~3× smaller	safetensors, full artifact

The FP16 VRAM figures at 16K and 128K are extrapolated from model size plus KV-cache size and are conservative lower bounds. The 64K FP16 figure (82.86 GB) is measured and OOMs against the 80 GB ceiling.

Weight-compressed artifact. Runtime KV-cache compression is a separate early-access layer and is not stacked on top of this artifact in these numbers.

Reproducible: huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed · loads through the standard Transformers workflow
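
A minimal sketch of what "the standard Transformers workflow" usually looks like, assuming the repository loads with the stock `AutoTokenizer` and `AutoModelForCausalLM` classes. The repo id is taken from the URL above; whether extra packages or flags such as `trust_remote_code` are needed is not stated here, so treat the arguments as assumptions and check the model card.

```python
REPO_ID = "fraQtl/Qwen3.6-35B-A3B-compressed"  # from the URL above


def load_compressed(repo_id: str = REPO_ID):
    """Load tokenizer and model with the generic Auto* classes."""
    # Imported lazily so this file can be read and imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype="auto",  # keep the dtype stored in the checkpoint
        device_map="auto",   # needs `accelerate`; places weights on available GPUs
    )
    return tokenizer, model


def demo(prompt: str = "The capital of France is") -> str:
    """Generate a short continuation as a smoke test (downloads the artifact)."""
    tokenizer, model = load_compressed()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=16)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `demo()` downloads the full artifact, so it is left uncalled here.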

01 · The weight matrix

All Δ PPL and KL are against each row's own FP16 baseline (same eval set). Single-seed unless marked. 3-seed Mistral row is mean ± std.
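
To make the two metrics concrete, here is a small pure-Python sketch of Δ PPL and mean KL(FP16 ‖ compressed) computed from per-position logits. It is an illustration of the definitions, not the golden_eval v1 script: the toy logits are made up, and how the 256+256 prefix+continuation split selects scored positions is not spelled out on this page, so the sketch simply scores whatever positions it is given.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(logits_per_pos, targets):
    """exp(mean negative log-likelihood of the target tokens)."""
    nll = -sum(log_softmax(l)[t] for l, t in zip(logits_per_pos, targets))
    return math.exp(nll / len(targets))


def mean_kl(ref_logits, cmp_logits):
    """Mean KL(ref || cmp) over positions, with ref as the FP16 model."""
    total = 0.0
    for r, c in zip(ref_logits, cmp_logits):
        log_p, log_q = log_softmax(r), log_softmax(c)
        total += sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))
    return total / len(ref_logits)


# Toy 3-token vocabulary, two scored positions (illustrative numbers only).
fp16_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
comp_logits = [[1.9, 0.6, -1.0], [0.2, 1.4, 0.3]]
targets = [0, 1]

delta_ppl = perplexity(comp_logits, targets) - perplexity(fp16_logits, targets)
kl = mean_kl(fp16_logits, comp_logits)
print(f"delta PPL = {delta_ppl:+.4f}, KL = {kl:.4f}")
```

Δ PPL only looks at the probability of the reference token, while KL compares the full output distributions, which is why the two columns can rank methods differently.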

Model / Architecture	FP16 PPL	bnb NF4 Δ / KL / b/w	AWQ 4-bit Δ / KL / b/w	GPTQ 4-bit Δ / KL / b/w	fraQtl INT3+sign Δ / KL / b/w
Mistral-7B-Instruct-v0.2 (GQA-4)	6.6068	+0.1430 / 0.0274 / 4.50 (3-seed)	+0.1590 ±0.0001 / 0.0293 / 4.12 (3-seed)	+0.1721 / 0.0436 / 4.12	+0.0504 ±0.0108 / 0.0165 / 3.62 (3-seed)
Llama-3.2-3B-Instruct (GQA-3)	12.3720	+0.7445 / 0.0644 / 4.50	+0.8015 / 0.0605 / 4.12	BROKEN ¹	+0.4279 / 0.0254 / 3.86
Qwen2.5-3B-Instruct (GQA-2)	8.3597	+0.5091 / 0.0843 / 4.50	+0.5865 / 0.0775 / 4.12 ²	+0.5945 / 0.0991 / 4.12	+0.2241 / 0.0362 / 4.18
Phi-3-mini-4k-instruct (TRUE MHA)	6.5048	+0.5873 / 0.0965 / 4.50	FAILED ³	n/a ⁴	+0.2061 / 0.0466 / 3.86

¹ gptqmodel 1.9.0 predates Llama 3 (released 2024-04): library-version failure, not a GPTQ-method failure. Newer gptqmodel 2.x has its own PyPI-drift instability (blocked original C61). Excluded from ratio/scoreboard math.

² AWQ on Qwen2 needed a nn.Module.__getattr__ monkey-patch inside awq.quantize() to forward Catcher's missing attention_type attribute. Script: notebooks/benchmarks/C60_qwen_awq_patched.py.

³ AWQ on Phi-3 hits a DIFFERENT AutoAWQ bug (KeyError: 'type') than Qwen2's. Not chased per sunk-cost rule; deferred to llm-compressor sprint. Script: C63_phi3_mha_golden.py.

⁴ GPTQ not attempted on Phi-3 this session (focus: MHA universality via bnb/fraQtl).

fraQtl ratios vs each peer

fraQtl is tighter than every usable peer on every row: Δ PPL ratios run from 1.74× to 3.41× and KL ratios from 1.66× to 2.74×. The MHA result (Phi-3, 2.85× vs bnb) lands in the same band as the GQA-4 result (Mistral, 2.84× vs bnb).

Model	Δ vs bnb NF4	Δ vs AWQ 4-bit	Δ vs GPTQ 4-bit	KL vs bnb	KL vs AWQ	KL vs GPTQ
Mistral 7B Instruct	2.84× tighter	3.16× tighter	3.41× tighter	1.66×	1.78×	2.64×
Llama 3.2 3B Instruct	1.74× tighter	1.87× tighter	n/a (GPTQ broken)	2.54×	2.39×	n/a
Qwen 2.5 3B Instruct	2.27× tighter	2.62× tighter	2.65× tighter	2.33×	2.14×	2.74×
Phi-3-mini-4k (MHA)	2.85× tighter	n/a (AWQ failed)	n/a (not attempted)	2.07×	n/a	n/a
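
The ratios can be recomputed from the weight matrix as peer value divided by fraQtl value. The sketch below uses the rounded Δ PPL and KL entries printed in that matrix, so results can differ from the published ratios in the last digit (for example about 3.15× rather than 3.16× for Mistral vs AWQ), presumably because the published ratios were computed from unrounded data.

```python
# (delta PPL, KL) per method, copied from the weight matrix; None = no usable number.
MATRIX = {
    "Mistral 7B": {"bnb": (0.1430, 0.0274), "awq": (0.1590, 0.0293),
                   "gptq": (0.1721, 0.0436), "fraqtl": (0.0504, 0.0165)},
    "Llama 3.2 3B": {"bnb": (0.7445, 0.0644), "awq": (0.8015, 0.0605),
                     "gptq": None, "fraqtl": (0.4279, 0.0254)},
    "Qwen 2.5 3B": {"bnb": (0.5091, 0.0843), "awq": (0.5865, 0.0775),
                    "gptq": (0.5945, 0.0991), "fraqtl": (0.2241, 0.0362)},
    "Phi-3-mini-4k": {"bnb": (0.5873, 0.0965), "awq": None,
                      "gptq": None, "fraqtl": (0.2061, 0.0466)},
}


def ratios(model):
    """Return {peer: (delta PPL ratio, KL ratio)}; None where the peer has no number."""
    ours = MATRIX[model]["fraqtl"]
    out = {}
    for peer in ("bnb", "awq", "gptq"):
        theirs = MATRIX[model][peer]
        out[peer] = None if theirs is None else (theirs[0] / ours[0], theirs[1] / ours[1])
    return out


for model in MATRIX:
    cells = {
        peer: "n/a" if pair is None else f"{pair[0]:.2f}x / {pair[1]:.2f}x"
        for peer, pair in ratios(model).items()
    }
    print(model, cells)

# Smallest ratio across the whole table, for comparison with a 1.5x threshold.
all_ratios = [r for m in MATRIX for pair in ratios(m).values() if pair for r in pair]
print(f"min ratio = {min(all_ratios):.2f}x")
```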

Scoreboard — honest counts

BROKEN peer results (Llama GPTQ) and library failures (AWQ on Phi-3) are NOT counted as fraQtl wins — they're counted as "peer didn't run."

Peer	Attempted	Usable	fraQtl wins (matched)
bnb NF4	4/4	4/4	4/4
AWQ 4-bit	4/4	3/4 (Qwen needed patch, Phi-3 failed)	3/3
GPTQ 4-bit	3/4 (Phi-3 not attempted)	2/3 (Llama broken)	2/2

Total: 9/9 matched-bits wins where the peer produced a usable number, across 4 architectures. Same core recipe with per-model validation; no architecture-specific algorithm changes.
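
The counting rule can be written down explicitly: a peer that was skipped or produced no usable number is removed from the denominator and never credited as a win. The sketch below encodes the outcomes as read from the tables above; the outcome labels (`win`, `unusable`, `skipped`) are illustrative names, not identifiers from the fraQtl scripts.

```python
from collections import Counter

# Outcome per (peer, model):
#   "win"      = peer produced a usable number and fraQtl was tighter
#   "unusable" = peer was attempted but produced no usable number
#   "skipped"  = peer was not attempted
OUTCOMES = {
    ("bnb NF4", "Mistral"): "win", ("bnb NF4", "Llama"): "win",
    ("bnb NF4", "Qwen"): "win", ("bnb NF4", "Phi-3"): "win",
    ("AWQ 4-bit", "Mistral"): "win", ("AWQ 4-bit", "Llama"): "win",
    ("AWQ 4-bit", "Qwen"): "win", ("AWQ 4-bit", "Phi-3"): "unusable",
    ("GPTQ 4-bit", "Mistral"): "win", ("GPTQ 4-bit", "Llama"): "unusable",
    ("GPTQ 4-bit", "Qwen"): "win", ("GPTQ 4-bit", "Phi-3"): "skipped",
}


def scoreboard(outcomes):
    """Return {peer: (attempted, usable, wins)}; non-runs never count as wins."""
    tallies = {}
    for (peer, _model), result in outcomes.items():
        tallies.setdefault(peer, Counter())[result] += 1
    board = {}
    for peer, c in tallies.items():
        attempted = c["win"] + c["loss"] + c["unusable"]
        usable = c["win"] + c["loss"]
        board[peer] = (attempted, usable, c["win"])
    return board


board = scoreboard(OUTCOMES)
total_wins = sum(wins for _, _, wins in board.values())
total_usable = sum(usable for _, usable, _ in board.values())
print(board)
print(f"{total_wins}/{total_usable} matched-bits wins")
```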

02 · fraQtl config key

fraQtl has TWO configs in the KV lane. Don't conflate them. Read before the KV tables below.

PARTIAL STACK · EXPERIMENT-GRADE

fraQtl V-only 4b

Eigenbasis only + INT4 uniform quantization. No sign correction.

C44b (1080-cell NIAH) · C44c sanity · C44d (128K multi-needle) · C44e (128K depth-sweep)

FULL STACK · MIXED-PRECISION LOW-BIT COMPRESSION

fraQtl V-only INT3 + sign

Eigenbasis + k=16 FP16 protect + LM-INT3 on sacrifice dims + sign correction. 3-seed-validated, pinned-eval.

C38a cross-arch 3-seed · C25q Qwen 3.6 35B-A3B MoE

How to read these: headline numbers use the full stack (INT3+sign) from C38a. Long-context retrieval comparisons use the partial stack (V-only 4b) from the C44 family — both are reported and labeled per row.
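
To illustrate the general shape of a "protect k dimensions, quantize the rest" scheme, here is a deliberately simplified sketch. It is not fraQtl's implementation: it assumes the vector has already been rotated into an eigenbasis with the most important coordinates first, uses a plain symmetric uniform 3-bit quantizer as a stand-in for LM-INT3, and omits sign correction entirely, since neither is specified on this page. All function names are hypothetical.

```python
def quantize_uniform(values, bits=3):
    """Symmetric uniform quantizer with 2**bits - 1 levels spanning +/- max|v|."""
    q_max = 2 ** (bits - 1) - 1  # 3 for bits=3, so integer codes run -3..3
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in values]
    scale = peak / q_max
    return [round(v / scale) * scale for v in values]


def mixed_precision(coords, k_protect=16, bits=3):
    """Keep the first k_protect coordinates untouched, quantize the remainder.

    `coords` is one vector already expressed in the eigenbasis, ordered from
    most to least important (an assumption of this sketch).
    """
    protected = list(coords[:k_protect])  # would be stored in FP16
    sacrificed = quantize_uniform(coords[k_protect:], bits)
    return protected + sacrificed


def effective_bits(dim, k_protect=16, low_bits=3, high_bits=16):
    """Average payload bits per coordinate, ignoring scales and other metadata."""
    return (k_protect * high_bits + (dim - k_protect) * low_bits) / dim


coords = [10.0] * 16 + [0.9, -0.5, 0.1, 0.0]
out = mixed_precision(coords, k_protect=16, bits=3)
print(out[16:])
print(effective_bits(128))
```

For a 128-wide head with k=16, the payload works out to 4.625 bits per coordinate, roughly 3.46× smaller than FP16 before any scale or correction overhead.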

03 · KV cache substrate

The matrix above compresses weights. fraQtl's V-theorem + sign-correction mechanism also applies to the KV cache, a tensor that is produced dynamically at runtime. Cross-architecture KV results follow: per-needle NIAH and 3-seed PPL/KL where measured.

Cross-arch NIAH at matched 4-bit vs KIVI / KVQuant / eviction peers

Config = fraQtl V-only 4b (partial stack, eigenbasis + INT4 uniform, no sign correction). Full-stack INT3+sign numbers in the PPL/KL table below.

Model / Arch	Protocol	fraQtl V-only 4b	fraQtl V+K 4b	KIVI-4	KVQuant-4	KIVI-2	PyramidKV 0.7
Mistral 7B Instruct (GQA-4)	C44b · 1080-cell NIAH 4K–31K	94.4%	93.3%	93.3%	93.3%	37.8%	86.1%
Qwen 2.5 3B Instruct (GQA-2)	C44b · 1080-cell NIAH 4K–31K	99.4%	79.4% ⁵	98.9%	97.8% / 100% (sink0)	1.7%	69.4%
Llama 3.1 8B Instruct (GQA-8)	C44d · 128K multi-needle	93.3%	n/a	93.3%	n/a	0.0%	n/a
Llama 3.1 8B Instruct (GQA-8)	C44e · 128K depth-sweep ⁶	100%	n/a	100%	n/a	0.0%	n/a

⁵ fraQtl V+K on Qwen 3B GQA-2 used Mistral's blind k_protect — per-model calibration pending (see EVAL-PROTOCOL-LOCKED). Not a fraQtl limitation claim.

⁶ notebooks/benchmarks/C44e_llama3_8b_128k_shallow_depth.py (separate from C44e_pyramidkv_bakeoff.py).

The partial-stack V-only 4b configuration already ties or edges out KIVI-4 on every row of this retrieval grid; the full-stack INT3+sign numbers in the table below extend the matched-bits margin further.

Cross-arch KV PPL/KL 3-seed pinned

The 3-seed rows below are the headline numbers; 1-seed rows are marked as such. Full-stack = INT3 + sign correction.

Model / Arch	Config	Δ PPL (3-seed)	KL (3-seed)	NIAH (3-seed)	Source
Partial stack · C25 family
Mistral 7B Instruct (GQA-4)	V-only k=16 LM-INT3	+0.027 ±0.005	0.00381 ±0.00001	—	C25
Mistral 7B Instruct (GQA-4)	V+K k=16 LM-INT3	+0.062 ±0.001	0.00751 ±0.00013	—	C25
Full stack · INT3 + sign · C38a (3-seed)
Mistral 7B Instruct (GQA-4)	V-only k=16 INT3+sign	+0.0015 ±0.0044	0.00317	—	C38a v2
Mistral 7B Instruct (GQA-4)	K k=16 INT3+sign	+0.0070 ±0.0043	—	—	C38a
Llama 3.2 3B Instruct (GQA-3)	V k=8 INT3+sign	+0.0181 ±0.015	0.00308	99.4% (179/180)	C38a
Llama 3.2 3B Instruct (GQA-3)	K k=16 INT3+sign	+0.0221 ±0.0079	0.00312	100% (180/180)	C38a
Qwen 2.5 3B Instruct (GQA-2)	V k=8 INT3+sign	+0.0542 ±0.0116	0.00362	51.7% ⁷	C38a
Phi-3-mini-128k-instruct (TRUE MHA)	V-only k=16 INT3+sign ⁹	+0.0002 (1-seed)	0.00122	93.3% (56/60)	C62 2026-04-21
Phi-3-mini-128k-instruct (TRUE MHA)	V+K k=16 INT3+sign ⁹	+0.0073 (1-seed)	0.00383	95.0% (57/60 — ties FP16)	C62 2026-04-21
Partial stack · MoE hybrid attention · C25q
Qwen 3.6 35B-A3B (MoE hybrid)	V-only k=16 INT3	+0.045 ±0.011	0.0183 ±0.0023	—	C25q
Qwen 3.6 35B-A3B (MoE hybrid)	V+K k=16 INT3	+0.166 ±0.049	0.0221 ±0.0015	—	C25q

⁷ Qwen 2.5 3B V NIAH: 3-seed mean 51.7%, FP16 baseline 58.3% — small-model short-context NIAH has low ceiling. KLD is 1.5× tighter than INT4 uniform; honest mixed result on GQA-2 V cache, not a clean NIAH win. The sign-correction paradigm's KLD advantage is the cross-arch invariant; PPL/NIAH narrow on GQA-2.

⁹ Phi-3 MHA rows are 1-seed (C62 patched run, source commit 3a0bff7). 1-seed is reported for the ratio comparisons here — 140× tighter Δ PPL vs KIVI-4 and 15.5× vs KVQuant-4; a 3-seed re-run is the basis for any absolute-delta public citation. Raw JSON verified against run output.

Full-stack carries the cross-architecture story onto MHA. Partial-stack on MHA KV previously lost to KIVI-4 (73.3% vs 91.7%); full-stack V-only INT3+sign flips it — +0.0002 Δ PPL (140× tighter than KIVI-4), 93.3% NIAH. V+K ties FP16 at 95.0%.

Caveat: the Phi-3 MHA V and V+K rows are 1-seed; a 3-seed re-run is the basis for any absolute-delta public citation. The ratio comparisons above hold at 1-seed.

04 · KIVI-2 long-context retrieval failure mode

Combined signal from C44b + C44c sanity + C44d + C44e shallow-depth. Every context length. Every needle type. Every depth position.

Context	Architecture	Grid	KIVI-2 retention
4K–31K	Mistral 7B Instruct (GQA-4)	1080-cell protocol · 180 cells per config (3 needles × 4 ctx × 5 depths × 3 trials)	37.8%
4K–31K	Qwen 2.5 3B Instruct (GQA-2)	1080 cells	1.7%
128K	Llama 3.1 8B Instruct (GQA-8)	15 cells · 3 needles × 5 trials @ depth 50 (C44d)	0.0%
128K	Llama 3.1 8B Instruct (GQA-8)	9 cells · technical_password × 3 depths × 3 trials (C44e)	0.0%

Pattern: KIVI-2's per-token K quantization collapses at every context length, every needle type, and every depth position tested. fraQtl V-only 4-bit ties or beats KIVI-4 on all four grids above; the one partial-stack loss to KIVI-4 is Phi-3 MHA (see §05). fraQtl's lower-bit configuration, V-only k=16 INT3, sits at 3.5× compression, a different Pareto point from KIVI-2's 8×.
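
The grid sizes in the table can be checked by enumerating cells. In the sketch below only the factor counts come from the table (3 needles × 4 context lengths × 5 depths × 3 trials); the needle labels, exact context lengths, and depth fractions are placeholders. The factors multiply out to 180 cells, which matches the 179/180 and 180/180 counts reported in §03; reading "1080 cells" as six such 180-cell grids is an inference, not something this page states.

```python
from itertools import product

# Placeholder axis values: only the number of entries per axis is from the table.
NEEDLES = ["needle_a", "needle_b", "needle_c"]
CONTEXT_LENGTHS = [4_000, 8_000, 16_000, 31_000]
DEPTHS = [0.0, 0.25, 0.5, 0.75, 1.0]
TRIALS = range(3)

cells = list(product(NEEDLES, CONTEXT_LENGTHS, DEPTHS, TRIALS))


def retention(hits: int, total: int) -> float:
    """Retrieval rate in percent."""
    return 100.0 * hits / total


print(len(cells))                         # cells in one grid
print(round(retention(179, len(cells)), 1))  # the 179/180 case from section 03
```

On a 180-cell grid each percentage in the retrieval tables corresponds to a whole number of hits; 37.8%, for instance, is what 68 of 180 would give.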

05 · KV substrate · scoreboard

fraQtl V-only 4b (partial stack) vs peer KV families. The full-stack INT3+sign numbers in §03 extend the margin further.

Peer (KV)	Architectures tested	fraQtl partial-stack V-only 4b result
KIVI-2 (per-token K 2-bit)	Mistral GQA-4 / Qwen GQA-2 / Llama 3.1 GQA-8 @128K × 2 protocols / Phi-3 MHA	5/5 catastrophic margins (fraQtl V-only ≥73% vs KIVI-2 ≤38%)
KIVI-4 (per-token K 4-bit)	Mistral / Qwen / Llama 3.1 @128K × 2 / Phi-3 MHA	4/5 non-losses — Mistral + Qwen + 2× Llama 128K non-losses; Phi-3 MHA partial-stack 73.3% vs KIVI-4 91.7% — LOSES (full-stack V-only is 140× tighter Δ PPL vs KIVI-4 — see §03)
KVQuant-4	Mistral / Qwen / Phi-3 MHA	2/3 ties or wins on GQA (Mistral 94.4>93.3 / Qwen 99.4>97.8 / Qwen sink0 100). Phi-3 MHA partial-stack 73.3% vs KVQuant-4 93.3% — LOSES (277× Δ PPL gap); full-stack V-only flips it at 15.5× tighter Δ PPL, V+K ties NIAH at 95.0% — see §03
PyramidKV 0.7	Mistral / Qwen	2/2 wins (+8.3 pp / +30.0 pp aggregate over PyramidKV)
SnapKV / H2O / StreamingLLM / TOVA / ExpectedAttention (C44 original, Mistral only)	Mistral 7B Instruct	fraQtl 100% · TOVA 97.8% (near-tie) · SnapKV 94.1% · ExpectedAttention 53.3% · StreamingLLM 37.8% · H2O 22.0% — 3 of 5 competitors fail catastrophically (<54%)

Source for eviction-peer row: docs/MLP-QUANTIZATION-C14-C17-RESULTS.md L1916–1982. Original "fraQtl is the ONLY method" framing retracted per source L1978.

06 · Memory lane · runtime GPU memory

Distinct from the weight-compression matrix above: this lane reports runtime GPU memory on Mistral 7B at 32K context.

Artifact	State	Measurement
FP16 baseline @ 32K	measured	13.91 GB weights · 23.12 GB peak inference · 56.88 GB headroom (A100-80GB)
fraQtl-packed (current loader) @ 32K	measured	9.84 GB on disk (30% smaller) · 14.03 GB weights in memory (loader dequantizes to FP16 by design) · coherent generation · 20.4 tok/s

Disk compression is real and shipped: a 9.84 GB Mistral-7B artifact (30% smaller) with coherent generation verified at 20.4 tok/s. Runtime GPU-memory packing is in development and reported here only where measured.
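
The figures in the table can be cross-checked with simple arithmetic. The sketch assumes the "30% smaller" disk claim is measured against the 13.91 GB FP16 weight figure; the page does not say which reference is used, so that pairing is an assumption.

```python
GPU_GB = 80.0             # A100-80GB

fp16_weights_gb = 13.91   # FP16 weights in memory
fp16_peak_gb = 23.12      # FP16 peak inference memory at 32K
packed_disk_gb = 9.84     # fraQtl-packed artifact on disk
packed_in_mem_gb = 14.03  # packed weights after the loader dequantizes to FP16

headroom_gb = GPU_GB - fp16_peak_gb
disk_saving_pct = (1 - packed_disk_gb / fp16_weights_gb) * 100
in_mem_delta_gb = packed_in_mem_gb - fp16_weights_gb

print(f"FP16 headroom:        {headroom_gb:.2f} GB")
print(f"Disk saving vs FP16:  {disk_saving_pct:.1f}%")
print(f"In-memory difference: {in_mem_delta_gb:+.2f} GB")
```

The in-memory difference is close to zero, which is consistent with the current loader dequantizing to FP16: the saving today is on disk, not in GPU memory.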

07 · Boundaries · what's NOT in this matrix

Hold-the-line

GPTQ on Phi-3: not attempted this session. No implicit claim.
AWQ 3/5-bit: AutoAWQ 0.2.9 is 4-bit only; multi-bit pending llm-compressor scoped image (C50D-AWQ-MULTI-BIT.md).
KV cache compression: different substrate. See C44B-KIVI-KVQUANT-BAKEOFF.md, C44E-PYRAMIDKV-BAKEOFF.md.
MoE matched-protocol: Qwen 3.6 35B-A3B has KV-cache numbers; weight-compression matched-bits vs AWQ/GPTQ on MoE is next-session C50d work.
Throughput / latency / memory: infra agent lane, see C51-THROUGHPUT-BAKEOFF.md.
Multi-seed beyond Mistral: per EVAL-PROTOCOL-LOCKED ratio rule, 1-seed is acceptable for ratio comparisons above 1.5× threshold. All reported ratios exceed 1.5×.

08 · Artifacts + commit hashes

Every number in this matrix traces to a committed script + raw JSON on the fraqtl-hf-cache Modal volume.

Scripts, raw JSONs, commit hashes · per row

Row	Script	Raw JSON (fraqtl-hf-cache:fraqtl-results/)	Commit hash(es)
Mistral 3-seed (fraQtl, bnb, AWQ)	C60_golden_mistral_instruct.py	c60_golden_mistralai_Mistral-7B-Instruct-v02_seed{42,123,2024}.json	1209261, 9038743, d3af566
Mistral GPTQ 1-seed	C64_gptq_pinned.py via modal_run_gptq.py	c64_gptq_pinned_mistralai_Mistral-7B-Instruct-v02_seed42.json	17b563f
Llama 3B golden (fraQtl, bnb, AWQ) 1-seed	C60_golden_mistral_instruct.py (env MODEL=…)	c60_golden_meta-llama_Llama-32-3B-Instruct_seed42.json	1209261
Llama 3B GPTQ BROKEN 1-seed	C64_gptq_pinned.py (env MODEL=…)	c64_gptq_pinned_meta-llama_Llama-32-3B-Instruct_seed42.json	cc24b1d
Qwen 3B golden (fraQtl, bnb) 1-seed	C60_golden_mistral_instruct.py (env MODEL=…)	c60_golden_Qwen_Qwen25-3B-Instruct_seed42.json	1209261
Qwen 3B AWQ (Catcher-patched) 1-seed	C60_qwen_awq_patched.py	c60_qwen_awq_patched_seed42.json	19bea7f
Qwen 3B GPTQ 1-seed	C64_gptq_pinned.py (env MODEL=…)	c64_gptq_pinned_Qwen_Qwen25-3B-Instruct_seed42.json	cc24b1d
Phi-3 MHA golden (fraQtl, bnb) 1-seed	C63_phi3_mha_golden.py	c63_phi3_mha_golden_seed42.json	cc24b1d
C44b KIVI/KVQuant 1080-cell (Mistral + Qwen)	C44b_kivi_kvquant_bakeoff.py	c44b_{mistral,qwen3b}_full_seed42.json	1209261
C44e PyramidKV 1080-cell (Mistral + Qwen)	C44e_pyramidkv_bakeoff.py	c44e_pyramidkv_{mistral,qwen3b}_seed42.json	3b5e8a8, c1388b8
C44d Llama 3.1 8B 128K multi-needle NIAH	C44d_llama3_8b_128k_3needle.py	c44d_llama3_8b_128k_3needle_seed42.json (volume)	see docs/C44D-MULTI-NEEDLE-128K.md
C44e-shallow Llama 3.1 8B 128K depth sweep	C44e_llama3_8b_128k_shallow_depth.py	c44e_llama3_8b_128k_shallow_depth_seed42.json (volume)	see docs/C44E-SHALLOW-DEPTH-128K.md
C62 Phi-3 MHA full-stack KV (V-only + V+K INT3+sign, 1-seed)	notebooks/benchmarks/C62_phi3_fullstack.py	c62_phi3_mha_seed42.json (8 configs, partial + full-stack side-by-side)	3a0bff7 · 527afad (matrix mirror)
Memory @ 32K (FP16 + fraQtl-packed)	fraqtl/docs/MEASURED-MEMORY-32K.md	nvidia-smi trace @ A100-80GB	f5db558
PackedLinear scaffold + sanity test	fraqtl/src/fraqtl/packed_linear.py · experiments/packed_linear_sanity.py	0.04% mean-rel error vs nn.Linear	19e27c3

One compression principle. Multiple architectures.

Try the public Qwen 3.6 35B-A3B compressed artifact, or pilot fraQtl on your own model stack.


Receipt-backed: every number traces to a committed script and raw JSON (source data commit 3a0bff7).
Protocol: golden_eval v1 · WikiText-2 test 64×512 · 256+256 prefix+continuation.
Reproduce: github.com/fraqtl · HF: huggingface.co/fraqtl · Questions: samuel@fraqtl.ai