BENCHMARKS · PUBLIC ARTIFACTS · WEIGHT MATRIX

Receipts.

Lead receipt: Mistral-7B at 128K in llama.cpp — fraQtl D1 KV below Q8 memory with retrieval preserved. Plus the public Qwen 3.6 35B-A3B compressed artifact (MMLU + ∞Bench + VRAM) and the matched-4-bit weight matrix across MHA, GQA-2, GQA-3, and GQA-4. Every number traces to a committed script and raw JSON, reproducible under golden_eval v1.

Headline receipt: Mistral-7B at 128K context in llama.cpp — fraQtl D1 runs at 13,261 MiB live VRAM while keeping 5/5 needle retrieval. That is 41.5% below the fp16 KV baseline and 14.1% below Q8 KV — and at the same long context Q8 KV drops to 1/5 retrieval and Q4 KV to 0/5.
Public artifact: Qwen 3.6 35B-A3B compressed runs 128K context on 1× A100-80GB at 51.7 GB VRAM, MMLU 82.24% (FP16: 82.40%), ∞Bench passkey 30/30 at 125,315 tokens.
Weight matrix: matched-4-bit results across 4 architecture families using the same core recipe (K_GATE_UP = K_DOWN = 256, INT3 + sign correction), with validation reported per model.
Protocol: GOLDEN_EVAL_V1 · WikiText-2 64×512 · 256+256 prefix+continuation · PPL + KL(FP16 ‖ compressed)
Measured and extrapolated values are labeled separately. Tags: Public artifact · Runtime KV · Research-grade
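
As a reading aid, the 64×512 and 256+256 figures in the protocol can be sketched as a windowing step. This is a minimal sketch, not the golden_eval v1 implementation: it assumes the 64 windows are consecutive, non-overlapping 512-token slices and that each splits into a 256-token prefix (context only) and a 256-token continuation (scored). The name `make_windows` and the integer stand-in for token IDs are hypothetical.

```python
from typing import List, Tuple

NUM_WINDOWS = 64   # "64×512": number of evaluation windows
WINDOW_LEN = 512   # tokens per window
PREFIX_LEN = 256   # "256+256": context tokens (assumed unscored)
CONT_LEN = 256     # continuation tokens (assumed scored)


def make_windows(token_ids: List[int]) -> List[Tuple[List[int], List[int]]]:
    """Cut a token stream into (prefix, continuation) pairs.

    Assumption: windows are consecutive and non-overlapping.
    """
    needed = NUM_WINDOWS * WINDOW_LEN
    if len(token_ids) < needed:
        raise ValueError(f"need at least {needed} tokens, got {len(token_ids)}")
    windows = []
    for w in range(NUM_WINDOWS):
        start = w * WINDOW_LEN
        chunk = token_ids[start:start + WINDOW_LEN]
        windows.append((chunk[:PREFIX_LEN], chunk[PREFIX_LEN:]))
    return windows


# Stand-in token stream; a real run would tokenize the WikiText-2 test split.
fake_tokens = list(range(NUM_WINDOWS * WINDOW_LEN))
windows = make_windows(fake_tokens)
scored_tokens = sum(len(cont) for _, cont in windows)
print(len(windows), scored_tokens)
```

Under these assumptions each model is scored on 64 × 256 continuation tokens, with the prefix serving only as context.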

D1 Long-context KV · Mistral-7B 128K · llama.cpp

Live VRAM and needle-in-a-haystack retrieval at 128K context, measured in llama.cpp. Same model, same prompt set — only the KV-cache representation changes. Standard low-bit KV (Q8, Q4) reduces memory but loses retrieval at this context length; fraQtl D1 reduces memory below Q8 while retrieval stays intact.

KV-cache config | Live VRAM @ 128K | NIAH retrieval | VRAM vs fp16 | Retrieval at long context
fp16 KV (baseline) | 22,657 MiB | 5 / 5 | n/a (baseline) | full retrieval, full memory
Q8 KV | 15,437 MiB | 1 / 5 | −31.9% | retrieval breaks
Q4 KV | 11,287 MiB | 0 / 5 | −50.2% | retrieval gone
fraQtl D1 | 13,261 MiB | 5 / 5 | −41.5% | below Q8 memory, retrieval preserved

NIAH = needle-in-a-haystack retrieval at 128K context (5 needles). Live VRAM is the measured resident footprint reported by llama.cpp for the KV-cache configuration shown.
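
For scale, the fp16 KV cache on its own can be estimated from model shape. This is a minimal sketch, assuming Mistral-7B's published shape (32 layers, 8 KV heads under GQA, head dimension 128) and 128K = 131,072 tokens; the helper name `kv_cache_bytes` is hypothetical. It covers only the K and V tensors, so it is not expected to match the live-VRAM column, which also includes weights and runtime buffers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: float) -> float:
    """Size of the K and V tensors for a single sequence."""
    # Factor 2 = one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed Mistral-7B shape; fp16 stores 2 bytes per value.
fp16_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                            seq_len=131_072, bytes_per_value=2)
fp16_mib = fp16_bytes / 2**20
print(f"fp16 KV cache at 128K: {fp16_mib:,.0f} MiB")
```

The estimate shows why the KV representation dominates the footprint at this context length: the cache term grows linearly with sequence length while the weights stay fixed.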

fraQtl D1 vs Q8 KV: 14.1% lower live VRAM (13,261 vs 15,437 MiB) and 5/5 vs 1/5 retrieval — lower memory and intact retrieval at the same time.

The clean claim: at 128K, fraQtl D1 holds full 5/5 retrieval at 13,261 MiB — 41.5% under fp16 KV and 14.1% under Q8 KV. Standard low-bit KV gives up retrieval to reach lower memory; fraQtl reaches lower memory and keeps retrieval. Importance-aware compression, not uniform low-bit quantization.
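
The percentages quoted above follow directly from the live-VRAM column. A short sketch that recomputes them from the table values (the helper name `pct_below` is hypothetical):

```python
# Live VRAM at 128K in MiB, copied from the table above.
LIVE_VRAM_MIB = {
    "fp16 KV": 22_657,
    "Q8 KV": 15_437,
    "Q4 KV": 11_287,
    "fraQtl D1": 13_261,
}


def pct_below(value: float, reference: float) -> float:
    """Percent by which `value` sits below `reference`, to one decimal."""
    return round(100.0 * (1.0 - value / reference), 1)


vs_fp16 = {name: pct_below(mib, LIVE_VRAM_MIB["fp16 KV"])
           for name, mib in LIVE_VRAM_MIB.items()}
d1_vs_q8 = pct_below(LIVE_VRAM_MIB["fraQtl D1"], LIVE_VRAM_MIB["Q8 KV"])

for name, pct in vs_fp16.items():
    print(f"{name}: {pct}% below fp16 KV")
print(f"fraQtl D1: {d1_vs_q8}% below Q8 KV")
```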

00 Public artifact · Qwen 3.6 35B-A3B

Live on Hugging Face. Loads through the standard Transformers workflow. 23.8 GB on disk. MoE: 35B params, 3B active.

Metric | fraQtl compressed | FP16 reference | Δ | Notes
MMLU (5-shot) | 82.24% | 82.40% | −0.16 pp | 14,042 questions, 57 subjects
∞Bench passkey | 30 / 30 | 30 / 30 | tied | at 125,315 tokens
VRAM @ 16K context | 25.6 GB | ~71 GB | −45 GB | both fit on 1× A100-80GB
VRAM @ 64K context | 36.8 GB | 82.86 GB → OOM | FP16 needs 2× A100 | measured OOM on 80 GB ceiling
VRAM @ 128K context | 51.7 GB | 85+ GB | FP16 needs 2× A100 | 28 GB headroom on 1× A100
Disk size | 23.8 GB | ~70 GB | ~3× smaller | safetensors, full artifact

VRAM at 16K and 128K for FP16 are model + KV-cache extrapolations — conservative lower bounds. 64K FP16 = 82.86 GB measured (OOMs on 80 GB ceiling).

Weight-compressed artifact. Runtime KV-cache compression is a separate early-access layer and is not stacked on top of this artifact in these numbers.

Reproducible: huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed · loads through the standard Transformers workflow
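
A minimal sketch of what "the standard Transformers workflow" typically looks like for this repository. The repo ID is taken from the URL above; everything else is an assumption: that no extra loading arguments are needed, that `device_map="auto"` is appropriate (it requires the `accelerate` package), and that the prompt and helper name `load_and_generate` are illustrative only. Check the model card for the authoritative instructions.

```python
REPO_ID = "fraQtl/Qwen3.6-35B-A3B-compressed"  # from the URL above


def load_and_generate(prompt: str, max_new_tokens: int = 32) -> str:
    """Load the artifact and generate a short completion.

    Imports are local so this file can be read or imported without
    transformers installed. Calling the function downloads the full
    artifact (23.8 GB on disk per the table above).
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID,
        torch_dtype="auto",  # assumption: use the dtype stored in the checkpoint
        device_map="auto",   # assumption: let accelerate place the weights
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage would be a single call such as `load_and_generate("The capital of France is")` on a machine with enough GPU memory for the context length in use.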

01 The weight matrix

All Δ PPL and KL are against each row's own FP16 baseline (same eval set). Single-seed unless marked. 3-seed Mistral row is mean ± std.

Model / Architecture | FP16 PPL | bnb NF4 (Δ / KL / b/w) | AWQ 4-bit (Δ / KL / b/w) | GPTQ 4-bit (Δ / KL / b/w) | fraQtl INT3+sign (Δ / KL / b/w)
Mistral-7B-Instruct-v0.2 (GQA-4) | 6.6068 | +0.1430 / 0.0274 / 4.50 (3-seed) | +0.1590 ±0.0001 / 0.0293 / 4.12 (3-seed) | +0.1721 / 0.0436 / 4.12 | +0.0504 ±0.0108 / 0.0165 / 3.62 (3-seed)
Llama-3.2-3B-Instruct (GQA-3) | 12.3720 | +0.7445 / 0.0644 / 4.50 | +0.8015 / 0.0605 / 4.12 | BROKEN [1] | +0.4279 / 0.0254 / 3.86
Qwen2.5-3B-Instruct (GQA-2) | 8.3597 | +0.5091 / 0.0843 / 4.50 | +0.5865 / 0.0775 / 4.12 [2] | +0.5945 / 0.0991 / 4.12 | +0.2241 / 0.0362 / 4.18
Phi-3-mini-4k-instruct (TRUE MHA) | 6.5048 | +0.5873 / 0.0965 / 4.50 | FAILED [3] | n/a [4] | +0.2061 / 0.0466 / 3.86

1 gptqmodel 1.9.0 predates Llama 3 (released 2024-04): library-version failure, not a GPTQ-method failure. Newer gptqmodel 2.x has its own PyPI-drift instability (blocked original C61). Excluded from ratio/scoreboard math.

2 AWQ on Qwen2 needed a nn.Module.__getattr__ monkey-patch inside awq.quantize() to forward Catcher's missing attention_type attribute. Script: notebooks/benchmarks/C60_qwen_awq_patched.py.

3 AWQ on Phi-3 hits a DIFFERENT AutoAWQ bug (KeyError: 'type') than Qwen2's. Not chased per sunk-cost rule; deferred to llm-compressor sprint. Script: C63_phi3_mha_golden.py.

4 GPTQ not attempted on Phi-3 this session (focus: MHA universality via bnb/fraQtl).
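
To make the Δ PPL and KL columns concrete, here is a minimal pure-Python sketch of both metrics computed from per-position logits. It is not the golden_eval v1 code: it assumes natural-log units, a mean over scored positions, and KL taken in the direction KL(FP16 ‖ compressed) as stated in the protocol line. The toy logits and function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(logits_per_pos: Sequence[Sequence[float]],
               targets: Sequence[int]) -> float:
    """exp(mean negative log-likelihood of the target tokens)."""
    nll = 0.0
    for logits, target in zip(logits_per_pos, targets):
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(targets))


def mean_kl(ref_logits: Sequence[Sequence[float]],
            cmp_logits: Sequence[Sequence[float]]) -> float:
    """Mean over positions of KL(P_ref || P_cmp), in nats."""
    total = 0.0
    for ref, cmp_ in zip(ref_logits, cmp_logits):
        lp, lq = log_softmax(ref), log_softmax(cmp_)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(ref_logits)


# Toy example: 2 scored positions, vocabulary of 3 tokens.
fp16_logits = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.3]]
comp_logits = [[1.9, 0.6, -1.0], [0.1, 1.4, 0.3]]
targets = [0, 1]

delta_ppl = perplexity(comp_logits, targets) - perplexity(fp16_logits, targets)
kl = mean_kl(fp16_logits, comp_logits)
print(f"delta PPL = {delta_ppl:+.4f}, KL = {kl:.5f}")
```

Δ PPL only sees the probability of the reference token, while KL compares the full output distributions, which is why the two columns can rank methods differently.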

fraQtl ratios vs each peer

Ratios are stable across architectures. The MHA result (Phi-3, 2.85× vs bnb) lands in the same band as the GQA-4 result (Mistral, 2.84× vs bnb).

Model | Δ vs bnb NF4 | Δ vs AWQ 4-bit | Δ vs GPTQ 4-bit | KL vs bnb | KL vs AWQ | KL vs GPTQ
Mistral 7B Instruct | 2.84× tighter | 3.16× tighter | 3.41× tighter | 1.66× | 1.78× | 2.64×
Llama 3.2 3B Instruct | 1.74× tighter | 1.87× tighter | n/a (GPTQ broken) | 2.54× | 2.39× | n/a
Qwen 2.5 3B Instruct | 2.27× tighter | 2.62× tighter | 2.65× tighter | 2.33× | 2.14× | 2.74×
Phi-3-mini-4k (MHA) | 2.85× tighter | n/a (AWQ failed) | n/a (not attempted) | 2.07× | n/a | n/a
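
Each ratio is the peer's value divided by fraQtl's value on the same model. The sketch below recomputes them from the rounded weight-matrix entries; results agree with the published ratios to within about 0.01, and the small last-digit differences are presumably because the published ratios were computed from unrounded values (an assumption). Dictionary keys and function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Pair = Optional[Tuple[float, float]]  # (delta PPL, KL); None = no usable number

# Values copied from the weight matrix above.
MATRIX: Dict[str, Dict[str, Pair]] = {
    "Mistral": {"bnb": (0.1430, 0.0274), "awq": (0.1590, 0.0293),
                "gptq": (0.1721, 0.0436), "fraqtl": (0.0504, 0.0165)},
    "Llama": {"bnb": (0.7445, 0.0644), "awq": (0.8015, 0.0605),
              "gptq": None, "fraqtl": (0.4279, 0.0254)},
    "Qwen": {"bnb": (0.5091, 0.0843), "awq": (0.5865, 0.0775),
             "gptq": (0.5945, 0.0991), "fraqtl": (0.2241, 0.0362)},
    "Phi-3": {"bnb": (0.5873, 0.0965), "awq": None,
              "gptq": None, "fraqtl": (0.2061, 0.0466)},
}


def ratios(model: str) -> Dict[str, Pair]:
    """Return (delta-PPL ratio, KL ratio) of each peer over fraQtl."""
    row = MATRIX[model]
    f_dppl, f_kl = row["fraqtl"]
    out: Dict[str, Pair] = {}
    for peer in ("bnb", "awq", "gptq"):
        if row[peer] is None:
            out[peer] = None  # peer did not produce a usable number
        else:
            p_dppl, p_kl = row[peer]
            out[peer] = (p_dppl / f_dppl, p_kl / f_kl)
    return out


for model in MATRIX:
    for peer, pair in ratios(model).items():
        text = "n/a" if pair is None else f"{pair[0]:.2f}x dPPL, {pair[1]:.2f}x KL"
        print(f"{model} vs {peer}: {text}")
```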

Scoreboard — honest counts

BROKEN peer results (Llama GPTQ) and library failures (AWQ on Phi-3) are NOT counted as fraQtl wins — they're counted as "peer didn't run."

Peer | Attempted | Usable | fraQtl wins (matched)
bnb NF4 | 4/4 | 4/4 | 4/4
AWQ 4-bit | 4/4 | 3/4 (Qwen needed patch, Phi-3 failed) | 3/3
GPTQ 4-bit | 3/4 (Phi-3 not attempted) | 2/3 (Llama broken) | 2/2
Total: 9/9 matched-bits wins where the peer produced a usable number, across 4 architectures. Same core recipe with per-model validation; no architecture-specific algorithm changes.
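
The counting rule can be written down explicitly. This sketch assumes a "win" means a lower Δ PPL than the peer on the same model, and counts wins only over runs where the peer produced a usable number; the `Status` enum and the data layout are hypothetical, while the values come from the weight matrix.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Status(Enum):
    USABLE = "usable"
    BROKEN = "broken"                # ran, result unusable (Llama GPTQ)
    FAILED = "failed"                # library error (AWQ on Phi-3)
    NOT_ATTEMPTED = "not attempted"  # GPTQ on Phi-3


U, B, F, N = Status.USABLE, Status.BROKEN, Status.FAILED, Status.NOT_ATTEMPTED

FRAQTL_DPPL = {"Mistral": 0.0504, "Llama": 0.4279, "Qwen": 0.2241, "Phi-3": 0.2061}

PEERS: Dict[str, Dict[str, Tuple[Status, Optional[float]]]] = {
    "bnb NF4": {"Mistral": (U, 0.1430), "Llama": (U, 0.7445),
                "Qwen": (U, 0.5091), "Phi-3": (U, 0.5873)},
    "AWQ 4-bit": {"Mistral": (U, 0.1590), "Llama": (U, 0.8015),
                  "Qwen": (U, 0.5865), "Phi-3": (F, None)},
    "GPTQ 4-bit": {"Mistral": (U, 0.1721), "Llama": (B, None),
                   "Qwen": (U, 0.5945), "Phi-3": (N, None)},
}


def scoreboard(peer: str) -> Tuple[int, int, int]:
    """Return (attempted, usable, fraQtl wins) for one peer."""
    runs = PEERS[peer]
    attempted = sum(1 for status, _ in runs.values() if status is not N)
    usable = [(m, d) for m, (status, d) in runs.items() if status is U]
    wins = sum(1 for m, d in usable if FRAQTL_DPPL[m] < d)
    return attempted, len(usable), wins


totals = {peer: scoreboard(peer) for peer in PEERS}
total_usable = sum(t[1] for t in totals.values())
total_wins = sum(t[2] for t in totals.values())
print(totals, f"{total_wins}/{total_usable}")
```

Broken, failed, and unattempted runs drop out of the denominator rather than being scored as wins, which is what keeps the total at 9/9 instead of 12/12.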

02 fraQtl config key

fraQtl has TWO configs in the KV lane. Don't conflate them. Read before the KV tables below.

PARTIAL STACK · EXPERIMENT-GRADE
fraQtl V-only 4b
Eigenbasis only + INT4 uniform quantization. No sign correction.
C44b (1080-cell NIAH) · C44c sanity · C44d (128K multi-needle) · C44e (128K depth-sweep)

FULL STACK · MIXED-PRECISION LOW-BIT COMPRESSION
fraQtl V-only INT3 + sign
Eigenbasis + k=16 FP16 protect + LM-INT3 on sacrifice dims + sign correction. 3-seed-validated, pinned-eval.
C38a cross-arch 3-seed · C25q Qwen 3.6 35B-A3B MoE

How to read these: headline numbers use the full stack (INT3+sign) from C38a. Long-context retrieval comparisons use the partial stack (V-only 4b) from the C44 family. Both are reported and labeled per row.
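
One way to keep the two configs from being conflated in downstream analysis is to encode the key as data. A minimal sketch: the field names and the `config_for_run` helper are hypothetical, and the values and run IDs are copied from the two cards above. It records the configuration labels only and does not implement the compression itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class KVConfig:
    name: str
    stack: str                 # "partial" or "full"
    quantizer: str
    protect_k: Optional[int]   # dims kept in FP16, if any
    sign_correction: bool
    runs: Tuple[str, ...]      # run IDs listed on the card


PARTIAL = KVConfig(
    name="fraQtl V-only 4b", stack="partial", quantizer="INT4 uniform",
    protect_k=None, sign_correction=False,
    runs=("C44b", "C44c", "C44d", "C44e"),
)
FULL = KVConfig(
    name="fraQtl V-only INT3 + sign", stack="full", quantizer="LM-INT3",
    protect_k=16, sign_correction=True,
    runs=("C38a", "C25q"),
)


def config_for_run(run_id: str) -> KVConfig:
    """Look up which config a run ID belongs to, per the cards above."""
    for cfg in (PARTIAL, FULL):
        if run_id in cfg.runs:
            return cfg
    raise KeyError(f"run {run_id!r} is not listed on either card")


print(config_for_run("C44d").name, "|", config_for_run("C38a").name)
```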

03 KV cache substrate

The matrix above compresses weights. fraQtl's V-theorem + sign-correction mechanism also applies to KV cache (runtime-dynamic tensor). Cross-architecture KV results follow — per-needle NIAH and 3-seed PPL/KL where measured.

Cross-arch NIAH at matched 4-bit vs KIVI / KVQuant / eviction peers

Config = fraQtl V-only 4b (partial stack, eigenbasis + INT4 uniform, no sign correction). Full-stack INT3+sign numbers in the PPL/KL table below.

Model / Arch | Protocol | fraQtl V-only 4b | fraQtl V+K 4b | KIVI-4 | KVQuant-4 | KIVI-2 | PyramidKV 0.7
Mistral 7B Instruct (GQA-4) | C44b · 1080-cell NIAH 4K–31K | 94.4% | 93.3% | 93.3% | 93.3% | 37.8% | 86.1%
Qwen 2.5 3B Instruct (GQA-2) | C44b · 1080-cell NIAH 4K–31K | 99.4% | 79.4% [5] | 98.9% | 97.8% / 100% (sink0) | 1.7% | 69.4%
Llama 3.1 8B Instruct (GQA-8) | C44d · 128K multi-needle | 93.3% | n/a | 93.3% | n/a | 0.0% | n/a
Llama 3.1 8B Instruct (GQA-8) | C44e · 128K depth-sweep [6] | 100% | n/a | 100% | n/a | 0.0% | n/a

5 fraQtl V+K on Qwen 3B GQA-2 used Mistral's blind k_protect — per-model calibration pending (see EVAL-PROTOCOL-LOCKED). Not a fraQtl limitation claim.

6 notebooks/benchmarks/C44e_llama3_8b_128k_shallow_depth.py (separate from C44e_pyramidkv_bakeoff.py).

The partial-stack V-only 4b configuration already ties KIVI-4 on this retrieval grid; the full-stack INT3+sign numbers in the table below extend the matched-bits margin further.

Cross-arch KV PPL/KL 3-seed pinned

The 3-seed full-stack rows (C38a) are the headline numbers. Full-stack = INT3 + sign correction.

Model / Arch | Config | Δ PPL (3-seed) | KL (3-seed) | NIAH (3-seed) | Source

Partial stack · C25 family
Mistral 7B Instruct (GQA-4) | V-only k=16 LM-INT3 | +0.027 ±0.005 | 0.00381 ±0.00001 | | C25
Mistral 7B Instruct (GQA-4) | V+K k=16 LM-INT3 | +0.062 ±0.001 | 0.00751 ±0.00013 | | C25

Full stack · INT3 + sign · C38a (3-seed)
Mistral 7B Instruct (GQA-4) | V-only k=16 INT3+sign | +0.0015 ±0.0044 | 0.00317 | | C38a v2
Mistral 7B Instruct (GQA-4) | K k=16 INT3+sign | +0.0070 ±0.0043 | | | C38a
Llama 3.2 3B Instruct (GQA-3) | V k=8 INT3+sign | +0.0181 ±0.015 | 0.00308 | 99.4% (179/180) | C38a
Llama 3.2 3B Instruct (GQA-3) | K k=16 INT3+sign | +0.0221 ±0.0079 | 0.00312 | 100% (180/180) | C38a
Qwen 2.5 3B Instruct (GQA-2) | V k=8 INT3+sign | +0.0542 ±0.0116 | 0.00362 | 51.7% [7] | C38a
Phi-3-mini-128k-instruct (TRUE MHA) | V-only k=16 INT3+sign [9] | +0.0002 (1-seed) | 0.00122 | 93.3% (56/60) | C62 2026-04-21
Phi-3-mini-128k-instruct (TRUE MHA) | V+K k=16 INT3+sign [9] | +0.0073 (1-seed) | 0.00383 | 95.0% (57/60, ties FP16) | C62 2026-04-21

Partial stack · MoE hybrid attention · C25q
Qwen 3.6 35B-A3B (MoE hybrid) | V-only k=16 INT3 | +0.045 ±0.011 | 0.0183 ±0.0023 | | C25q
Qwen 3.6 35B-A3B (MoE hybrid) | V+K k=16 INT3 | +0.166 ±0.049 | 0.0221 ±0.0015 | | C25q

7 Qwen 2.5 3B V NIAH: 3-seed mean 51.7%, FP16 baseline 58.3% — small-model short-context NIAH has low ceiling. KLD is 1.5× tighter than INT4 uniform; honest mixed result on GQA-2 V cache, not a clean NIAH win. The sign-correction paradigm's KLD advantage is the cross-arch invariant; PPL/NIAH narrow on GQA-2.

9 Phi-3 MHA rows are 1-seed (C62 patched run, source commit 3a0bff7). 1-seed is reported for the ratio comparisons here — 140× tighter Δ PPL vs KIVI-4 and 15.5× vs KVQuant-4; a 3-seed re-run is the basis for any absolute-delta public citation. Raw JSON verified against run output.

Full-stack carries the cross-architecture story onto MHA. Partial-stack on MHA KV previously lost to KIVI-4 (73.3% vs 91.7%); full-stack V-only INT3+sign flips it — +0.0002 Δ PPL (140× tighter than KIVI-4), 93.3% NIAH. V+K ties FP16 at 95.0%.
Caveat: the Phi-3 MHA V and V+K rows are 1-seed; a 3-seed re-run is the basis for any absolute-delta public citation. The ratio comparisons above hold at 1-seed.

04 KIVI-2 long-context retrieval failure mode

Combined signal from C44b + C44c sanity + C44d + C44e shallow-depth. Every context length. Every needle type. Every depth position.

Context | Architecture | Grid | KIVI-2 retention
4K–31K | Mistral 7B Instruct (GQA-4) | 1080 cells · 3 needles × 4 ctx × 5 depths × 3 trials | 37.8%
4K–31K | Qwen 2.5 3B Instruct (GQA-2) | 1080 cells | 1.7%
128K | Llama 3.1 8B Instruct (GQA-8) | 15 cells · 3 needles × 5 trials @ depth 50 (C44d) | 0.0%
128K | Llama 3.1 8B Instruct (GQA-8) | 9 cells · technical_password × 3 depths × 3 trials (C44e) | 0.0%
Pattern: KIVI-2's per-token K quantization collapses at every context length, needle type, and depth position tested. fraQtl V-only 4-bit ties or beats KIVI-4 on every grid reported in §03 (Mistral, Qwen, and both Llama 3.1 128K protocols); the Phi-3 MHA partial-stack exception is reported in §05. fraQtl's 2-bit-regime comparison point, V-only k=16 INT3, sits at 3.5× compression, a different Pareto point from KIVI-2's 8×.
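
The grid sizes in the table are products of their listed factors, and retention is hits divided by cells. A small sketch that enumerates the grids (function names are hypothetical). The factors listed for the 4K–31K row multiply to 180 cells, which matches the 180-cell denominators in the §03 NIAH column; how the 1080-cell label aggregates over these is not specified in this section, so the sketch does not model it.

```python
from itertools import product
from typing import List, Tuple


def grid(needles: int, contexts: int, depths: int, trials: int) -> List[Tuple[int, ...]]:
    """Enumerate every (needle, context, depth, trial) cell."""
    return list(product(range(needles), range(contexts),
                        range(depths), range(trials)))


def retention(hits: int, cells: int) -> float:
    """Percent of cells where the needle was retrieved, to one decimal."""
    return round(100.0 * hits / cells, 1)


short_ctx = grid(needles=3, contexts=4, depths=5, trials=3)  # 4K–31K rows
c44d = grid(needles=3, contexts=1, depths=1, trials=5)       # 128K, depth 50
c44e = grid(needles=1, contexts=1, depths=3, trials=3)       # 128K depth sweep

print(len(short_ctx), len(c44d), len(c44e))
print(retention(179, 180), retention(0, 15))
```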

05 KV substrate · scoreboard

fraQtl V-only 4b (partial stack) vs peer KV families. The full-stack INT3+sign numbers in §03 extend the margin further.

Peer (KV) | Architectures tested | fraQtl partial-stack V-only 4b result
KIVI-2 (per-token K 2-bit) | Mistral GQA-4 / Qwen GQA-2 / Llama 3.1 GQA-8 @128K × 2 protocols / Phi-3 MHA | 5/5 catastrophic margins (fraQtl V-only ≥73% vs KIVI-2 ≤38%)
KIVI-4 (per-token K 4-bit) | Mistral / Qwen / Llama 3.1 @128K × 2 / Phi-3 MHA | 4/5 non-losses: Mistral, Qwen, and 2× Llama 128K. Phi-3 MHA partial-stack 73.3% vs KIVI-4 91.7% LOSES (full-stack V-only is 140× tighter Δ PPL vs KIVI-4; see §03)
KVQuant-4 | Mistral / Qwen / Phi-3 MHA | 2/3 ties or wins on GQA (Mistral 94.4>93.3 / Qwen 99.4>97.8 / Qwen sink0 100). Phi-3 MHA partial-stack 73.3% vs KVQuant-4 93.3% LOSES (277× Δ PPL gap); full-stack V-only flips it at 15.5× tighter Δ PPL, V+K ties NIAH at 95.0%; see §03
PyramidKV 0.7 | Mistral / Qwen | 2/2 wins (+8.3 pp / +30.0 pp aggregate over PyramidKV)
SnapKV / H2O / StreamingLLM / TOVA / ExpectedAttention (C44 original, Mistral only) | Mistral 7B Instruct | fraQtl 100% · TOVA 97.8% (near-tie) · SnapKV 94.1% · ExpectedAttention 53.3% · StreamingLLM 37.8% · H2O 22.0%; 3 of 5 competitors fail catastrophically (<54%)

Source for eviction-peer row: docs/MLP-QUANTIZATION-C14-C17-RESULTS.md L1916–1982. Original "fraQtl is the ONLY method" framing retracted per source L1978.

06 Memory lane · runtime GPU memory

Distinct from the weight-compression-ratio matrix above. Runtime GPU memory story on Mistral 7B at 32K context.

Artifact | State | Measurement
FP16 baseline @ 32K | measured | 13.91 GB weights · 23.12 GB peak inference · 56.88 GB headroom (A100-80GB)
fraQtl-packed (current loader) @ 32K | measured | 9.84 GB on disk (30% smaller) · 14.03 GB weights in memory (loader dequantizes to FP16 by design) · coherent generation · 20.4 tok/s
Disk compression is real and shipped: a 9.84 GB Mistral-7B artifact (30% smaller) with coherent generation verified at 20.4 tok/s. Runtime GPU-memory packing is in development and reported here only where measured.

07 Boundaries · what's NOT in this matrix

Hold-the-line
  • GPTQ on Phi-3: not attempted this session. No implicit claim.
  • AWQ 3/5-bit: AutoAWQ 0.2.9 is 4-bit only; multi-bit pending llm-compressor scoped image (C50D-AWQ-MULTI-BIT.md).
  • KV cache compression: different substrate. See C44B-KIVI-KVQUANT-BAKEOFF.md, C44E-PYRAMIDKV-BAKEOFF.md.
  • MoE matched-protocol: Qwen 3.6 35B-A3B has KV-cache numbers; weight-compression matched-bits vs AWQ/GPTQ on MoE is next-session C50d work.
  • Throughput / latency / memory: infra agent lane, see C51-THROUGHPUT-BAKEOFF.md.
  • Multi-seed beyond Mistral: per EVAL-PROTOCOL-LOCKED ratio rule, 1-seed is acceptable for ratio comparisons above 1.5× threshold. All reported ratios exceed 1.5×.
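
The ratio rule in the last bullet can be checked mechanically against the published Δ PPL ratios from §01. A minimal sketch, assuming "above 1.5×" means strictly greater than 1.5; the names are hypothetical and the ratio values are copied from the ratio table.

```python
RATIO_THRESHOLD = 1.5  # per the EVAL-PROTOCOL-LOCKED ratio rule cited above

# Published delta-PPL ratios (peer / fraQtl) where the peer was usable.
PUBLISHED_DPPL_RATIOS = {
    ("Mistral", "bnb NF4"): 2.84, ("Mistral", "AWQ"): 3.16, ("Mistral", "GPTQ"): 3.41,
    ("Llama", "bnb NF4"): 1.74, ("Llama", "AWQ"): 1.87,
    ("Qwen", "bnb NF4"): 2.27, ("Qwen", "AWQ"): 2.62, ("Qwen", "GPTQ"): 2.65,
    ("Phi-3", "bnb NF4"): 2.85,
}


def one_seed_acceptable(ratio: float, threshold: float = RATIO_THRESHOLD) -> bool:
    """True if a 1-seed ratio clears the threshold for ratio comparisons."""
    return ratio > threshold


below = [key for key, r in PUBLISHED_DPPL_RATIOS.items()
         if not one_seed_acceptable(r)]
smallest = min(PUBLISHED_DPPL_RATIOS.values())
print(f"smallest ratio: {smallest}x, below threshold: {below}")
```

The tightest margin is the Llama vs bnb NF4 comparison, which is the one to watch if a multi-seed re-run shifts the deltas.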

08 Artifacts + commit hashes

Every number in this matrix traces to a committed script + raw JSON on the fraqtl-hf-cache Modal volume.

Scripts, raw JSONs, commit hashes · per row
Row | Script | Raw JSON (fraqtl-hf-cache:fraqtl-results/) | Commit hash(es)
Mistral 3-seed (fraQtl, bnb, AWQ) | C60_golden_mistral_instruct.py | c60_golden_mistralai_Mistral-7B-Instruct-v02_seed{42,123,2024}.json | 1209261, 9038743, d3af566
Mistral GPTQ 1-seed | C64_gptq_pinned.py via modal_run_gptq.py | c64_gptq_pinned_mistralai_Mistral-7B-Instruct-v02_seed42.json | 17b563f
Llama 3B golden (fraQtl, bnb, AWQ) 1-seed | C60_golden_mistral_instruct.py (env MODEL=…) | c60_golden_meta-llama_Llama-32-3B-Instruct_seed42.json | 1209261
Llama 3B GPTQ BROKEN 1-seed | C64_gptq_pinned.py (env MODEL=…) | c64_gptq_pinned_meta-llama_Llama-32-3B-Instruct_seed42.json | cc24b1d
Qwen 3B golden (fraQtl, bnb) 1-seed | C60_golden_mistral_instruct.py (env MODEL=…) | c60_golden_Qwen_Qwen25-3B-Instruct_seed42.json | 1209261
Qwen 3B AWQ (Catcher-patched) 1-seed | C60_qwen_awq_patched.py | c60_qwen_awq_patched_seed42.json | 19bea7f
Qwen 3B GPTQ 1-seed | C64_gptq_pinned.py (env MODEL=…) | c64_gptq_pinned_Qwen_Qwen25-3B-Instruct_seed42.json | cc24b1d
Phi-3 MHA golden (fraQtl, bnb) 1-seed | C63_phi3_mha_golden.py | c63_phi3_mha_golden_seed42.json | cc24b1d
C44b KIVI/KVQuant 1080-cell (Mistral + Qwen) | C44b_kivi_kvquant_bakeoff.py | c44b_{mistral,qwen3b}_full_seed42.json | 1209261
C44e PyramidKV 1080-cell (Mistral + Qwen) | C44e_pyramidkv_bakeoff.py | c44e_pyramidkv_{mistral,qwen3b}_seed42.json | 3b5e8a8, c1388b8
C44d Llama 3.1 8B 128K multi-needle NIAH | C44d_llama3_8b_128k_3needle.py | c44d_llama3_8b_128k_3needle_seed42.json (volume) | see docs/C44D-MULTI-NEEDLE-128K.md
C44e-shallow Llama 3.1 8B 128K depth sweep | C44e_llama3_8b_128k_shallow_depth.py | c44e_llama3_8b_128k_shallow_depth_seed42.json (volume) | see docs/C44E-SHALLOW-DEPTH-128K.md
C62 Phi-3 MHA full-stack KV (V-only + V+K INT3+sign, 1-seed) | notebooks/benchmarks/C62_phi3_fullstack.py | c62_phi3_mha_seed42.json (8 configs, partial + full-stack side-by-side) | 3a0bff7 · 527afad (matrix mirror)
Memory @ 32K (FP16 + fraQtl-packed) | fraqtl/docs/MEASURED-MEMORY-32K.md | nvidia-smi trace @ A100-80GB | f5db558
PackedLinear scaffold + sanity test | fraqtl/src/fraqtl/packed_linear.py · experiments/packed_linear_sanity.py | 0.04% mean-rel error vs nn.Linear | 19e27c3
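
Several raw-JSON entries use brace shorthand such as `seed{42,123,2024}` to stand for multiple files. A small sketch that expands that shorthand into concrete file names; the `expand` helper is hypothetical, and it assumes the braces follow shell-style comma lists with no nesting.

```python
import re
from typing import List

_BRACE = re.compile(r"\{([^{}]*)\}")


def expand(pattern: str) -> List[str]:
    """Expand the first {a,b,c} group, then recurse on the remainder."""
    match = _BRACE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    names: List[str] = []
    for option in match.group(1).split(","):
        names.extend(expand(head + option + tail))
    return names


mistral_seeds = expand(
    "c60_golden_mistralai_Mistral-7B-Instruct-v02_seed{42,123,2024}.json"
)
c44b_files = expand("c44b_{mistral,qwen3b}_full_seed42.json")

for name in mistral_seeds + c44b_files:
    print(name)
```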

One compression principle. Multiple architectures.

Try the public Qwen 3.6 35B-A3B compressed artifact, or pilot fraQtl on your own model stack.

Receipt-backed: every number traces to a committed script and raw JSON (source data commit 3a0bff7).
Protocol: golden_eval v1 · WikiText-2 test 64×512 · 256+256 prefix+continuation.
Reproduce: github.com/fraqtl · HF: huggingface.co/fraqtl · Questions: samuel@fraqtl.ai