PREPRINT · APRIL 2026

Preserve attention. Compress the rest.

Most KV-cache methods compress by deletion — removing directions, losing signal. fraQtl compresses by precision: every token survives, attention routing is preserved. Proven across five transformer models at matched storage budgets.


contact@fraqtl.ai  ·  FraQtl AI Research

364× MAX MARGIN OVER RANK REDUCTION
+0.18 PPL FOR 75% KV REDUCTION (Mistral 7B)
5 MODELS TESTED · 124M–14B PARAMS
CORE RESULTS

Four findings that change how you compress transformers.

At matched storage budgets, quantization consistently outperforms rank reduction. The gap is not about the choice of basis; it is about the geometry of softmax routing.

01
Quantization beats rank reduction everywhere
In every model and budget tested, INT4 outperforms rank reduction at matched bit cost by 4 to 364 PPL. The margin grows with GQA aggressiveness.
02
Deletion causes discrete routing failures
Rank reduction flips 4.6% of attention routing decisions vs 0.03% for INT4. Bounded noise preserves score ordering. Deletion does not.
03
The basis doesn't matter — the paradigm does
Quantization quality is basis-independent (spread $<0.4$ PPL across all rotations). The advantage is about preserving all dimensions.
04
Joint $K$+$V$ INT4 at 75% reduction costs +0.18 PPL
Both $K$ and $V$ are safely quantizable. Per-channel symmetric INT4 requires no retraining, no special basis, no optimization.
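Per-channel symmetric INT4 of the kind described above can be sketched in a few lines. This is a minimal NumPy illustration, not the fraQtl implementation; the function names and the [-7, 7] level range are assumptions for the sketch.

```python
import numpy as np

def quantize_int4_per_channel(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-channel INT4: one scale per channel (last axis), levels in [-7, 7]."""
    scale = np.abs(x).max(axis=0, keepdims=True) / 7.0  # per-channel scale from the absmax
    scale = np.where(scale == 0, 1.0, scale)            # guard all-zero channels
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)  # toy KV cache: tokens x dims
q, s = quantize_int4_per_channel(kv)
err = np.abs(dequantize(q, s) - kv)
# Rounding error is bounded by half a quantization step, per channel
assert np.all(err <= s / 2 + 1e-6)
```

No retraining and no calibration data are needed for this scheme: the only statistic it uses is the per-channel absolute maximum.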
MECHANISM

Why deletion fails where noise succeeds.

Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.

Rank throws away signal.
Quantization preserves it.

Rank Reduction (dimensions deleted) vs. Quantization (precision reduced). Both panels start at the FP16 baseline, PPL 9.19.
THE FIGURE

Perplexity vs bits/dimension on Mistral 7B.

Rank reduction explodes below 4 bits. Uniform quantization collapses below 3 bits. fraQtl stays flat to 2 bits — the dead zone is where every other method fails.

Legend: Rank Reduction · Uniform Quant (GPTQ) · fraQtl V-Split + Lloyd-Max · FP16 baseline
MECHANISM — LIVE

Routing flips: why deletion is catastrophic.

When rank reduction deletes a KV direction it creates a score perturbation $|\delta| \approx \sigma_{\text{removed}}$. If this exceeds the gap $\Delta = s_{i_1} - s_{i_2}$, attention flips to the wrong token. Quantization keeps $|\delta|$ bounded at $\frac{\sigma}{2^b}$, so the expected damage under the softmax Fisher metric is $3 \times 2^{2b} = 768\times$ smaller at INT4.
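The flip mechanism can be simulated directly. A hypothetical NumPy sketch (toy sizes and a uniform rounding-noise model, not the paper's experimental setup): delete one key direction outright vs. add bounded quantization-style noise, and count how often the argmax over attention scores changes.

```python
import numpy as np

rng = np.random.default_rng(42)
d, n_tokens, trials, b = 64, 32, 2000, 4  # b = 4 -> INT4

flips_del, flips_q = 0, 0
for _ in range(trials):
    q_vec = rng.normal(size=d)
    K = rng.normal(size=(n_tokens, d))
    top = (K @ q_vec).argmax()  # FP16-style routing decision

    # Rank reduction: zero out one direction entirely (score shift ~ sigma_removed)
    K_del = K.copy()
    K_del[:, rng.integers(d)] = 0.0
    flips_del += (K_del @ q_vec).argmax() != top

    # Quantization: bounded noise of at most half a step, step ~ 2*sigma / 2^b
    step = 2.0 * K.std() / 2**b
    K_q = K + rng.uniform(-step / 2, step / 2, size=K.shape)
    flips_q += (K_q @ q_vec).argmax() != top

print(f"deletion flip rate:     {flips_del / trials:.3f}")
print(f"quantization flip rate: {flips_q / trials:.3f}")
```

Under this toy model the deletion flip rate dominates the quantization flip rate by a wide margin, mirroring the 4.6% vs 0.03% gap reported above in direction if not in exact magnitude.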

fraQtl INT4: 94% routing stability ✅ (attention flow preserved; perturbation bounded at $\sigma/2^b$)
Rank reduction k=32: 61% routing stability ❌ (high attention drift; deleted directions create unbounded perturbation)
Panels: FP16 baseline (Token A routes) · Rank Reduction (direction deleted) · Quantization INT4 (precision reduced)
QUANTIZATION VISUALIZED

Bounded noise vs information loss.

Every KV value survives quantization — just rounded to the nearest step. Rank reduction eliminates entire directions. Watch how precision degrades gracefully while deletion destroys structure.
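A toy numerical illustration of this contrast (NumPy; per-tensor scaling and a random matrix, chosen for brevity rather than taken from the paper): every entry's INT4 rounding error stays below half a quantization step, while an SVD truncation at matched storage leaves far larger per-entry errors.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 64))  # toy V-cache: tokens x dims

# INT4: every entry survives, rounded to the nearest of 15 symmetric levels
scale = np.abs(X).max() / 7.0
X_q = np.clip(np.round(X / scale), -7, 7) * scale
max_err_q = np.abs(X_q - X).max()  # bounded by scale / 2

# Rank reduction at matched storage deletes trailing directions outright
# (16 FP16 dims/token = 256 bits = 64 INT4 dims/token)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 16
X_r = U[:, :k] * s[:k] @ Vt[:k]
max_err_r = np.abs(X_r - X).max()

print(f"max per-entry error, INT4:    {max_err_q:.3f} (bound {scale / 2:.3f})")
print(f"max per-entry error, rank-16: {max_err_r:.3f}")
```

The design point this illustrates: quantization error is bounded elementwise by construction, whereas truncation error is bounded only by whatever energy happens to live in the deleted directions.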

Panels: FP16 original signal · INT4 quantized (all dims) · Rank-reduced (dims deleted)
INTERACTIVE

Pick your memory budget. See who wins.

At every storage constraint, fraQtl outperforms rank reduction. Drag the slider: the gap only widens.

Memory budget: 3.6 bits/dimension
MATCHED-BUDGET COMPARISON

Quantization vs rank reduction across all models.

Every row is a real experimental result at matched storage. Filter by model or method. Sort any column.

Columns (sortable): MODEL · ARCH · BUDGET · METHOD · DIMS · PPL · vs FP16 · MARGIN
THEORETICAL RESULT

A perturbation asymmetry formalises the gap.

Under the softmax Fisher metric, projection damage exceeds quantization damage by $3 \times 2^{2b}$ per direction — $768\times$ at INT4.

PROPOSITION 2 — PERTURBATION ASYMMETRY
For direction $u$ with signal $\sigma_u$ under $G = \mathrm{diag}(\alpha) - \alpha\alpha^\top$: $$\mathrm{KL}_{\mathrm{proj}} = \tfrac{1}{2}\,\sigma_u^2 \cdot u^\top G u \qquad \text{vs} \qquad \mathbb{E}[\mathrm{KL}_{\mathrm{quant}}] = \frac{\sigma_u^2}{2 \cdot 3 \cdot 2^{2b}} \cdot u^\top G u$$ $$\text{Ratio:}\quad 3 \times 2^{2b} \quad (768\times \text{ at INT4})$$ The sensitivity $u^\top G u$ cancels: both methods face the same softmax geometry, so the difference is entirely in perturbation magnitude. Rank reduction deletes the direction's contribution to the score gap $\Delta = s_{i_1} - s_{i_2}$; if $|\delta_i| > |\Delta|$, attention flips, a discrete failure. Quantization perturbs by $\mathcal{O}(\sigma_u / 2^b)$, crossing the boundary only when noise exceeds $\Delta$, which is rare at $b \geq 4$.
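The $3 \times 2^{2b}$ factor is just the variance of uniform rounding noise. A quick numerical check under the assumed noise model (quantization error uniform on plus or minus half a step, with step $\Delta = 2\sigma/2^b$):

```python
import numpy as np

rng = np.random.default_rng(0)
b, sigma = 4, 1.0
step = 2 * sigma / 2**b  # quantization step for a +/- sigma-scaled range
delta = rng.uniform(-step / 2, step / 2, size=1_000_000)

# KL_proj  is proportional to (1/2) * sigma^2      (whole direction deleted)
# KL_quant is proportional to (1/2) * E[delta^2]   (bounded rounding noise)
ratio = sigma**2 / (delta**2).mean()
print(f"empirical ratio: {ratio:.1f}  (theory: {3 * 2**(2 * b)})")
```

Since $\mathbb{E}[\delta^2] = \Delta^2/12 = \sigma^2/(3 \cdot 2^{2b})$ for uniform noise, the empirical ratio lands at the theoretical 768 for $b = 4$.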
THE RESEARCH

Independent ML research from FraQtl AI.

This work emerged from a systematic attempt to improve rank reduction, and from the discovery that the paradigm itself was the barrier. After we had exhausted every closed-form metric, perturbation series, and learned correction, the breakthrough came from changing the compression operator entirely.

The result is currently in peer review.

FraQtl AI Research
ML RESEARCH · COMPRESSION · ATTENTION GEOMETRY