RESEARCH · THE MATH BEHIND IT

The math behind why fraQtl preserves model behavior.

Production compression with a closed-form basis. The pullback metric over attention values picks the directions that downstream loss depends on; everything else is quantized with bounded distortion. Empirically validated across multiple transformer architectures. Patent pending.

RUNTIME · VERIFIED

Real inference. Real numbers. No retraining.

Every number measured at inference time. 3-seed validation on Mistral 7B (GQA-4) and Qwen 2.5 3B (GQA-2); additional architectures scout-tested (Phi-3 MHA, Llama 3.2 3B GQA-3, Llama 3.1 8B GQA-8, Qwen 3.6 35B-A3B). Same V-cache primitive across the reported architecture families, with validation reported per model. Compressed artifacts load through standard Hugging Face workflows.

MODEL SCALING
Near-zero degradation on 3-seed validations (Mistral 7B, Qwen 2.5 3B). Scout-tested on 5 more architectures.
DOMAIN TRANSFER
Transfers across domains. Better on code than on encyclopedia text.
VS PUBLISHED METHODS · NIAH RETRIEVAL
Mistral-7B-Instruct · 1080 trials · retrieval preserved at long context
CONTEXT SCALING
Quality preserved at long context on the architectures we tested. Per-model calibration step.
HOW IT WORKS
Eigenbasis rotation absorbed into $W_O$.
Compressed artifacts load through standard Hugging Face workflows.
No hooks. No kernels. No code changes. Load and serve.
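A minimal sketch of why the folded rotation costs nothing at inference (assumed single-head shapes and variable names; illustrative, not fraQtl's implementation): rotate the cached V into the eigenbasis of $V^\top\alpha^\top\alpha V$ and fold the inverse rotation into $W_O$ once, offline.

# Minimal sketch, not fraQtl's code: absorbing an eigenbasis rotation of the
# V-cache into the output projection W_O. Shapes and names are assumptions.
import numpy as np

d, T = 64, 128                                 # head dim, sequence length
rng = np.random.default_rng(0)
V = rng.standard_normal((T, d))                # value cache for one head
alpha = rng.dirichlet(np.ones(T), size=T)      # attention weights, rows sum to 1
W_O = rng.standard_normal((d, d))              # per-head output projection

G = V.T @ alpha.T @ alpha @ V                  # attention-weighted V gram matrix
_, R = np.linalg.eigh(G)                       # columns of R: orthonormal eigenbasis

V_rot = V @ R                                  # what the compressed cache stores
W_O_rot = R.T @ W_O                            # rotation folded into W_O, once, offline

out_ref = alpha @ V @ W_O                      # original attention output
out_rot = alpha @ V_rot @ W_O_rot              # identical up to float error
assert np.allclose(out_ref, out_rot)           # R @ R.T = I, so nothing changes at runtime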
KV CACHE MEMORY AT 32K CONTEXT
Without fraQtl: KV cache at full size. Needs 2 GPUs.
With fraQtl: V cache 3.5× smaller. Fits 1 GPU ✓
3.5× V-cache compression · 98.5% NIAH retrieval · no measurable overhead in tested setup
98.5% retrieval across 1080 long-context trials, 2 architectures. V+K variant gives ~3× total KV at 97.8% retrieval.
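A back-of-envelope version of the memory picture above (a sketch assuming public Mistral 7B shapes: 32 layers, 8 KV heads, head dim 128, FP16; per sequence, model weights and batching excluded):

# Rough KV-cache sizing at 32K context. Config values are assumptions
# (Mistral 7B: 32 layers, 8 KV heads, head_dim 128, FP16 = 2 bytes).
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
tokens = 32_768

one_side = layers * kv_heads * head_dim * fp16_bytes * tokens / 2**30   # K or V, in GiB
print(f"K + V at FP16:          {2 * one_side:.1f} GiB per sequence")
print(f"K + V/3.5 with fraQtl:  {one_side + one_side / 3.5:.1f} GiB per sequence")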
DOWNSTREAM TASK ACCURACY
fraQtl preserves or improves real task performance. Mistral-7B-Instruct, 3 out of 4 tasks improved.
TASK                    FP16     FRAQTL   DELTA
SQuAD v2 (QA)           58.5%    60.4%    +1.9%
TriviaQA (QA)           28.6%    31.3%    +2.7%
CNN/DailyMail           21.1%    20.6%    -0.5%
XSum (Summarization)    20.1%    23.1%    +3.0%
Average: +1.8% across 4 tasks
CORE RESULTS

Four findings that change how you compress transformers.

At matched storage budgets, quantization consistently outperforms rank reduction. The gap is not about basis — it is about the geometry of softmax routing.

01
Quantization outperforms rank reduction across tested models
In every model and budget tested, INT4 outperforms rank reduction at matched bit cost by 4 to 364 PPL. The margin grows with GQA aggressiveness.
02
Deletion causes discrete routing failures
Rank reduction flips 4.6% of attention routing decisions vs 0.03% for INT4. Bounded noise preserves score ordering. Deletion does not.
03
For KV cache, preserving directions matters more than choosing a basis
Quantization quality is basis-independent (spread $<0.4$ PPL across all rotations). The advantage is about preserving all dimensions.
04
Joint $K$+$V$ INT4 at 75% reduction costs +0.18 PPL
Both $K$ and $V$ are safely quantizable. Per-channel symmetric INT4 requires no retraining, no special basis, no optimization.
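A minimal sketch of the finding-04 recipe (illustrative, assuming a [tokens, channels] cache layout; not fraQtl's kernel): per-channel symmetric INT4 with nothing learned and nothing calibrated.

# Illustrative only: per-channel symmetric INT4 for a K or V cache tensor.
# No retraining, no special basis, no optimization.
import numpy as np

def quantize_int4_per_channel(x):
    """x: [tokens, channels] float array -> (int8 codes in [-7, 7], per-channel scales)."""
    scale = np.abs(x).max(axis=0) / 7.0
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero channels
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

V = np.random.default_rng(0).standard_normal((4096, 128)).astype(np.float32)
codes, scale = quantize_int4_per_channel(V)
max_err = np.abs(dequantize(codes, scale) - V).max()
print(f"max abs error: {max_err:.4f}")              # bounded by scale / 2 in every channel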
MECHANISM

Why deletion fails where noise succeeds.

Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.

Rank throws away signal.
Quantization preserves it.

Side-by-side demo: Rank Reduction (dimensions deleted) vs Quantization (precision reduced), both starting from the FP16 baseline at PPL 9.19.
THE FIGURE

Perplexity vs bits/dimension on Mistral 7B.

Rank reduction explodes below 4 bits. Uniform quantization collapses below 3 bits. fraQtl stays flat to 2 bits — the dead zone is where every other method fails.

Series: Rank Reduction · Uniform Quant (GPTQ) · fraQtl V-Split + Lloyd-Max · FP16 baseline.
MECHANISM — LIVE

Routing flips: why deletion is catastrophic.

When rank reduction deletes a KV direction, it creates a score perturbation $|\delta| \approx \sigma_{\text{removed}}$. If this exceeds the gap $\Delta = s_{i_1} - s_{i_2}$, attention flips to the wrong token. Quantization keeps $|\delta|$ bounded at $\frac{\sigma}{2^b}$, making the expected damage under the softmax Fisher metric $3 \times 2^{2b}$ times smaller ($768\times$ at INT4; see Proposition 2 below).
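A toy version of this flip mechanism (hypothetical setup with random directions; not the paper's 4.6% / 0.03% experiment): count how often the top-scoring token changes when one score direction is deleted versus when bounded INT4-scale noise is added.

# Toy simulation, not the paper's experiment: argmax flips under deletion vs
# bounded quantization-scale noise. All sizes and distributions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T, trials, b = 64, 256, 2000, 4
flips_del = flips_quant = 0

for _ in range(trials):
    q = rng.standard_normal(d)
    K = rng.standard_normal((T, d))
    top = (K @ q).argmax()

    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    K_del = K - np.outer(K @ u, u)                    # delete one direction entirely
    flips_del += (K_del @ q).argmax() != top

    sigma = K.std()
    noise = rng.uniform(-sigma / 2**b, sigma / 2**b, size=K.shape)
    flips_quant += ((K + noise) @ q).argmax() != top  # bounded perturbation ~ sigma / 2^b

print(f"deletion flip rate:     {flips_del / trials:.1%}")
print(f"quantization flip rate: {flips_quant / trials:.1%}")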

FRAQTL INT4: 94% routing stability ✅ · attention flow preserved, perturbation bounded at $\sigma/2^b$
RANK REDUCTION k=32: 61% routing stability ❌ · high attention drift, deleted directions create unbounded perturbation
Panels: FP16 baseline (Token A routes) · Rank Reduction (direction deleted) · Quantization INT4 (precision reduced).
QUANTIZATION VISUALIZED

Bounded noise vs information loss.

Every KV value survives quantization — just rounded to the nearest step. Rank reduction eliminates entire directions. Watch how precision degrades gracefully while deletion destroys structure.

Panels: FP16 original signal · INT4 quantized (all dims) · Rank-reduced (dims deleted).
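A small illustration of the contrast (synthetic data with a decaying spectrum; matched storage approximated by keeping d/4 FP16 dims against 4 bits on all d dims; not a fraQtl benchmark):

# Illustrative only: reconstruction error at roughly matched storage.
# INT4 keeps every dimension at 4 bits; rank reduction keeps d/4 dims at FP16.
import numpy as np

rng = np.random.default_rng(0)
d, T = 128, 4096
V = rng.standard_normal((T, d)) @ np.diag(np.linspace(2.0, 0.2, d))  # decaying channel scales

scale = np.abs(V).max(axis=0) / 7.0                  # per-channel symmetric INT4
V_int4 = np.clip(np.round(V / scale), -7, 7) * scale

k = d // 4                                           # rank budget at matched bits/vector
U, S, Wt = np.linalg.svd(V, full_matrices=False)
V_rank = U[:, :k] @ np.diag(S[:k]) @ Wt[:k]

print(f"relative error, INT4 (all dims kept):    {np.linalg.norm(V - V_int4) / np.linalg.norm(V):.3f}")
print(f"relative error, rank-{k} (dims deleted): {np.linalg.norm(V - V_rank) / np.linalg.norm(V):.3f}")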
INTERACTIVE

Pick your memory budget. See who wins.

At every storage constraint, fraQtl outperforms rank reduction. Drag the slider — the gap only widens.

MATCHED-BUDGET COMPARISON

Quantization vs rank reduction across all models.

Every row is a real experimental result at matched storage. Filter by model or method. Sort any column.

Columns: Model · Arch · Budget · Method · Dims · PPL · vs FP16 · Margin
THEORETICAL RESULT

A perturbation asymmetry formalises the gap.

Under the softmax Fisher metric, projection damage exceeds quantization damage by $3 \times 2^{2b}$ per direction — $768\times$ at INT4.

PROPOSITION 2 — PERTURBATION ASYMMETRY
For direction $u$ with signal $\sigma_u$ under $G = \mathrm{diag}(\alpha) - \alpha\alpha^\top$: $$\mathrm{KL}_{\mathrm{proj}} = \tfrac{1}{2}\,\sigma_u^2 \cdot u^\top G u \qquad \text{vs} \qquad \mathbb{E}[\mathrm{KL}_{\mathrm{quant}}] = \frac{\sigma_u^2}{2 \cdot 3 \cdot 2^{2b}} \cdot u^\top G u$$ $$\text{Ratio:}\quad 3 \times 2^{2b} \quad (768\times \text{ at INT4})$$ The sensitivity $u^\top G u$ cancels — both methods face the same softmax geometry. The difference is entirely in perturbation magnitude: rank reduction deletes the direction's contribution to the score gap $\Delta = s_{i_1} - s_{i_2}$; if $|\delta_i| > |\Delta|$, attention flips — a discrete failure. Quantization perturbs by $\mathcal{O}(\lambda_i / 2^b)$, crossing the boundary only when noise exceeds $\Delta$, which is rare at $b \geq 4$.
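A quick numeric check of the stated ratio under the standard uniform-noise model (a sketch, assuming a symmetric b-bit quantizer over $[-\sigma, \sigma]$ with rounding noise uniform on half a step):

# Sanity check of Proposition 2's ratio. Assumes the textbook uniform-noise
# model: step = 2*sigma / 2^b, noise ~ U[-step/2, step/2], variance step^2 / 12.
import numpy as np

sigma, b = 1.0, 4
step = 2 * sigma / 2**b
noise = np.random.default_rng(0).uniform(-step / 2, step / 2, 5_000_000)

var_quant = noise.var()                              # E[delta^2] for quantization
var_proj = sigma**2                                  # deletion removes the full sigma^2
print(var_quant, sigma**2 / (3 * 2**(2 * b)))        # both ~ 1/768
print(var_proj / var_quant, 3 * 2**(2 * b))          # ratio ~ 768 at b = 4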
TRY IT NOW

Two ways to compress.

Pre-compressed weights you can download today, or runtime KV cache compression for tested architectures.

WEIGHT-LEVEL
FREE

Pre-compressed models. Ready to use.

Download from HuggingFace. Load with transformers. No compression step needed.

# No fraQtl install needed
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
  "fraQtl/Mistral-7B-compressed")
Mistral 7B · k=64 ↗ Llama 3.2 3B · k=32 ↗ Qwen 2.5 3B · k=32 ↗
View all on HuggingFace →
RUNTIME
EARLY ACCESS

One line of code. 3.5× V-cache compression on tested architectures.

Compatible with GQA / MHA HuggingFace models (Qwen, Llama, Mistral, Phi tested). No retraining, no architecture changes. No measurable overhead in our benchmarks.

import fraqtl
fraqtl.enable_cache_compression(model, k=32, bits=3, compress_keys=True)
# 3.5× V-cache. 98.5% retrieval on 1080 trials.
Baseline vs fraQtl
NIAH retrieval: 97.0% → 98.5%
PPL: 5.2448 → 5.2568
V cache: 3.5× smaller
Overhead: none measured
Request Pilot →
RESEARCH

The mathematics behind fraQtl.

Peer review in progress. Full preprints available.

PAPER I · 2026 · arXiv:2604.11501

Quantization Dominates Rank Reduction for KV-Cache Compression

Quantization consistently outperforms rank reduction for KV-cache compression across transformer models. We derive the attention-weighted V-cache theorem ($V^\top\alpha^\top\alpha V$ eigenbasis) and show why deleting directions causes routing failures while bounded quantization noise preserves retrieval. At matched storage budgets, INT4 matches FP16 within sampling noise while rank reduction loses 4–364 PPL.

KV-Cache V-Theorem Softmax Geometry 5 Models · 124M–14B
arXiv ↗
PAPER II · 2026 · arXiv:2604.20682

Variance Is Not Importance: Spectral Analysis of Transformer Compressibility

A structural analysis of transformer compressibility showing that high-variance directions are not necessarily downstream-important directions. This motivates downstream-aware compression metrics instead of raw variance preservation. 46 experiments across GPT-2 and Mistral 7B; we identify five structural properties including variance ≠ importance, conditional block linearity, and a depth-dependent linearity gradient ($R^2$: 0.17→0.93).

Spectral Analysis Variance ≠ Importance Transformer Geometry
arXiv ↗
PAPER III · 2026 · IN PREP

Dual-Loss LoRA for Early Exit in Large Language Models

A separate adaptive-inference branch exploring early-exit capability through LoRA fine-tuning and dual-loss supervision. Not the main compression paper — part of fraQtl's broader inference-efficiency research.

Adaptive Inference Early Exit LoRA
In prep
THE RESEARCH

The math behind production inference compression.

This work emerged from a systematic attempt to improve rank reduction — and the discovery that the paradigm itself was the barrier. After exhausting every closed-form metric, perturbation series, and learned correction, the breakthrough came from changing the compression operator entirely.

The result is currently in peer review.

FraQtl AI Research
ML RESEARCH · COMPRESSION · ATTENTION GEOMETRY