RESEARCH · THE MATH BEHIND IT

The math behind why fraQtl preserves model behavior.

Production compression with a closed-form basis. The pullback metric over attention values picks the directions that downstream loss depends on; everything else is quantized with bounded distortion. Empirically validated across multiple transformer architectures. Patent pending.

RUNTIME · VERIFIED

Real inference. Real numbers. No retraining.

Every number measured at inference time. 3-seed validation on Mistral 7B (GQA-4) and Qwen 2.5 3B (GQA-2); additional architectures scout-tested (Phi-3 MHA, Llama 3.2 3B GQA-3, Llama 3.1 8B GQA-8, Qwen 3.6 35B-A3B). Same V-cache primitive across the reported architecture families, with validation reported per model. Compressed artifacts load through standard Hugging Face workflows.

MODEL SCALING
Near-zero degradation on 3-seed validations (Mistral 7B, Qwen 2.5 3B). Scout-tested on 5 more architectures.
DOMAIN TRANSFER
Transfers across domains. Better on code than on encyclopedia text.
VS PUBLISHED METHODS · NIAH RETRIEVAL
Mistral-7B-Instruct · 1080 trials · retrieval preserved at long context
CONTEXT SCALING
Quality preserved at long context on the architectures we tested. Per-model calibration step.
HOW IT WORKS
Eigenbasis rotation absorbed into $W_O$.
Compressed artifacts load through standard Hugging Face workflows.
No hooks. No kernels. No code changes. Load and serve.
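A minimal sketch of why the folded rotation costs nothing at inference (assumed single-head shapes and variable names; illustrative, not fraQtl's implementation): rotate the cached V into the eigenbasis of $V^\top\alpha^\top\alpha V$ and fold the inverse rotation into $W_O$ once, offline.

# Minimal sketch, not fraQtl's code: absorbing an eigenbasis rotation of the
# V-cache into the output projection W_O. Shapes and names are assumptions.
import numpy as np

d, T = 64, 128                                 # head dim, sequence length
rng = np.random.default_rng(0)
V = rng.standard_normal((T, d))                # value cache for one head
alpha = rng.dirichlet(np.ones(T), size=T)      # attention weights, rows sum to 1
W_O = rng.standard_normal((d, d))              # per-head output projection

G = V.T @ alpha.T @ alpha @ V                  # attention-weighted V gram matrix
_, R = np.linalg.eigh(G)                       # columns of R: orthonormal eigenbasis

V_rot = V @ R                                  # what the compressed cache stores
W_O_rot = R.T @ W_O                            # rotation folded into W_O, once, offline

out_ref = alpha @ V @ W_O                      # original attention output
out_rot = alpha @ V_rot @ W_O_rot              # identical up to float error
assert np.allclose(out_ref, out_rot)           # R @ R.T = I, so nothing changes at runtime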
KV CACHE MEMORY AT 32K CONTEXT
Without fraQtl: KV cache at full size. Needs 2 GPUs.
With fraQtl: V cache 3.5× smaller. Fits 1 GPU ✓
3.5× V-cache compression · 98.5% NIAH retrieval · no measurable overhead in tested setup
98.5% retrieval across 1080 long-context trials, 2 architectures. V+K variant gives ~3× total KV at 97.8% retrieval.
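A back-of-envelope version of the memory picture above (a sketch assuming public Mistral 7B shapes: 32 layers, 8 KV heads, head dim 128, FP16; per sequence, model weights and batching excluded):

# Rough KV-cache sizing at 32K context. Config values are assumptions
# (Mistral 7B: 32 layers, 8 KV heads, head_dim 128, FP16 = 2 bytes).
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
tokens = 32_768

one_side = layers * kv_heads * head_dim * fp16_bytes * tokens / 2**30   # K or V, in GiB
print(f"K + V at FP16:          {2 * one_side:.1f} GiB per sequence")
print(f"K + V/3.5 with fraQtl:  {one_side + one_side / 3.5:.1f} GiB per sequence")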
DOWNSTREAM TASK ACCURACY
fraQtl preserves or improves real task performance. Mistral-7B-Instruct, 3 out of 4 tasks improved.
TASK                    FP16     FRAQTL   DELTA
SQuAD v2 (QA)           58.5%    60.4%    +1.9%
TriviaQA (QA)           28.6%    31.3%    +2.7%
CNN/DailyMail           21.1%    20.6%    -0.5%
XSum (Summarization)    20.1%    23.1%    +3.0%
Average: +1.8% across 4 tasks
CORE RESULTS

Four findings that change how you compress transformers.

At matched storage budgets, quantization consistently outperforms rank reduction. The gap is not about basis — it is about the geometry of softmax routing.

01
Quantization outperforms rank reduction across tested models
In every model and budget tested, INT4 outperforms rank reduction at matched bit cost by 4 to 364 PPL. The margin grows with GQA aggressiveness.
02
Deletion causes discrete routing failures
Rank reduction flips 4.6% of attention routing decisions vs 0.03% for INT4. Bounded noise preserves score ordering. Deletion does not.
03
For KV cache, preserving directions matters more than choosing a basis
Quantization quality is basis-independent (spread $<0.4$ PPL across all rotations). The advantage is about preserving all dimensions.
04
Joint $K$+$V$ INT4 at 75% reduction costs +0.18 PPL
Both $K$ and $V$ are safely quantizable. Per-channel symmetric INT4 requires no retraining, no special basis, no optimization.
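A minimal sketch of the finding-04 recipe (illustrative, assuming a [tokens, channels] cache layout; not fraQtl's kernel): per-channel symmetric INT4 with nothing learned and nothing calibrated.

# Illustrative only: per-channel symmetric INT4 for a K or V cache tensor.
# No retraining, no special basis, no optimization.
import numpy as np

def quantize_int4_per_channel(x):
    """x: [tokens, channels] float array -> (int8 codes in [-7, 7], per-channel scales)."""
    scale = np.abs(x).max(axis=0) / 7.0
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero channels
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

V = np.random.default_rng(0).standard_normal((4096, 128)).astype(np.float32)
codes, scale = quantize_int4_per_channel(V)
max_err = np.abs(dequantize(codes, scale) - V).max()
print(f"max abs error: {max_err:.4f}")              # bounded by scale / 2 in every channel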
MECHANISM

Why deletion fails where noise succeeds.

Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.

Rank throws away signal.
Quantization preserves it.

Side-by-side demo: Rank Reduction (dimensions deleted) vs Quantization (precision reduced), both starting from the FP16 baseline at PPL 9.19.
THE FIGURE

Perplexity vs bits/dimension on Mistral 7B.

Rank reduction explodes below 4 bits. Uniform quantization collapses below 3 bits. fraQtl stays flat to 2 bits — the dead zone is where every other method fails.

Series: Rank Reduction · Uniform Quant (GPTQ) · fraQtl V-Split + Lloyd-Max · FP16 baseline.
MECHANISM — LIVE

Routing flips: why deletion is catastrophic.

When rank reduction deletes a KV direction, it creates a score perturbation $|\delta| \approx \sigma_{\text{removed}}$. If this exceeds the gap $\Delta = s_{i_1} - s_{i_2}$, attention flips to the wrong token. Quantization keeps $|\delta|$ bounded at $\frac{\sigma}{2^b}$, making the expected damage under the softmax Fisher metric $3 \times 2^{2b}$ times smaller ($768\times$ at INT4; see Proposition 2 below).
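A toy version of this flip mechanism (hypothetical setup with random directions; not the paper's 4.6% / 0.03% experiment): count how often the top-scoring token changes when one score direction is deleted versus when bounded INT4-scale noise is added.

# Toy simulation, not the paper's experiment: argmax flips under deletion vs
# bounded quantization-scale noise. All sizes and distributions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T, trials, b = 64, 256, 2000, 4
flips_del = flips_quant = 0

for _ in range(trials):
    q = rng.standard_normal(d)
    K = rng.standard_normal((T, d))
    top = (K @ q).argmax()

    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    K_del = K - np.outer(K @ u, u)                    # delete one direction entirely
    flips_del += (K_del @ q).argmax() != top

    sigma = K.std()
    noise = rng.uniform(-sigma / 2**b, sigma / 2**b, size=K.shape)
    flips_quant += ((K + noise) @ q).argmax() != top  # bounded perturbation ~ sigma / 2^b

print(f"deletion flip rate:     {flips_del / trials:.1%}")
print(f"quantization flip rate: {flips_quant / trials:.1%}")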

FRAQTL INT4: 94% routing stability ✅ · attention flow preserved, perturbation bounded at $\sigma/2^b$
RANK REDUCTION k=32: 61% routing stability ❌ · high attention drift, deleted directions create unbounded perturbation
Panels: FP16 baseline (Token A routes) · Rank Reduction (direction deleted) · Quantization INT4 (precision reduced).
QUANTIZATION VISUALIZED

Bounded noise vs information loss.

Every KV value survives quantization — just rounded to the nearest step. Rank reduction eliminates entire directions. Watch how precision degrades gracefully while deletion destroys structure.

Panels: FP16 original signal · INT4 quantized (all dims) · Rank-reduced (dims deleted).
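A small illustration of the contrast (synthetic data with a decaying spectrum; matched storage approximated by keeping d/4 FP16 dims against 4 bits on all d dims; not a fraQtl benchmark):

# Illustrative only: reconstruction error at roughly matched storage.
# INT4 keeps every dimension at 4 bits; rank reduction keeps d/4 dims at FP16.
import numpy as np

rng = np.random.default_rng(0)
d, T = 128, 4096
V = rng.standard_normal((T, d)) @ np.diag(np.linspace(2.0, 0.2, d))  # decaying channel scales

scale = np.abs(V).max(axis=0) / 7.0                  # per-channel symmetric INT4
V_int4 = np.clip(np.round(V / scale), -7, 7) * scale

k = d // 4                                           # rank budget at matched bits/vector
U, S, Wt = np.linalg.svd(V, full_matrices=False)
V_rank = U[:, :k] @ np.diag(S[:k]) @ Wt[:k]

print(f"relative error, INT4 (all dims kept):    {np.linalg.norm(V - V_int4) / np.linalg.norm(V):.3f}")
print(f"relative error, rank-{k} (dims deleted): {np.linalg.norm(V - V_rank) / np.linalg.norm(V):.3f}")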
INTERACTIVE

Pick your memory budget. See who wins.

At every storage constraint, fraQtl outperforms rank reduction. Drag the slider — the gap only widens.

MATCHED-BUDGET COMPARISON

Quantization vs rank reduction across all models.

Every row is a real experimental result at matched storage. Filter by model or method. Sort any column.

Columns: Model · Arch · Budget · Method · Dims · PPL · vs FP16 · Margin
THEORETICAL RESULT

A perturbation asymmetry formalises the gap.

Under the softmax Fisher metric, projection damage exceeds quantization damage by $3 \times 2^{2b}$ per direction — $768\times$ at INT4.

PROPOSITION 2 — PERTURBATION ASYMMETRY
For direction $u$ with signal $\sigma_u$ under $G = \mathrm{diag}(\alpha) - \alpha\alpha^\top$: $$\mathrm{KL}_{\mathrm{proj}} = \tfrac{1}{2}\,\sigma_u^2 \cdot u^\top G u \qquad \text{vs} \qquad \mathbb{E}[\mathrm{KL}_{\mathrm{quant}}] = \frac{\sigma_u^2}{2 \cdot 3 \cdot 2^{2b}} \cdot u^\top G u$$ $$\text{Ratio:}\quad 3 \times 2^{2b} \quad (768\times \text{ at INT4})$$ The sensitivity $u^\top G u$ cancels — both methods face the same softmax geometry. The difference is entirely in perturbation magnitude: rank reduction deletes the direction's contribution to the score gap $\Delta = s_{i_1} - s_{i_2}$; if $|\delta_i| > |\Delta|$, attention flips — a discrete failure. Quantization perturbs by $\mathcal{O}(\lambda_i / 2^b)$, crossing the boundary only when noise exceeds $\Delta$, which is rare at $b \geq 4$.
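A quick numeric check of the stated ratio under the standard uniform-noise model (a sketch, assuming a symmetric b-bit quantizer over $[-\sigma, \sigma]$ with rounding noise uniform on half a step):

# Sanity check of Proposition 2's ratio. Assumes the textbook uniform-noise
# model: step = 2*sigma / 2^b, noise ~ U[-step/2, step/2], variance step^2 / 12.
import numpy as np

sigma, b = 1.0, 4
step = 2 * sigma / 2**b
noise = np.random.default_rng(0).uniform(-step / 2, step / 2, 5_000_000)

var_quant = noise.var()                              # E[delta^2] for quantization
var_proj = sigma**2                                  # deletion removes the full sigma^2
print(var_quant, sigma**2 / (3 * 2**(2 * b)))        # both ~ 1/768
print(var_proj / var_quant, 3 * 2**(2 * b))          # ratio ~ 768 at b = 4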
TRY IT NOW

Two ways to compress.

Pre-compressed weights you can download today, or runtime KV cache compression for tested architectures.

WEIGHT-LEVEL
FREE

Pre-compressed models. Ready to use.

Download from HuggingFace. Load with transformers. No compression step needed.

# No fraQtl install needed
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
  "fraQtl/Mistral-7B-compressed")
Mistral 7B · k=64 ↗ Llama 3.2 3B · k=32 ↗ Qwen 2.5 3B · k=32 ↗
View all on HuggingFace →
RUNTIME
EARLY ACCESS

One line of code. 3.5× V-cache compression on tested architectures.

Compatible with GQA / MHA HuggingFace models (Qwen, Llama, Mistral, Phi tested). No retraining, no architecture changes. No measurable overhead in our benchmarks.

import fraqtl
fraqtl.enable_cache_compression(model, k=32, bits=3, compress_keys=True)
# 3.5× V-cache. 98.5% retrieval on 1080 trials.
Baseline vs fraQtl
NIAH retrieval: 97.0% → 98.5%
PPL: 5.2448 → 5.2568
V cache: 3.5× smaller
Overhead: none measured
Request Pilot →
RESEARCH

The mathematics behind fraQtl.

Peer review in progress. Full preprints available.

PAPER I · 2026 · arXiv:2604.11501

Quantization Dominates Rank Reduction for KV-Cache Compression

Quantization consistently outperforms rank reduction for KV-cache compression across transformer models. We derive the attention-weighted V-cache theorem ($V^\top\alpha^\top\alpha V$ eigenbasis) and show why deleting directions causes routing failures while bounded quantization noise preserves retrieval. At matched storage budgets, INT4 matches FP16 within sampling noise while rank reduction loses 4–364 PPL.

KV-Cache V-Theorem Softmax Geometry 5 Models · 124M–14B
arXiv ↗
PAPER II · 2026 · arXiv:2604.20682

Variance Is Not Importance: Spectral Analysis of Transformer Compressibility

A structural analysis of transformer compressibility showing that high-variance directions are not necessarily downstream-important directions. This motivates downstream-aware compression metrics instead of raw variance preservation. 46 experiments across GPT-2 and Mistral 7B; we identify five structural properties including variance ≠ importance, conditional block linearity, and a depth-dependent linearity gradient ($R^2$: 0.17→0.93).

Spectral Analysis Variance ≠ Importance Transformer Geometry
arXiv ↗
PAPER III · 2026 · IN PREP

Dual-Loss LoRA for Early Exit in Large Language Models

A separate adaptive-inference branch exploring early-exit capability through LoRA fine-tuning and dual-loss supervision. Not the main compression paper — part of fraQtl's broader inference-efficiency research.

Adaptive Inference Early Exit LoRA
In prep
THE RESEARCH

The math behind production inference compression.

This work emerged from a systematic attempt to improve rank reduction — and the discovery that the paradigm itself was the barrier. After exhausting every closed-form metric, perturbation series, and learned correction, the breakthrough came from changing the compression operator entirely.

The result is currently in peer review.

FraQtl AI Research
ML RESEARCH · COMPRESSION · ATTENTION GEOMETRY