PRODUCTION INFERENCE COMPRESSION

Run longer-context LLMs on fewer GPUs.

fraQtl compresses model weights and KV cache for production inference. Same model behavior, lower VRAM, longer context, no retraining.

Qwen 3.6 35B-A3B: 128K on 1× A100
KV cache: 3.5× smaller V-cache
NIAH retrieval: 98.5% · 1080 trials
No retraining: calibrate + compress

Free technical pilot for the first 5 qualified design partners.
Prefer email? contact@fraqtl.ai

THE PROBLEM

KV cache and VRAM are the bottleneck.

At long context, KV cache size determines how many users you can serve concurrently and what models fit on your hardware. Most teams are forced to choose: shorter context, smaller model, or more GPUs. fraQtl removes that choice.
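To make the scaling concrete, here is a minimal sketch of the standard KV-cache size formula for a decoder-only transformer. The function name is hypothetical, and the layer count, KV-head count, and head dimension are illustrative placeholders, not the Qwen 3.6 35B-A3B configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes held by the KV cache: one K and one V tensor per layer."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch

# Illustrative config (assumed, not taken from any fraQtl artifact).
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"{size / 2**30:.1f} GiB per 128K-token sequence at FP16")  # 16.0 GiB
```

The cache grows linearly with both sequence length and batch size, which is why context length and concurrency compete for the same memory.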

QWEN 3.6 35B-A3B · 1× A100-80GB
VRAM at increasing context length

FP16 hits the GPU ceiling at 64K. fraQtl runs 128K with 28 GB headroom.

[Chart: VRAM by context length on one A100-80GB, FP16 baseline vs. fraQtl compressed. 16K: 71 GB vs. 25.6 GB (both fit). 64K: 82.9 GB (OOM; FP16 needs 2 GPUs) vs. 36.8 GB. 128K: 85+ GB (OOM) vs. 51.7 GB (fraQtl: 28 GB free).]

FP16 64K = 82.86 GB measured (OOM on 80 GB ceiling). 16K and 128K FP16 figures are conservative model + KV-cache extrapolations. fraQtl numbers measured on a single A100-80GB with the public artifact.

30-DAY PILOT

Bring your model. Keep the savings.

Send us your model and a workload sample. We calibrate, then deliver a compressed artifact, a benchmark report, and an integration path for your deployment stack. If the numbers don't move, no commitment.

1. You send
Model (HF org/name), a small workload sample, target context length, and current GPU setup. The Tally form takes ~2 min.
2. We calibrate + benchmark
Per-model calibration on your workload. Compressed artifact + benchmark report against your FP16 baseline. Pilot turnaround: 1–2 weeks.
3. You deploy
Drop-in compressed artifact, or runtime KV integration on your stack (vLLM integration in progress; SGLang / TGI / custom supported). Integration support included.
DELIVERY 1 · ARTIFACT
Compressed model on HuggingFace
Loads through standard Transformers. No special kernels. Drop-in for inference servers.
DELIVERY 2 · RUNTIME KV
One-line cache compression
Compose with the artifact for additional savings at long context: enable_cache_compression(model)
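The two delivery paths might combine roughly as sketched below. The loader calls are the stock Transformers API. The helper names load_compressed and apply_cache_compression are hypothetical, and because the import path for enable_cache_compression is not stated on this page, the hook is passed in as an argument instead of imported.

```python
MODEL_ID = "fraQtl/Qwen3.6-35B-A3B-compressed"  # public artifact repo

def load_compressed(model_id=MODEL_ID):
    """Load the artifact with the stock Transformers loader (large download)."""
    # Imported lazily so this sketch can be read and imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # requires the `accelerate` package
    )
    return tokenizer, model

def apply_cache_compression(model, enable_fn):
    """Call the runtime hook; keep the original model if the hook returns None."""
    result = enable_fn(model)
    return model if result is None else result

# Usage (assumption: enable_cache_compression comes from fraqtl-runtime;
# check the model card for the actual import path):
#   tokenizer, model = load_compressed()
#   model = apply_cache_compression(model, enable_cache_compression)
```

The wrapper handles both possibilities, an in-place hook or one that returns a wrapped model, since the page does not say which applies.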

RELEASES · SHIP LOG · LATEST FIRST

Compressed models, ready to load.

Public artifacts on HuggingFace. Reproducible numbers. Lower-memory benchmarked artifacts that load through the standard Transformers loader.

LIVE TODAY · APR 27, 2026

Qwen 3.6 35B-A3B compressed

MoE · 35B params (3B active) · weight-compressed artifact · 23.8 GB on disk
MMLU: 82.24% (FP16: 82.40% · −0.16 pp)
∞Bench Passkey: 30 / 30 at 125,315 tokens
VRAM @ 128K: 51.7 GB (FP16 OOMs · 1× A100)
INSTALL: pip install fraqtl-runtime
Next: Llama 3.1 70B compressed · Gemma family
WHY IT MATTERS

Fewer GPUs. Longer context. Lower cost.

At long context, KV cache becomes a major driver of GPU memory. fraQtl reduces memory pressure while preserving downstream retrieval and task quality, allowing larger contexts or more concurrent workloads on the same hardware. The V-only path achieves 3.5× V-cache compression at 98.5% NIAH retrieval; a V+K variant reaches ~3× total KV reduction at 97.8% NIAH retrieval.
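The V-cache ratio and the total KV ratio are different quantities. A short sketch of how they relate, assuming the K and V caches are the same size before compression (true for standard attention layouts, but an assumption here); the function name is hypothetical.

```python
def total_kv_ratio(k_ratio, v_ratio):
    """Overall KV compression when K and V start at equal size."""
    # Compressed size, in units of one uncompressed K (or V) cache.
    compressed_size = 1.0 / k_ratio + 1.0 / v_ratio
    return 2.0 / compressed_size

v_only = total_kv_ratio(k_ratio=1.0, v_ratio=3.5)
print(f"V-only total KV reduction: {v_only:.2f}x")  # 1.56x
```

With K left at full precision, a 3.5× V-cache works out to roughly 1.56× on the whole KV cache, which is why reaching ~3× total requires compressing K as well.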

Longer context
32K · 64K · 128K
Long-context inference becomes more practical when cache memory is compressed.
Fewer GPUs
35B AT 128K · 1× A100
Memory compression can turn multi-GPU deployments into single-GPU for selected workloads.
More concurrency
SAME MEMORY BUDGET
Smaller KV cache means more requests can fit into the same memory budget.
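As a rough illustration of the concurrency effect, the sketch below counts how many requests fit once the weights are resident. All numbers are hypothetical placeholders chosen for easy arithmetic, not measurements, and the function name is made up for this example.

```python
def max_concurrent(budget_gb, weights_gb, kv_gb_per_request):
    """Requests that fit in the memory left after loading the weights."""
    free = budget_gb - weights_gb
    if free <= 0 or kv_gb_per_request <= 0:
        return 0
    return int(free // kv_gb_per_request)

# Hypothetical numbers for illustration only.
budget, weights, kv_per_request = 80.0, 24.0, 6.0
before = max_concurrent(budget, weights, kv_per_request)
after = max_concurrent(budget, weights, kv_per_request / 3.0)  # ~3x smaller KV cache
print(before, after)  # 9 28
```

Real servers also reserve memory for activations and fragmentation, so treat the result as an upper bound.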

Quantization in the right basis replaces rank reduction as the default for KV compression.

BENCHMARKS

The numbers behind the pilot.

All numbers are traceable to our public HuggingFace model card. Run pip install fraqtl-runtime to reproduce them.

HOW WE FRAME COMPARISONS

BF16 / FP16 is the quality reference. Public Q4 is the practical size baseline teams already deploy. fraQtl targets FP16-level behavior plus long-context KV memory savings the existing Q4 stack doesn't address.

Matched Q4_K_M bake-off in progress; results will be published when locked.

RECEIPT 1

Four numbers vs the nearest published method.

METRIC | FRAQTL | NEAREST PUBLISHED | DIFFERENCE
NIAH retention (1080 trials) | 98.5% | 97.8% · TOVA | +0.7 pp
PPL Δ at 3.5× compression | +0.012 | +0.214 · SnapKV | ~18× tighter
NIAH at 128K (Llama 3.1 8B) | 100% | 0% · KIVI-2 | +100 pp
GPUs needed at 128K (35B MoE) | 1× A100-80GB | 2× A100-80GB · FP16 | 50% hardware
Sources: Mistral-7B-Instruct C44 bake-off (1080 NIAH trials, 8K–31K, 3 needle types). Llama 3.1 8B C44d/e (128K, 3 needles × 5 trials per depth). Qwen 3.6 35B-A3B VRAM measured on a single A100-80GB.
Honest note: at matched 4-bit, fraQtl ties KIVI-4 / KVQuant-4 within sampling noise. The wins above are at sub-4-bit, at 128K, vs eviction methods, and on hardware footprint.
Full method breakdown · 7 baselines, 7 axes
METHOD | MEMORY | QUALITY | CALIBRATION | OVERHEAD | NIAH (1080 TRIALS) | PPL Δ (V-ONLY)
fraQtl (V-only) | 3.5× ✓ | ✓ within noise | ✓ 0.3s | 0% | 98.5% | +0.012
fraQtl (V+K) | ~3× ✓ | ✓ within noise | ✓ 0.3s | 0% | 97.8% |
FP16 baseline | | | | | 97.0% | 0
TOVA | | ⚠ degrades | ✓ none | per-token | 97.8% | +0.259
SnapKV | | ⚠ degrades | ✓ none | per-token | 94.1% | +0.214
StreamingLLM | | ✗ poor | ✓ none | per-token | 37.8% | +0.548
H2O † | | ✗ OOM | ✓ none | eager attn | 0.0% |
KVQuant-2 | ✓ 2-bit | ⚠ moderate | ✗ 5–15 min | custom kernels | | +0.27
KIVI-2 | ✓ 2-bit | ✗ collapses at 128K | ✓ none | per-token | 37.8% → 0%@128K | +1.00
NIAH: needle-in-a-haystack retrieval. C44 bake-off, Mistral-7B-Instruct, 8K–31K context, 3 needle types × 360 trials per method.
† H2O: OOM at 16K+ context (eager attention required).
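For readers new to NIAH, a single trial can be sketched as below: hide a fact at a chosen depth in filler text, ask for it, and check whether the answer comes back. This is a generic illustration, not the C44 harness; the filler text, needle wording, and function names are all assumptions.

```python
import random

def build_niah_prompt(needle, question, filler, n_sentences, depth):
    """Place `needle` at fractional `depth` (0.0 = start, 1.0 = end) of a filler haystack."""
    position = round(depth * n_sentences)
    sentences = [filler] * n_sentences
    sentences.insert(position, needle)
    return " ".join(sentences) + "\n\n" + question

def retrieved(model_output, expected):
    """Score one trial: did the expected answer appear in the output?"""
    return expected.lower() in model_output.lower()

rng = random.Random(0)
secret = str(rng.randint(10000, 99999))
prompt = build_niah_prompt(
    needle=f"The passkey is {secret}.",
    question="What is the passkey?",
    filler="The grass is green and the sky is blue.",
    n_sentences=200,
    depth=0.5,
)
# A real harness would send `prompt` to the model; here we fake a correct answer.
print(retrieved(f"The passkey is {secret}.", secret))
```

A retention figure is then the fraction of trials scored as retrieved, aggregated over depths, context lengths, and needle types.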
RECEIPT 2

Quality matches FP16 within sampling noise.

Qwen 3.6 35B-A3B compressed (single-seed) · vs FP16 baseline
BENCHMARK | FP16 | FRAQTL | Δ
MMLU 0-shot | 82.40% | 82.24% | −0.16 pp
∞Bench Passkey @ 125,315 tokens | 30 / 30 | 30 / 30 | parity
HumanEval pass@1 | reference | 100% retention | within sampling noise
Wikitext-2 PPL Δ | | +0.033 | tight
Single-seed numbers from the public Qwen 3.6 artifact. 3-seed bit-identical verification on PPL: σ = 0 across seeds 7 / 123 / 2024. Architecture-level 3-seed validation (V/K theorem) on Mistral 7B GQA-4 + Qwen 2.5 3B GQA-2 + Llama 3.2 3B + Phi-3.
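One way to sanity-check a "within sampling noise" reading of the MMLU gap is a binomial standard error. The sketch assumes the evaluation used the standard 14,042-question MMLU test split (not stated here) and treats the two runs as unpaired, which usually overstates the noise for a same-question comparison.

```python
import math

def binomial_se(p, n):
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)

N_QUESTIONS = 14042  # assumed: size of the standard MMLU test split
fp16, fraqtl = 0.8240, 0.8224

delta_pp = (fraqtl - fp16) * 100
# Unpaired standard error of the difference, in percentage points.
se_diff_pp = 100 * math.sqrt(
    binomial_se(fp16, N_QUESTIONS) ** 2 + binomial_se(fraqtl, N_QUESTIONS) ** 2
)
print(f"delta = {delta_pp:+.2f} pp, SE of difference = {se_diff_pp:.2f} pp")
```

Under these assumptions the −0.16 pp gap is well under one standard error of the difference.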
RECEIPT 3

Memory savings break the multi-GPU requirement.

VRAM · Single A100-80GB
QWEN 3.6 35B-A3B

FP16 OOMs at 64K. fraQtl runs 128K with 28 GB headroom.

Same model. Same hardware. 8× the context. The difference between needing 1 GPU and needing 2.

CONTEXT | FP16 BASELINE | FRAQTL | HARDWARE
16K | ~71 GB | 25.6 GB | Both fit on 1× A100-80GB
64K | 82.9 GB → OOM | 36.8 GB | FP16 needs 2 GPUs; fraQtl 1
128K | 85+ GB | 51.7 GB | FP16 needs 2 GPUs; fraQtl 1 · 28 GB free
FP16 64K = 82.86 GB measured (OOMs on 80 GB ceiling). 16K and 128K FP16 figures are model + KV-cache extrapolations — conservative lower bounds.
fraQtl numbers measured on a single A100-80GB with the public artifact at huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed.
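The headroom figures follow directly from the table. The sketch below recomputes them and also derives the marginal fraQtl memory per 1K of context between measured points. The 85.0 entry stands in for the "85+ GB" lower bound, and the per-1K slope is a derived illustration, not a published number.

```python
CEILING_GB = 80.0
# (context in K, FP16 GB, fraQtl GB) from the table above.
rows = [(16, 71.0, 25.6), (64, 82.9, 36.8), (128, 85.0, 51.7)]

for ctx, fp16_gb, fraqtl_gb in rows:
    fits = "fits" if fp16_gb <= CEILING_GB else "OOM"
    headroom = CEILING_GB - fraqtl_gb
    print(f"{ctx:>4}K  FP16 {fits:<4}  fraQtl headroom {headroom:.1f} GB")

# Marginal fraQtl memory per 1K of context between consecutive measurements.
slopes = [(b[2] - a[2]) / (b[0] - a[0]) for a, b in zip(rows, rows[1:])]
print([round(s, 3) for s in slopes])
```

Both intervals come out near 0.23 GB per 1K of context, consistent with a memory footprint that grows linearly with context length.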
MECHANISM

Why deletion fails where noise succeeds.

Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.

Rank throws away signal.
Quantization preserves it.

[Interactive figure: two KV-cache grids, each starting from the FP16 perplexity of 9.19. Left panel: rank reduction (dimensions deleted). Right panel: quantization (precision reduced).]
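A toy version of the comparison, assuming isotropic random keys and a matched 4× memory budget: either keep 16 of 64 dimensions, or keep all 64 at 4 bits instead of 16. This is not the fraQtl method. Real KV caches have structure that makes rank reduction less lossy than it is here, and the per-vector scale stored by the quantizer is ignored in the budget.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def truncate(vec, keep):
    """Rank-reduction stand-in: keep the first `keep` dimensions, zero the rest."""
    return vec[:keep] + [0.0] * (len(vec) - keep)

def quantize(vec, bits):
    """Symmetric uniform quantization with one absmax scale per vector."""
    levels = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    scale = max(abs(x) for x in vec) / levels
    if scale == 0.0:
        return list(vec)
    return [round(x / scale) * scale for x in vec]

def score_mse(query, keys, compress):
    """Mean squared error of the q.k attention logits after compressing the keys."""
    errs = [(dot(query, k) - dot(query, compress(k))) ** 2 for k in keys]
    return sum(errs) / len(errs)

rng = random.Random(0)
dim, n_keys = 64, 256
query = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]

mse_truncate = score_mse(query, keys, lambda k: truncate(k, dim // 4))
mse_quantize = score_mse(query, keys, lambda k: quantize(k, 4))
print(f"truncation MSE:   {mse_truncate:.3f}")
print(f"quantization MSE: {mse_quantize:.3f}")
```

Deleting dimensions removes their whole contribution to each logit, while quantization adds small, roughly zero-mean noise to every dimension, so the logit error is far smaller in this setup.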
Full mechanism on /research →
PILOT INTAKE

Send us your model. We'll tell you if fraQtl can help.

A 30-day technical pilot. We calibrate on your workload, benchmark against your FP16 baseline, and hand you a deployable artifact. Free for the first 5 qualified design partners.

Direct contact: contact@fraqtl.ai
Replies within 1 business day.