CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

AI Summary of Scholarly Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article.

2026-02-05


Publication Signals show what we were able to verify about where this research was published. Available publication signals for this source were verified. Publication Signals reflect the source’s verifiable credentials, not the quality of the research.

Fewer signals were independently confirmable for this source. That reflects the limits of what’s on record — not a judgment about the research.

✔ Published in indexed journal
✔ No retraction or integrity flags

Key findings from this study

The study found that disaggregated KV-cache offloading to FPGA memory via CXL interconnects achieves 3.2× throughput improvements over GPU-only baselines.
The authors report that FPGA-accelerated compression reduces memory bandwidth requirements by up to 4× without degrading inference accuracy.
The researchers report that speculative cache prefetching reduces latency by predicting which cache entries upcoming tokens will need during autoregressive decoding.

Overview

CXL-SpecKV proposes a disaggregated KV-cache architecture for LLM serving that offloads key-value caches to remote FPGA memory via Compute Express Link interconnects. The system combines memory disaggregation, speculative prefetching, and FPGA-accelerated compression to mitigate GPU memory constraints during autoregressive decoding.
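To make the GPU memory constraint concrete, the following is a minimal sketch of the standard KV-cache size estimate for a transformer decoder. The model shape (32 layers, 32 KV heads, head width 128, FP16 entries) and the batch and sequence sizes are hypothetical values chosen for illustration; none of them come from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, not taken from the paper.
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=8)
print(f"{total / 2**30:.1f} GiB")  # 16.0 GiB
```

Because the estimate grows linearly with both sequence length and batch size, the cache can outgrow GPU memory well before compute becomes the limit, which is the pressure that offloading is meant to relieve.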

Methods and approach

The architecture comprises three components: a CXL-based framework that relocates KV-caches to FPGA memory with minimal latency overhead, a speculative mechanism that predicts and preloads the cache entries that upcoming tokens will need, and an FPGA engine that compresses and decompresses KV-cache data. The evaluation used state-of-the-art LLMs and compared against GPU-only baselines on throughput, memory cost, and inference accuracy.
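How the three components might fit together can be sketched in software as a toy two-tier cache. This is an illustration under stated assumptions, not the authors' design: the class name TieredKVCache is hypothetical, a Python dict stands in for CXL-attached FPGA memory, zlib stands in for the hardware compression engine, and a simple next-block guess stands in for the speculative predictor, whose actual design the summary does not describe.

```python
import zlib
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier KV-cache: a small fast tier backed by a compressed far tier."""

    def __init__(self, fast_capacity: int, prefetch_depth: int = 2):
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.fast = OrderedDict()  # block_id -> raw bytes, kept in LRU order
        self.far = {}              # block_id -> compressed bytes (far-memory stand-in)
        self.fast_capacity = fast_capacity
        self.prefetch_depth = prefetch_depth
        self.hits = 0
        self.misses = 0

    def put(self, block_id: int, data: bytes) -> None:
        """Store a block in the far tier in compressed form."""
        self.far[block_id] = zlib.compress(data)

    def _load(self, block_id: int) -> None:
        """Decompress a block into the fast tier, evicting LRU blocks if full."""
        if block_id in self.fast or block_id not in self.far:
            return
        while len(self.fast) >= self.fast_capacity:
            self.fast.popitem(last=False)  # the far tier still holds a copy
        self.fast[block_id] = zlib.decompress(self.far[block_id])

    def get(self, block_id: int) -> bytes:
        """Return a block, then speculatively preload the next few blocks."""
        if block_id in self.fast:
            self.hits += 1
            self.fast.move_to_end(block_id)
        else:
            self.misses += 1
            self._load(block_id)
        data = self.fast[block_id]
        # Assumed predictor: guess that the following blocks are needed next.
        for nxt in range(block_id + 1, block_id + 1 + self.prefetch_depth):
            self._load(nxt)
        return data


cache = TieredKVCache(fast_capacity=4, prefetch_depth=2)
for i in range(8):
    cache.put(i, bytes([i]) * 64)
for i in range(8):
    cache.get(i)
print(f"hits={cache.hits} misses={cache.misses}")  # hits=7 misses=1
```

With a sequential access pattern, only the first read misses, because each read pulls the next blocks into the fast tier before they are requested. A real predictor would face less regular access patterns, and mispredictions would waste both link bandwidth and fast-tier capacity.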

Results

CXL-SpecKV achieved 3.2× higher throughput relative to GPU-only systems while reducing memory costs by 2.8×. The FPGA-accelerated compression engine reduced memory bandwidth requirements by up to 4×. Inference accuracy remained comparable to baseline models across evaluated configurations.
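As a rough illustration of what a 4× reduction in bytes moved means for an interconnect, the sketch below computes the time to transfer a KV-cache payload over a link with and without compression. The 2 GiB payload and the 64 GB/s link bandwidth are assumptions made for the example, and the model ignores compression and decompression latency; it is not a reconstruction of the paper's measurements.

```python
def transfer_seconds(payload_bytes: float, link_bytes_per_s: float,
                     compression_ratio: float = 1.0) -> float:
    """Time to move a payload over a link after shrinking it by a given ratio."""
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    return payload_bytes / compression_ratio / link_bytes_per_s


payload = 2 * 2**30  # hypothetical: 2 GiB of KV-cache data
link = 64e9          # hypothetical: 64 GB/s link bandwidth

raw = transfer_seconds(payload, link)
packed = transfer_seconds(payload, link, compression_ratio=4.0)
print(f"uncompressed: {raw * 1e3:.1f} ms, 4x compressed: {packed * 1e3:.1f} ms")
# uncompressed: 33.6 ms, 4x compressed: 8.4 ms
```

The reported 3.2× throughput gain is an end-to-end figure and need not track this ratio directly, since decoding time also depends on compute, prefetch accuracy, and codec overhead.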

The disaggregated approach alleviated GPU memory bottlenecks during batch decoding. Speculative prefetching reduced latency by predicting cache access patterns. The system increased serving capacity without sacrificing model output quality.

Implications

Disaggregated memory architectures using CXL interconnects provide a viable pathway for scaling LLM serving in datacenter environments. Hardware-software codesign combining FPGAs with intelligent prefetching mechanisms can extend effective memory capacity beyond physical GPU constraints. The approach enables higher batch sizes and throughput, improving resource utilization in production inference clusters.

These results suggest that memory wall challenges in LLM deployment warrant heterogeneous solutions beyond traditional GPU upgrades. Organizations deploying large-scale LLM inference could adopt similar disaggregation strategies to optimize cost-performance tradeoffs. Further investigation into CXL maturity, interconnect standardization, and integration with diverse LLM architectures remains necessary for widespread adoption.

Scope and limitations

This summary is based on the study abstract and available metadata. It does not include a full analysis of the complete paper, supplementary materials, or underlying datasets unless explicitly stated. Findings should be interpreted in the context of the original publication.

Disclosure

Research title: CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
Authors: Dong Liu, Yang Yu
Institutions: Columbia University, Yale University
Publication date: 2026-02-05
DOI: https://doi.org/10.1145/3748173.3779188
Disclosure: This post was generated by Claude (Anthropic). The original authors did not write or review this post.
