CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

Image Credit: Photo by Homa Appliances on Unsplash

AI Summary of Scholarly Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article. See the Disclosure section below for details.

Publication Signals show what we were able to verify about where this research was published.

STANDARD: Available publication signals for this source were verified. Publication Signals reflect the source’s verifiable credentials, not the quality of the research.

Fewer signals were independently confirmable for this source. That reflects the limits of what’s on record — not a judgment about the research.

  • ✔ Published in indexed journal
  • ✔ No retraction or integrity flags

Key findings from this study

  • The study found that disaggregated KV-cache offloading to FPGA memory via CXL interconnects achieves 3.2× throughput improvements over GPU-only baselines.
  • The authors report that FPGA-accelerated compression reduces memory bandwidth requirements by up to 4× without degrading inference accuracy.
  • The researchers demonstrate that speculative cache prefetching reduces latency by predicting future token access patterns during autoregressive decoding.

Overview

CXL-SpecKV proposes a disaggregated KV-cache architecture for LLM serving that offloads key-value caches to remote FPGA memory via Compute Express Link interconnects. The system combines memory disaggregation, speculative prefetching, and FPGA-accelerated compression to mitigate GPU memory constraints during autoregressive decoding.

Methods and approach

The architecture comprises three components: a CXL-based framework that relocates KV-caches to FPGA memory with minimal latency overhead, a speculative mechanism that predicts and preloads the cache entries needed for upcoming tokens, and an FPGA engine that compresses and decompresses KV-cache data. The evaluation ran state-of-the-art LLMs and compared CXL-SpecKV against GPU-only baselines on throughput, memory cost, and inference accuracy.
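The interplay between the first two components can be illustrated with a toy model. The sketch below is not the authors' design: the class name `TieredKVCache`, the LRU eviction policy, and the "next blocks are the sequentially following ones" predictor are all assumptions made for illustration. A small fast tier stands in for GPU memory and a dictionary stands in for CXL-attached FPGA memory.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier KV-cache (hypothetical, for illustration only).

    fast: small LRU tier standing in for GPU memory.
    slow: large tier standing in for CXL-attached FPGA memory.
    """

    def __init__(self, fast_capacity, prefetch_depth=2):
        self.fast = OrderedDict()  # block_id -> data, kept in LRU order
        self.slow = {}             # block_id -> data, unbounded in this toy
        self.fast_capacity = fast_capacity
        self.prefetch_depth = prefetch_depth
        self.hits = 0
        self.misses = 0

    def put(self, block_id, data):
        """Store a KV block in the slow tier."""
        self.slow[block_id] = data

    def _promote(self, block_id):
        """Copy a block into the fast tier, evicting least recently used blocks."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return
        if block_id not in self.slow:
            return  # nothing to prefetch
        self.fast[block_id] = self.slow[block_id]
        while len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)

    def get(self, block_id):
        """Return a block, then speculatively prefetch the following block ids."""
        if block_id in self.fast:
            self.hits += 1
            self.fast.move_to_end(block_id)
        else:
            self.misses += 1
            self._promote(block_id)
        data = self.fast[block_id]  # raises KeyError for unknown blocks
        # Assumed predictor: the next blocks needed follow sequentially.
        for offset in range(1, self.prefetch_depth + 1):
            self._promote(block_id + offset)
        return data


def run(prefetch_depth):
    """Read ten blocks in order and report (hits, misses) in the fast tier."""
    cache = TieredKVCache(fast_capacity=4, prefetch_depth=prefetch_depth)
    for block_id in range(10):
        cache.put(block_id, f"kv-block-{block_id}")
    for block_id in range(10):
        cache.get(block_id)
    return cache.hits, cache.misses


with_prefetch = run(prefetch_depth=2)     # (9, 1): only the first read misses
without_prefetch = run(prefetch_depth=0)  # (0, 10): every read goes to the slow tier
```

In this toy, a miss represents a read that would have to wait on the remote tier, so the drop from ten misses to one shows why an accurate predictor can hide most of the remote-access latency. A real predictor would likely be less accurate than this perfectly sequential access pattern suggests.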

Results

CXL-SpecKV achieved 3.2× higher throughput relative to GPU-only systems while reducing memory costs by 2.8×. The FPGA-accelerated compression engine reduced memory bandwidth requirements by up to 4×. Inference accuracy remained comparable to baseline models across evaluated configurations.
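The summary does not say how the compression engine reaches its 4× figure. One common way such a ratio arises, shown here purely as an assumption, is storing 16-bit values as 4-bit codes. The sketch below uses hypothetical helpers `quantize_4bit` and `dequantize_4bit` with a single shared scale; it is lossy, ignores the few bytes needed to store the scale, and should not be read as the paper's scheme.

```python
import struct


def quantize_4bit(values):
    """Quantize floats to 4-bit codes with one shared scale, two codes per byte."""
    peak = max(abs(v) for v in values) or 1.0
    scale = peak / 7.0  # signed codes span -7..7
    # Clamp to -7..7, then shift by 8 so stored codes are 1..15.
    codes = [max(-7, min(7, round(v / scale))) + 8 for v in values]
    if len(codes) % 2:
        codes.append(8)  # pad with the code for zero
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_4bit(packed, scale, count):
    """Unpack 4-bit codes and rescale them to floats."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return [(c - 8) * scale for c in codes[:count]]


values = [0.5, -1.25, 0.0, 2.0, -0.75, 1.5, 0.25, -2.0]
fp16_bytes = struct.pack(f"<{len(values)}e", *values)  # half-precision baseline
packed, scale = quantize_4bit(values)
restored = dequantize_4bit(packed, scale, len(values))

ratio = len(fp16_bytes) / len(packed)  # 16 bytes / 4 bytes = 4.0
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

Because every byte moved over the interconnect now carries four values' worth of cache data instead of one, the bandwidth needed per token falls by the same factor as the storage. The rounding error in `max_error` is bounded by half the scale step, which is the kind of loss any real scheme would need to keep small enough to preserve accuracy.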

The disaggregated approach alleviated GPU memory bottlenecks during batch decoding, and speculative prefetching reduced latency by predicting cache access patterns. The system increased serving capacity without sacrificing model output quality.

Implications

Disaggregated memory architectures using CXL interconnects provide a viable pathway for scaling LLM serving in datacenter environments. Hardware-software codesign combining FPGAs with intelligent prefetching mechanisms can extend effective memory capacity beyond physical GPU constraints. The approach enables higher batch sizes and throughput, improving resource utilization in production inference clusters.
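A rough calculation shows why extra memory capacity translates into larger batches. The per-token KV-cache size follows the standard formula for transformer decoders (keys plus values, for every layer and KV head). The model shape and memory budgets below are hypothetical and are not taken from the paper; the function names are likewise illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV-cache one token occupies: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(memory_bytes, seq_len, per_token_bytes):
    """Largest number of sequences of length seq_len whose KV-cache fits."""
    return memory_bytes // (seq_len * per_token_bytes)


GIB = 1024 ** 3

# Hypothetical model shape with 16-bit cache values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Hypothetical memory budgets reserved for KV-cache only (weights excluded).
gpu_only = max_batch_size(40 * GIB, seq_len=4096, per_token_bytes=per_token)
with_remote = max_batch_size((40 + 256) * GIB, seq_len=4096, per_token_bytes=per_token)

print(per_token)    # 524288 bytes, i.e. 0.5 MiB per token
print(gpu_only)     # 20
print(with_remote)  # 148
```

Under these assumed numbers each 4096-token sequence needs 2 GiB of cache, so capacity, not compute, caps the batch at 20 sequences on the GPU alone. Whether the larger batch is actually usable depends on how well remote-access latency is hidden, which is the role the summary assigns to prefetching and compression.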

These results suggest that memory wall challenges in LLM deployment warrant heterogeneous solutions beyond traditional GPU upgrades. Organizations deploying large-scale LLM inference could adopt similar disaggregation strategies to optimize cost-performance tradeoffs. Further investigation into CXL maturity, interconnect standardization, and integration with diverse LLM architectures remains necessary for widespread adoption.

Scope and limitations

This summary is based on the study abstract and available metadata. It does not include a full analysis of the complete paper, supplementary materials, or underlying datasets unless explicitly stated. Findings should be interpreted in the context of the original publication.

Disclosure

  • Research title: CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
  • Authors: Dong Liu, Yang Yu
  • Institutions: Columbia University, Yale University
  • Publication date: 2026-02-05
  • DOI: https://doi.org/10.1145/3748173.3779188
  • Disclosure: This post was generated by Claude (Anthropic). The original authors did not write or review this post.
