Parallel Computing and Optimization Techniques
External reference: https://openalex.org/T10054
-
Adaptive CPU frequency scaling for energy-efficient and sustainable edge computing under renewable energy uncertainty
Deep reinforcement learning improves CPU frequency scaling for edge computing systems powered by renewable energy, reducing prediction error by 35% and optimizing the energy-latency trade-off.
-
Thoth: Uncovering Data-Dependent Memory Access Patterns via Annotation-Directed Load Sampling
Thoth hardware prefetcher improves performance on sparse data structures by tracking producer-consumer load pairs and using annotation-directed sampling to capture complex memory access patterns.
-
CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
System offloads key-value caches to remote FPGA memory using CXL interconnects, achieving 3.2× throughput gains and 2.8× memory cost reduction for datacenter LLM serving.
-
It’s about Time: Temporal Abstractions for Asynchronous GPU Tensor Computations
Framework for temporal abstractions that simplify coordination of asynchronous GPU tensor computations, reducing complexity and hardware-dependent errors in specialized concurrent execution.
-
SLAWS: Spatial Locality Analysis and Workload Orchestration for Sparse Matrix Multiplication
SLAWS framework enhances sparse matrix multiplication by analyzing data locality patterns and orchestrating workloads adaptively, overcoming limitations of fixed-architecture accelerators.
-
PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel
PAT optimizes LLM decode-phase attention by exploiting shared request prefixes and adaptive kernel tiling, reducing memory bandwidth bottlenecks in multi-request serving scenarios.
-
Queueing model reduces energy use in ternary optical computers
Study proposes a queueing-based service model to optimize energy consumption and performance in ternary optical computers through threshold-based scheduling.
-
Liger+ dynamically balances latency and throughput in large model inference
Distributed inference system using interleaved parallelism to dynamically balance latency-throughput trade-offs via task-aware batch management and strategic kernel scheduling across multiple GPUs.
-
WaSC decouples WASM system access with low startup and memory use
WaSC hardens WebAssembly sandboxes through system interface decoupling, achieving machine-level isolation while maintaining WASM performance advantages for serverless computing environments.
-
Integrating Quantum Software Tools with(in) MLIR
A practical guide for integrating quantum software tools using MLIR infrastructure, demonstrated through a case study connecting PennyLane and Munich Quantum Toolkit.

