In-Network Collective Operations: Game Changer or Challenge for AI Workloads?

2026-01-27T02:59:08Z

This paper summarizes the opportunities of in-network collective operations (INC) for accelerated collective operations in AI workloads. We provide sufficient detail to make this important field accessible to non-experts in AI or networking, fostering a connection between these communities. Consider two types of INC: Edge-INC, where the system is implemented at the node level, and Core-INC, where the system is embedded within network switches. We outline the potential performance benefits as well as six key obstacles in the context of both Edge-INC and Core-INC that may hinder their adoption. Finally, we present a set of predictions for the future development and application of INC.

Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents

2026-01-26T22:37:12Z

LLM-based agent applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs and latency due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agent applications where outputs depend on external data and environmental contexts. We propose Agentic Plan Caching (APC), a novel test-time memory that extracts, stores, adapts, and reuses structured plan templates from planning stages of agent applications across semantically similar tasks to reduce the cost and latency of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agent applications shows that our system can reduce costs by 50.31% and latency by 27.28% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.

KeyMemRT Compiler and Runtime: Unlocking Memory-Scalable FHE

2026-01-26T12:55:18Z

Fully Homomorphic Encryption (FHE) enables privacy preserving computation but it suffers from high latency and memory consumption. The computations are secured with special keys called rotation keys which often take up the majority of memory. In complex FHE applications, these rotation keys can cause a large memory bottleneck limiting program throughput. Existing compilers make little effort to solve this problem, instead relying on systems with massive memory availability. This resource requirement is a barrier to FHE uptake because optimizing FHE programs by hand is challenging due to their scale, complexity and expertise required. In this work, we present KeyMemRT; an MLIR based compiler and runtime framework that individually manages rotation key lifetimes to lower memory utilization and to allow arbitrary number of rotation indices to be supported without memory bloating. KeyMemRT relies on dataflow analysis to determine key lifetimes and is the first FHE compiler to provide automatic key management, handle fine-grained key-mangement and manage boostrap keys. We implement frontends for Orion and HEIR and show improvements over state-of-the-art FHE compilers. KeyMemRT achieves memory reduction of 1.74x and a speedup of 1.20x over ANT-ACE, and memory reduction of 1.16x and a speedup of 1.73x over memory-optimized compiler Fhelipe. We provide KeyMemRT as a post-optimizing compiler that can be targeted by any FHE compiler.

On the Bandwidth Consumption of Blockchains

2026-01-26T11:57:42Z

With the advent of blockchain technology, the number of proposals has boomed. The network traffic imposed by these blockchain proposals increases the cost of hosting nodes. Unfortunately, as of today, we are not aware of any comparative study of the bandwidth consumption of blockchains. In this paper, we propose the first empirical comparison of blockchain bandwidth consumption. To this end, we measure the network traffic of blockchain network nodes of five blockchain protocols: Algorand, Aptos, Avalanche, Redbelly and Solana. We study the variation over time, differentiate the receiving and sending traffic and analyze how this traffic varies with the number of nodes and validators. We conclude that the transport protocol is the main factor impacting the network traffic, segregating node roles helps reduce traffic and different blockchains are differently impacted by the network size.

Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

2026-01-26T02:45:04Z

High-performance attention kernels are essential for Large Language Models. This paper presents analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache miss. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing 50\% or greater reduction in L2 misses and up to 60\% increase in throughput on GB10.

GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

2026-01-24T06:17:59Z

SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all microbatches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B.

Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs

2026-01-23T22:54:56Z

The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model inference. We then extend this work to Intel GPUs where we design and implement mixed precision, 2-bit GEMM kernels, and show their performance to be close to optimal. We integrated our optimized Xe2 kernels in the vLLM framework as a quantization plugin and evaluated end-to-end LLM inference results for a range of LLM models and Xe2 GPUs. Depending on the model and platform, we see a 4x - 8x reduction in GEMM time compared to the BF16 case, and we get up to 6.3x speedup in end-to-end latency compared to the BF16 execution. Our optimized runtime advances the state of LLM inference on AI PCs and Intel Xe GPUs, paving the way for efficient deployment of ultra-low-bit LLM models.

DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers

2026-01-23T18:26:14Z

The rapid growth of Large Transformer-based models, specifically Large Language Models (LLMs), now scaling to trillions of parameters, has necessitated training across thousands of GPUs using complex hybrid parallelism strategies (e.g., data, tensor, and pipeline parallelism). Checkpointing this massive, distributed state is critical for a wide range of use cases, such as resilience, suspend-resume, investigating undesirable training trajectories, and explaining model evolution. However, existing checkpointing solutions typically treat model state as opaque binary blobs, ignoring the ``3D heterogeneity'' of the underlying data structures--varying by memory location (GPU vs. Host), number of ``logical'' objects sharded and split across multiple files, data types (tensors vs. Python objects), and their serialization requirements. This results in significant runtime overheads due to blocking device-to-host transfers, data-oblivious serialization, and storage I/O contention. In this paper, we introduce DataStates-LLM, a novel checkpointing architecture that leverages State Providers to decouple state abstraction from data movement. DataStates-LLM exploits the immutability of model parameters during the forward and backward passes to perform ``lazy'', non-blocking asynchronous snapshots. By introducing State Providers, we efficiently coalesce fragmented, heterogeneous shards and overlap the serialization of metadata with bulk tensor I/O. We evaluate DataStates-LLM on models up to 70B parameters on 256 A100-40GB GPUs. Our results demonstrate that DataStates-LLM achieves up to 4$\times$ higher checkpointing throughput and reduces end-to-end training time by up to 2.2$\times$ compared to state-of-the-art solutions, effectively mitigating the serialization and heterogeneity bottlenecks in extreme-scale LLM training.

ECCO: Evidence-Driven Causal Reasoning for Compiler Optimization

2026-01-23T01:23:20Z

Compiler auto-tuning faces a dichotomy between traditional black-box search methods, which lack semantic guidance, and recent Large Language Model (LLM) approaches, which often suffer from superficial pattern matching and causal opacity. In this paper, we introduce ECCO, a framework that bridges interpretable reasoning with combinatorial search. We first propose a reverse engineering methodology to construct a Chain-of-Thought dataset, explicitly mapping static code features to verifiable performance evidence. This enables the model to learn the causal logic governing optimization decisions rather than merely imitating sequences. Leveraging this interpretable prior, we design a collaborative inference mechanism where the LLM functions as a strategist, defining optimization intents that dynamically guide the mutation operations of a genetic algorithm. Experimental results on seven datasets demonstrate that ECCO significantly outperforms the LLVM opt -O3 baseline, achieving an average 24.44% reduction in cycles.

Class Confidence Aware Reweighting for Long Tailed Learning

2026-01-22T12:58:05Z

Deep neural network models degrade significantly in the long-tailed data distribution, with the overall training data dominated by a small set of classes in the head, and the tail classes obtaining less training examples. Addressing the imbalance in the classes, attention in the related literature was given mainly to the adjustments carried out in the decision space in terms of either corrections performed at the logit level in order to compensate class-prior bias, with the least attention to the optimization process resulting from the adjustments introduced through the differences in the confidences among the samples. In the current study, we present the design of a class and confidence-aware re-weighting scheme for long-tailed learning. This scheme is purely based upon the loss level and has a complementary nature to the existing methods performing the adjustment of the logits. In the practical implementation stage of the proposed scheme, we use an Ω(p_t, f_c) function. This function enables the modulation of the contribution towards the training task based upon the confidence value of the prediction, as well as the relative frequency of the corresponding class. Our observations in the experiments are corroborated by significant experimental results performed on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various values of imbalance factors that clearly authenticate the theoretical discussions above.

Evaluating Compiler Optimization Impacts on zkVM Performance

2026-01-22T10:25:30Z

Zero-knowledge proofs (ZKPs) are the cornerstone of programmable cryptography. They enable (1) privacy-preserving and verifiable computation across blockchains, and (2) an expanding range of off-chain applications such as credential schemes. Zero-knowledge virtual machines (zkVMs) lower the barrier by turning ZKPs into a drop-in backend for standard compilation pipelines. This lets developers write proof-generating programs in conventional languages (e.g., Rust or C++) instead of hand-crafting arithmetic circuits. However, these VMs inherit compiler infrastructures tuned for traditional architectures rather than for proof systems. In particular, standard compiler optimizations assume features that are absent in zkVMs, including cache locality, branch prediction, or instruction-level parallelism. Therefore, their impact on proof generation is questionable. We present the first systematic study of the impact of compiler optimizations on zkVMs. We evaluate 64 LLVM passes, six standard optimization levels, and an unoptimized baseline across 58 benchmarks on two RISC-V-based zkVMs (RISC Zero and SP1). While standard LLVM optimization levels do improve zkVM performance (over 40\%), their impact is far smaller than on traditional CPUs, since their decisions rely on hardware features rather than proof constraints. Guided by a fine-grained pass-level analysis, we~\emph{slightly} refine a small set of LLVM passes to be zkVM-aware, improving zkVM execution time by up to 45\% (average +4.6\% on RISC Zero, +1\% on SP1) and achieving consistent proving-time gains. Our work highlights the potential of compiler-level optimizations for zkVM performance and opens new direction for zkVM-specific passes, backends, and superoptimizers.

Reliability by design: quantifying and eliminating fabrication risk in LLMs. From generative to consultative AI: a comparative analysis in the legal domain and lessons for high-stakes knowledge bases

2026-01-21T21:26:42Z

This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models ("creative oracle"), (2) basic retrieval-augmented systems ("expert archivist"), and (3) an advanced, end-to-end optimized RAG system ("rigorous archivist"). The authors introduce two reliability metrics -False Citation Rate (FCR) and Fabricated Fact Rate (FFR)- and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes that trustworthy legal AI requires rigor-focused, retrieval-based architectures emphasizing verification and traceability, and provides an evaluation framework applicable to other high-risk domains.

Task-parallelism in SWIFT for heterogeneous compute architectures

2026-01-21T13:03:59Z

This paper highlights first steps towards enabling graphics processing unit (GPU) acceleration of the task-parallel smoothed particle hydrodynamics (SPH) solver SWIFT. Novel combinations of algorithms are presented, enabling SWIFT to function as a truly heterogeneous software leveraging task-parallelism on CPUs for memory-bound computations concurrently with GPUs for compute-bound computations while minimising the effects of CPU-GPU communication latency. The proposed algorithms are validated in extensive testing. The GPU acceleration methodology is shown to deliver up to 3.5 and 7.5 speedups for the offloaded computations when including and excluding the time required to prepare and post-process data transfers on the CPU side, respectively. The overall performance of the GPU-accelerated hydrodynamic solver for a full simulation on a single Grace-Hopper superchip is 1.8 times faster compared to the superchips fully parallelised CPU capabilities. This constitutes an improvement from 8 million particle updates/s for the full CPU-only baseline (115,000 updates per CPU core) to 15 million updates/s for the GPU-accelerated SPH solver. Moreover, it displays near-perfect strong scaling on 4 Grace-Hopper nodes. The GPU-acceleration is also demonstrated to give a 29 percent improvement in energy efficiency in comparison to CPU-only baselines. Finally, inter-influential bottlenecks in the prototype solver presented in this work are identified: A significant amount of time (up to 80 percent) of a GPU-offloading cycle is spent on preparing and post-processing particle data on the CPU for the transfer to and from the GPU, respectively. Approaches are suggested to minimise their effects and maximise the solver's performance in our future work.

SynPerf: A Hybrid Analytical-ML Framework for GPU Performance Prediction

2026-01-21T11:47:56Z

The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely generalizable GPU performance models to support next-generation hardware selection and system-level exploration. However, current data-driven methods are limited, exhibiting poor generalization across hardware and inadequate modeling of complex production-level kernels common in modern inference stacks. To address these issues, we present SyncPerf, a unified GPU modeling framework. This approach first employs an analytical model to quantify a given kernel's demands on the GPU's heterogeneous instruction pipelines. These analytical features are then fed into a machine learning (ML) model to capture complex cross-pipeline interactions and resource dependencies, enabling high-fidelity performance prediction. Our evaluation across 11 GPU types from four generations of major architectures on two widely-used serving systems demonstrates that SyncPerf delivers high fidelity and strong generalizability. It achieves accurate predictions, with only 6.1% average error at the kernel level and 8.5% for end-to-end inference -- reducing the error of state-of-the-art methods by 6.7x and 4.4x, respectively. We also demonstrate SynPerf's value "beyond simulation" by utilizing its performance ceiling to diagnose implementation shortcomings and guide the optimization of a production fused MoE Triton kernel, achieving up to 1.7x speedup.

AFLL: Real-time Load Stabilization for MMO Game Servers Based on Circular Causality Learning

2026-01-21T03:41:03Z

Massively Multiplayer Online (MMO) game servers must handle thousands of simultaneous players while maintaining sub-100ms response times. When server load exceeds capacity, traditional approaches either uniformly throttle all message types regardless of importance (damaging gameplay) or apply fixed heuristic rules that fail to adapt to dynamic workloads. This paper presents AFLL (Adaptive Feedback Loop Learning), a real-time load stabilization system that learns the causal relationship between outgoing server messages and subsequent incoming client requests. AFLL employs backpropagation to continuously adjust message type weights, enabling predictive throttling that blocks low-priority messages before overload occurs while guaranteeing critical message delivery. Through controlled experiments with 1,000 concurrent players, AFLL reduced average CPU time by 48.3% (13.2ms to 6.8ms), peak CPU time by 51.7% (54.0ms to 26.1ms), and thread contention by 64.4% (19.6% to 7.0%), while maintaining zero learning overhead through background computation and caching optimizations. The system achieved remarkable reproducibility (CV < 2% across all metrics) and identified a three-stage causal chain linking message blocking to load reduction. AFLL demonstrates that circular causality learning enables practical real-time adaptation for latency-critical systems.