https://arxiv.org/api/Yrn9zsTKmacZH5daHamO+1VVqlE2026-03-26T14:12:04Z507718015http://arxiv.org/abs/2601.04719v1GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models2026-01-08T08:35:56ZThe key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4$\times$ memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants -- naive, tiled, coarsened, and vectorized -- and benchmark them across realistic workload sizes up to 1 billion elements. Our vectorized kernel achieves up to 1,694$\times$ speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1 even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference with negligible computational overhead (6--58ms) and minimal impact on downstream model behavior2026-01-08T08:35:56ZMaanas TanejaPurab Shingvihttp://arxiv.org/abs/2601.04707v1MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training2026-01-08T08:19:47ZGraph Neural Networks (GNNs) are powerful tools for learning graph-structured data, but their scalability is hindered by inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks fail to overlap these stages, leading to suboptimal resource utilization. This paper proposes MQ-GNN, a multi-queue pipelined framework that maximizes training efficiency by interleaving GNN training stages and optimizing resource utilization. MQ-GNN introduces Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization. Additionally, it employs global neighbor sampling with caching to reduce data transfer overhead and an adaptive queue-sizing strategy to balance computation and memory efficiency. Experiments on four large-scale datasets and ten baseline models demonstrate that MQ-GNN achieves up to \boldmath $\bm{4.6\,\times}$ faster training time and 30% improved GPU utilization while maintaining competitive accuracy. These results establish MQ-GNN as a scalable and efficient solution for multi-GPU GNN training.2026-01-08T08:19:47ZIEEE Access 2025Irfan UllahYoung-Koo Lee10.1109/ACCESS.2025.3539976http://arxiv.org/abs/2601.04545v1Personalized Model-Based Design of Human Centric AI enabled CPS for Long term usage2026-01-08T03:17:59ZHuman centric critical systems are increasingly involving artificial intelligence to enable knowledge extraction from sensor collected data. Examples include medical monitoring and control systems, gesture based human computer interaction systems, and autonomous cars. Such systems are intended to operate for a long term potentially for a lifetime in many scenarios such as closed loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoting systems for stroke diagnosis, and rehabilitation. Long term operation of such AI enabled human centric applications can expose them to corner cases for which their operation is may be uncertain. This can be due to many reasons such as inherent flaws in the design, limited resources for testing, inherent computational limitations of the testing methodology, or unknown use cases resulting from human interaction with the system. Such untested corner cases or cases for which the system performance is uncertain can lead to violations in the safety, sustainability, and security requirements of the system. In this paper, we analyze the existing techniques for safety, sustainability, and security analysis of an AI enabled human centric control system and discuss their limitations for testing the system for long term use in practice. We then propose personalized model based solutions for potentially eliminating such limitations.2026-01-08T03:17:59ZBernard NgabonzizaAyan BanerjeeSandeep K. S. Guptahttp://arxiv.org/abs/2601.03159v1Rapid Augmentations for Time Series (RATS): A High-Performance Library for Time Series Augmentation2026-01-06T16:33:51ZTime series augmentation is critical for training robust deep learning models, particularly in domains where labelled data is scarce and expensive to obtain. However, existing augmentation libraries for time series, mainly written in Python, suffer from performance bottlenecks, where running time grows exponentially as dataset sizes increase -- an aspect limiting their applicability in large-scale, production-grade systems. We introduce RATS (Rapid Augmentations for Time Series), a high-performance library for time series augmentation written in Rust with Python bindings (RATSpy). RATS implements multiple augmentation methods spanning basic transformations, frequency-domain operations and time warping techniques, all accessible through a unified pipeline interface with built-in parallelisation. Comprehensive benchmarking of RATSpy versus a commonly used library (tasug) on 143 datasets demonstrates that RATSpy achieves an average speedup of 74.5\% over tsaug (up to 94.8\% on large datasets), with up to 47.9\% less peak memory usage.2026-01-06T16:33:51ZWadie SkafFelix KernAryamaan Basu RoyTejas PradhanRoman KalkreuthHolger Hooshttp://arxiv.org/abs/2601.02735v1Scalable Tree Ensemble Proximities in Python2026-01-06T05:57:43ZTree ensemble methods such as Random Forests naturally induce supervised similarity measures through their decision tree structure, but existing implementations of proximities derived from tree ensembles typically suffer from quadratic time or memory complexity, limiting their scalability. In this work, we introduce a general framework for efficient proximity computation by defining a family of Separable Weighted Leaf-Collision Proximities. We show that any proximity measure in this family admits an exact sparse matrix factorization, restricting computation to leaf-level collisions and avoiding explicit pairwise comparisons. This formulation enables low-memory, scalable proximity computation using sparse linear algebra in Python. Empirical benchmarks demonstrate substantial runtime and memory improvements over traditional approaches, allowing tree ensemble proximities to scale efficiently to datasets with hundreds of thousands of samples on standard CPU hardware.2026-01-06T05:57:43ZAdrien AumonGuy WolfKevin R. MoonJake S. Rhodeshttp://arxiv.org/abs/2503.18773v3BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache2026-01-05T18:08:27ZThe growth of long-context Large Language Models (LLMs) significantly increases memory and bandwidth pressure during autoregressive decoding due to the expanding Key-Value (KV) cache. While accuracy-preserving KV-cache quantization (e.g., 4-bit or 2-bit) reduces memory footprint, existing systems decode inefficiently by relying solely on CUDA cores, underutilizing Tensor Cores-the dominant compute resource on GPUs.
We present BitDecoding, the first inference system to efficiently decode low-bit KV caches by cooperatively leveraging CUDA cores and Tensor Cores. BitDecoding smartly induces Tensor-Core-friendly layouts, introduces warp-level dequantization parallelism, and provides unified system support through query transformation, high-performance tensor- and channel-wise quantization, and a software-pipelined dequantization kernel enabling mixed-precision execution. Architecture-aware optimizations further leverage Hopper's warpgroup tensor instructions and Blackwell's NVFP4 (MXFP4) tensor formats.
Evaluated on Blackwell, Hopper, and Ampere GPUs, BitDecoding achieves an average 7.5x decoding speedup over FP16 FlashDecoding-v2, up to 8.6x on Blackwell with NVFP4, and up to 4.3x over state-of-the-art approaches. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x. BitDecoding is open-sourced at https://github.com/OpenBitSys/BitDecoding.2025-03-24T15:22:41ZDayou DuShijie CaoJianyi ChengLuo MaiTing CaoMao Yanghttp://arxiv.org/abs/2601.01288v1PyBatchRender: A Python Library for Batched 3D Rendering at Up to One Million FPS2026-01-03T21:19:57ZReinforcement learning from pixels is often bottlenecked by the performance and complexity of 3D rendered environments. Researchers face a trade-off between high-speed, low-level engines and slower, more accessible Python frameworks. To address this, we introduce PyBatchRender, a Python library for high-throughput, batched 3D rendering that achieves over 1 million FPS on simple scenes. Built on the Panda3D game engine, it utilizes its mature ecosystem while enhancing performance through optimized batched rendering for up to 1000X speedups. Designed as a physics-agnostic renderer for reinforcement learning from pixels, PyBatchRender offers greater flexibility than dedicated libraries, simpler setup than typical game-engine wrappers, and speeds rivaling state-of-the-art C++ engines like Madrona. Users can create custom scenes entirely in Python with tens of lines of code, enabling rapid prototyping for scalable AI training. Open-source and easy to integrate, it serves to democratize high-performance 3D simulation for researchers and developers. The library is available at https://github.com/dolphin-in-a-coma/PyBatchRender.2026-01-03T21:19:57ZEvgenii RudakovJonathan ShockBenjamin Ultan Cowleyhttp://arxiv.org/abs/2601.00974v1Improving the Graph Challenge Reference Implementation2026-01-02T19:51:12ZThe MIT/IEEE/Amazon Graph Challenge provides a venue for individuals and teams to showcase new innovations in large-scale graph and sparse data analysis. The Anonymized Network Sensing Graph Challenge processes over 100 billion network packets to construct privacy-preserving traffic matrices, with a GraphBLAS reference implementation demonstrating how hypersparse matrices can be applied to this problem. This work presents a refactoring and benchmarking of a section of the reference code to improve clarity, adaptability, and performance. The original Python implementation spanning approximately 1000 lines across 3 files has been streamlined to 325 lines across two focused modules, achieving a 67% reduction in code size while maintaining full functionality. Using pMatlab and pPython distributed array programming libraries, the addition of parallel maps allowed for parallel benchmarking of the data. Scalable performance is demonstrated for large-scale summation and analysis of traffic matrices. The resulting implementation increases the potential impact of the Graph Challenge by providing a clear and efficient foundation for participants.2026-01-02T19:51:12ZPresented at IEEE MIT URTC 2025Inna VoloshchukHayden JananthanChansup ByunJeremy Kepnerhttp://arxiv.org/abs/2512.24530v2A Magnified View into Heterogeneous-ISA Thread Migration Performance without State Transformation2026-01-02T17:24:46ZHeterogeneous-ISA processor designs have attracted considerable research interest. However, unlike their homogeneous-ISA counterparts, explicit software support for bridging ISA heterogeneity is required. The lack of a compilation toolchain ready to support heterogeneous-ISA targets has been a major factor hindering research in this exciting emerging area. For any such compiler, "getting right" the mechanics involved in state transformation upon migration and doing this efficiently is of critical importance. In particular, any runtime conversion of the current program stack from one architecture to another would be prohibitively expensive. In this paper, we design and develop Unifico, a new multi-ISA compiler that generates binaries that maintain the same stack layout during their execution on either architecture. Unifico avoids the need for runtime stack transformation, thus eliminating overheads associated with ISA migration. Additional responsibilities of the Unifico compiler backend include maintenance of a uniform ABI and virtual address space across ISAs. Unifico is implemented using the LLVM compiler infrastructure, and we are currently targeting the x86-64 and ARMv8 ISAs. We have evaluated Unifico across a range of compute-intensive NAS benchmarks and show its minimal impact on overall execution time, where less than 6% (10%) overhead is introduced on average for high-end (low-end) processors. We also analyze the performance impact of Unifico's key design features and demonstrate that they can be further optimized to mitigate this impact. When compared against the state-of-the-art Popcorn compiler, Unifico reduces binary size overhead from ~200% to ~10%, whilst eliminating the stack transformation overhead during ISA migration.2025-12-31T00:24:44ZEdit: Removed bogus journal footnoteNikolaos MavrogeorgisUniversity of Edinburgh, United KingdomChristos VasiladiotisUniversity of Edinburgh, United KingdomPei MuUniversity of Edinburgh, United KingdomAmir KhordadiUniversity of Edinburgh, United KingdomBjörn FrankeUniversity of Edinburgh, United KingdomAntonio BarbalaceUniversity of Edinburgh, United Kingdomhttp://arxiv.org/abs/2601.07841v1Post-Quantum Cryptography Key Expansion Method and Anonymous Certificate Scheme Based on NTRU2026-01-02T00:18:54ZNTRU is one of the important lattice-based post-quantum cryptography methods, offering resistance against quantum computing attacks. However, a drawback of NTRU lies in its relatively low efficiency in generating key pairs. Therefore, this study proposes an NTRU-based key expansion method that enables efficient public key expansion. Furthermore, the proposed method is applied to an anonymous certificate scheme, allowing an end entity to generate a key pair only once, after which the certificate authority can expand multiple distinct public keys for anonymity. The experimental results demonstrate that the proposed key expansion method achieves significantly higher efficiency than key pair generation.2026-01-02T00:18:54Zin Chinese languageAbel C. H. Chenhttp://arxiv.org/abs/2601.11584v1Cost-Aware Logging: Measuring the Financial Impact of Excessive Log Retention in Small-Scale Cloud Deployments2026-01-01T17:33:52ZLog data plays a critical role in observability, debugging, and performance monitoring in modern cloud-native systems. In small and early-stage cloud deployments, however, log retention policies are frequently configured far beyond operational requirements, often defaulting to 90 days or more, without explicit consideration of their financial and performance implications. As a result, excessive log retention becomes a hidden and recurring cost.
This study examines the financial and operational impact of log retention window selection from a cost-aware perspective. Using synthetic log datasets designed to reflect real-world variability in log volume and access patterns, we evaluate retention windows of 7, 14, 30, and 90 days. The analysis focuses on three metrics: storage cost, operationally useful log ratio, and cost per useful log. Operational usefulness is defined as log data accessed during simulated debugging and incident analysis tasks.
The results show that reducing log retention from 90 days to 14 days can lower log storage costs by up to 78 percent while preserving more than 97 percent of operationally useful logs. Longer retention windows provide diminishing operational returns while disproportionately increasing storage cost and query overhead. These findings suggest that modest configuration changes can yield significant cost savings without compromising system reliability.
Rather than proposing new logging mechanisms, this work offers a lightweight and accessible framework to help small engineering teams reason about log retention policies through a cost-effectiveness lens. The study aims to encourage more deliberate observability configurations, particularly in resource-constrained cloud environments.2026-01-01T17:33:52Z8 pages, 3 tables. Simulation-based study on cost-aware log retentionJody Almaida Putrahttp://arxiv.org/abs/2512.13886v2OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction2025-12-31T02:49:31ZPost-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes a 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set a new state-of-the-art accuracy-efficiency trade-offs for one-shot post-training pruning.2025-12-15T20:41:29ZMohammad MozaffariSamuel KushnirMaryam Mehri DehnaviAmir Yazdanbakhshhttp://arxiv.org/abs/2410.08618v3SwitchFS: Asynchronous Metadata Updates for Distributed Filesystems with In-Network Coordination2025-12-30T14:15:18ZDistributed filesystem metadata updates are typically synchronous. This creates inherent challenges for access efficiency, load balancing, and directory contention, especially under dynamic and skewed workloads. This paper argues that synchronous updates are overly conservative. We propose SwitchFS with asynchronous metadata updates that allow operations to return early and defer directory updates until reads, both hiding latency and amortizing overhead. The key challenge lies in efficiently maintaining the synchronous POSIX semantics of metadata updates. To address this, SwitchFS is co-designed with a programmable switch, leveraging the limited on-switch resources to track directory states with negligible overhead. This allows SwitchFS to aggregate and apply delayed updates efficiently, using batching and consolidation before directory reads. Evaluation shows that SwitchFS achieves up to 13.34$\times$ and 3.85$\times$ higher throughput, and 61.6% and 57.3% lower latency than two state-of-the-art distributed filesystems, Emulated-InfiniFS and Emulated-CFS, respectively, under skewed workloads. For real-world workloads, SwitchFS improves end-to-end throughput by 21.1$\times$, 1.1$\times$, and 0.3$\times$ over CephFS, Emulated-InfiniFS, and Emulated-CFS, respectively.2024-10-11T08:33:58ZAccepted by EuroSys'26Jingwei XuMingkai DongQiulin TianZiyi TianTong XinHaibo Chenhttp://arxiv.org/abs/2512.23494v1Optimal Configuration of API Resources in Cloud Native Computing2025-12-29T14:34:33ZThis paper presents how an existing framework for offline performance optimization can be applied to microservice applications during the Release phase of the DevOps life cycle. Optimization of resource allocation configuration parameters for CPU and memory during the Release phase remains a largely unexplored problem as most research has focused on intelligent scheduling and autoscaling of microservices during the Ops stage of the DevOps cycle. Yet horizontal auto-scaling of containers, based on CPU usage for instance, may still leave these containers with an inappropriately allocated amount of memory, if no upfront fine-tuning of both resources is applied before the Deployment phase. We evaluate the performance optimization framework using the TeaStore microservice application and statistically compare different optimization algorithms, supporting informed decisions about their trade-offs between sampling cost and distance to the optimal resource configuration. This shows that upfront factor screening, for reducing the search space, is helpful when the goal is to find the optimal resource configuration with an affordable sampling budget. When the goal is to statistically compare different algorithms, screening must also be applied to make data collection of all data points in the search space feasible. If the goal is to find a near-optimal configuration, however, it is better to run bayesian optimization without screening. 2025-12-29T14:34:33ZIn Proceedings WACA 2025, arXiv:2512.22054EPTCS 438, 2025, pp. 15-39Eddy TruyenDistriNet, KU Leuven, 3001 Leuven, BelgiumWouter JoosenDistriNet, KU Leuven, 3001 Leuven, Belgium10.4204/EPTCS.438.2http://arxiv.org/abs/2512.23434v1Local Rendezvous Hashing: Bounded Loads and Minimal Churn via Cache-Local Candidates2025-12-29T12:52:57ZConsistent hashing is fundamental to distributed systems, but ring-based schemes can exhibit high peak-to-average load ratios unless they use many virtual nodes, while multi-probe methods improve balance at the cost of scattered memory accesses. This paper introduces Local Rendezvous Hashing (LRH), which preserves a token ring but restricts Highest Random Weight (HRW) selection to a cache-local window of C distinct neighboring physical nodes. LRH locates a key by one binary search, enumerates exactly C distinct candidates using precomputed next-distinct offsets, and chooses the HRW winner (optionally weighted). Lookup cost is O(log|R| + C). Under fixed-topology liveness changes, fixed-candidate filtering remaps only keys whose original winner is down, yielding zero excess churn. In a benchmark with N=5000, V=256 (|R|=1.28M), K=50M and C=8, LRH reduces Max/Avg load from 1.2785 to 1.0947 and achieves 60.05 Mkeys/s, about 6.8x faster than multi-probe consistent hashing with 8 probes (8.80 Mkeys/s) while approaching its balance (Max/Avg 1.0697). A microbenchmark indicates multi-probe assignment is dominated by repeated ring searches and memory traffic rather than probe-generation arithmetic.2025-12-29T12:52:57Z14 pages, 10 figures. Includes appendicesYongjie GuanZhejiang University of Technology