https://arxiv.org/api/lSD2JMxGmY1+IzyVD4EI+zwGYdU2026-03-26T11:05:47Z507715015http://arxiv.org/abs/2411.02933v2P-MOSS: Scheduling Main-Memory Indexes Over NUMA Servers Using Next Token Prediction2026-01-21T03:07:41ZEver since the Dennard scaling broke down in the early 2000s and the frequency of the CPUs stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA and Chiplet processors. Since then, the heterogeneity in the design space of hardware has only increased to the point that DBMS performance may vary significantly up to an order of magnitude in modern servers. An important factor that affects performance includes the location of the logical cores where the DBMS queries execute, and the location where the data resides. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution to specific logical cores, and co-locates data on the corresponding NUMA node. For cross-hardware and workload adaptability, P-MOSS leverages core principles from Large Language Models, such as Next Token prediction, Generative Pre-training, and Fine-tuning. In the spirit of hardware-software synergy, P-MOSS guides its scheduling decision solely based on the low-level hardware statistics collected from the hardware Performance Monitoring Unit with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B$^+$-Tree index. Performance results demonstrate that P-MOSS offers an improvement of up to $6\times$ over traditional schedules in terms of query throughput.2024-11-05T09:23:27ZAccepted to SIGMOD'26Yeasir RayhanWalid G. Aref10.1145/3786675http://arxiv.org/abs/2601.14612v1Exploiting Spot Instances for Time-Critical Cloud Workloads Using Optimal Randomized Strategies2026-01-21T03:01:00ZThis paper addresses the challenge of deadline-aware online scheduling for jobs in hybrid cloud environments, where jobs may run on either cost-effective but unreliable spot instances or more expensive on-demand instances, under hard deadlines. We first establish a fundamental limit for existing (predominantly-) deterministic policies, proving a worst-case competitive ratio of $Ω(K)$, where $K$ is the cost ratio between on-demand and spot instances. We then present a novel randomized scheduling algorithm, ROSS, that achieves a provably optimal competitive ratio of $\sqrt{K}$ under reasonable deadlines, significantly improving upon existing approaches. Extensive evaluations on real-world trace data from Azure and AWS demonstrate that ROSS effectively balances cost optimization and deadline guarantees, consistently outperforming the state-of-the-art by up to $30\%$ in cost savings, across diverse spot market conditions.2026-01-21T03:01:00ZAccepted for publication in the 45th IEEE International Conference on Computer Communications (INFOCOM 2026). Copyright 2026 IEEENeelkamal BhuyanRandeep BhatiaMurali KodialamTV Lakshmanhttp://arxiv.org/abs/2601.13345v1FlipFlop: A Static Analysis-based Energy Optimization Framework for GPU Kernels2026-01-19T19:30:25ZArtificial Intelligence (AI) applications, such as Large Language Models, are primarily driven and executed by Graphics Processing Units (GPUs). These GPU programs (kernels) consume substantial amounts of energy, yet software developers often lack the hardware expertise and ad hoc knowledge required to optimize for power efficiency. We propose FlipFlop, a framework using static code analysis to predict energy consumption and recommend Pareto-optimal thread block configurations considering both power consumption and execution time. Our framework requires no runtime execution and analyzes PTX code, a low-level instruction set for CUDA-enabled GPUs. It is validated across a diverse set of GPUs and kernels, including multi-head attention, convolution, and matrix multiplication. FlipFlop achieves 83% accuracy in identifying locally optimal energy-efficient configurations, while also minimizing developer effort by reducing the optimization search space by 93.4%. For multi-head attention kernels, it yields up to 79% energy savings and 106% throughput gains relative to NVIDIA's occupancy heuristic. By integrating static analysis with real-time monitoring and providing explainable optimization guidance, FlipFlop empowers developers to create sustainable, high-performance GPU software which minimizes environmental and computational costs.2026-01-19T19:30:25ZSaurabhsingh RajputAlexander BrandtVadim ElisseevTushar Sharmahttp://arxiv.org/abs/2601.13220v1The Energy-Throughput Trade-off in Lossless-Compressed Source Code Storage2026-01-19T16:50:21ZRetrieving data from large-scale source code archives is vital for AI training, neural-based software analysis, and information retrieval, to cite a few. This paper studies and experiments with the design of a compressed key-value store for the indexing of large-scale source code datasets, evaluating its trade-off among three primary computational resources: (compressed) space occupancy, time, and energy efficiency. Extensive experiments on a national high-performance computing infrastructure demonstrate that different compression configurations yield distinct trade-offs, with high compression ratios and order-of-magnitude gains in retrieval throughput and energy efficiency. We also study data parallelism and show that, while it significantly improves speed, scaling energy efficiency is more difficult, reflecting the known non-energy-proportionality of modern hardware and challenging the assumption of a direct time-energy correlation. This work streamlines automation in energy-aware configuration tuning and standardized green benchmarking deployable in CI/CD pipelines, thus empowering system architects with a spectrum of Pareto-optimal energy-compression-throughput trade-offs and actionable guidelines for building sustainable, efficient storage backends for massive open-source code archival.2026-01-19T16:50:21Z8 pages, 5 figures. Camera-ready version for Greenvolve 2026 co-located at IEEE SANER 2026Paolo FerraginaFrancesco Tosonihttp://arxiv.org/abs/2601.12266v1Opportunistic Scheduling for Optimal Spot Instance Savings in the Cloud2026-01-18T05:13:45ZWe study the problem of scheduling delay-sensitive jobs over spot and on-demand cloud instances to minimize average cost while meeting an average delay constraint. Jobs arrive as a general stochastic process, and incur different costs based on the instance type. This work provides the first analytical treatment of this problem using tools from queuing theory, stochastic processes, and optimization. We derive cost expressions for general policies, prove queue length one is optimal for low target delays, and characterize the optimal wait-time distribution. For high target delays, we identify a knapsack structure and design a scheduling policy that exploits it. An adaptive algorithm is proposed to fully utilize the allowed delay, and empirical results confirm its near-optimality.2026-01-18T05:13:45ZAccepted for publication in the 45th IEEE International Conference on Computer Communications (INFOCOM 2026). Copyright 2026 IEEENeelkamal BhuyanRandeep BhatiaMurali KodialamTV Lakshmanhttp://arxiv.org/abs/2601.14302v1DDSA: Dual-Domain Strategic Attack for Spatial-Temporal Efficiency in Adversarial Robustness Testing2026-01-18T03:14:22ZImage transmission and processing systems in resource-critical applications face significant challenges from adversarial perturbations that compromise mission-specific object classification. Current robustness testing methods require excessive computational resources through exhaustive frame-by-frame processing and full-image perturbations, proving impractical for large-scale deployments where massive image streams demand immediate processing. This paper presents DDSA (Dual-Domain Strategic Attack), a resource-efficient adversarial robustness testing framework that optimizes testing through temporal selectivity and spatial precision. We introduce a scenario-aware trigger function that identifies critical frames requiring robustness evaluation based on class priority and model uncertainty, and employ explainable AI techniques to locate influential pixel regions for targeted perturbation. Our dual-domain approach achieves substantial temporal-spatial resource conservation while maintaining attack effectiveness. The framework enables practical deployment of comprehensive adversarial robustness testing in resource-constrained real-time applications where computational efficiency directly impacts mission success.2026-01-18T03:14:22ZPreprint accepted by ICASSP 2026 with minor revisionsJinwei HuShiyuan MengYi DongXiaowei Huanghttp://arxiv.org/abs/2512.23236v3KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta2026-01-16T22:31:23ZMaking deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with selection policy, universal operator, fitness function, and termination rule, dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.2025-12-29T06:31:55ZGang LiaoHongsen QinYing WangAlicia GoldenMichael KuchnikYavuz YetimJia Jiunn AngChunli FuYihan HeSamuel HsiaZewei JiangDianshi LiUladzimir PashkevichVarna PuvvadaFeng ShiMatt SteinerRuichao XiaoNathan YanXiayu YuZhou FangRoman LevensteinKunming HoHaishan ZhuAlec HammondRichard LiAjit MathewsKaustubh GondkarAbdul Zainul-AbedinKetan SinghHongtao YuWenyuan ChiBarney HuangSean ZhangNoah WellerZach MarineWyatt CookCarole-Jean WuGaoxiang Liuhttp://arxiv.org/abs/2601.11352v1Offline Reinforcement-Learning-Based Power Control for Application-Agnostic Energy Efficiency2026-01-16T15:00:17ZEnergy efficiency has become an integral aspect of modern computing infrastructure design, impacting the performance, cost, scalability, and durability of production systems. The incorporation of power actuation and sensing capabilities in CPU designs is indicative of this, enabling the deployment of system software that can actively monitor and adjust energy consumption and performance at runtime. While reinforcement learning (RL) would seem ideal for the design of such energy efficiency control systems, online training presents challenges ranging from the lack of proper models for setting up an adequate simulated environment, to perturbation (noise) and reliability issues, if training is deployed on a live system.
In this paper we discuss the use of offline reinforcement learning as an alternative approach for the design of an autonomous CPU power controller, with the goal of improving the energy efficiency of parallel applications at runtime without unduly impacting their performance. Offline RL sidesteps the issues incurred by online RL training by leveraging a dataset of state transitions collected from arbitrary policies prior to training.
Our methodology applies offline RL to a gray-box approach to energy efficiency, combining online application-agnostic performance data (e.g., heartbeats) and hardware performance counters to ensure that the scientific objectives are met with limited performance degradation. Evaluating our method on a variety of compute-bound and memory-bound benchmarks and controlling power on a live system through Intel's Running Average Power Limit, we demonstrate that such an offline-trained agent can substantially reduce energy consumption at a tolerable performance degradation cost.2026-01-16T15:00:17Z11 pages, 5 figures, 3 tables and unpublishedAkhilesh RajSwann PerarnauAniruddha GokhaleSolomon Bekele Aberahttp://arxiv.org/abs/2601.10874v1Balanced allocation: considerations from large scale service environments2026-01-15T22:00:57ZWe study d-way balanced allocation, which assigns each incoming job to the lightest loaded among d randomly chosen servers. While prior work has extensively studied the performance of the basic scheme, there has been less published work on adapting this technique to many aspects of large-scale systems. Based on our experience in building and running planet-scale cloud applications, we extend the understanding of d-way balanced allocation along the following dimensions:
(i) Bursts: Events such as breaking news can produce bursts of requests that may temporarily exceed the servicing capacity of the system. Thus, we explore what happens during a burst and how long it takes for the system to recover from such bursts. (ii) Priorities: Production systems need to handle jobs with a mix of priorities (e.g., user facing requests may be high priority while other requests may be low priority). We extend d-way balanced allocation to handle multiple priorities. (iii) Noise: Production systems are often typically distributed and thus d-way balanced allocation must work with stale or incorrect information. Thus we explore the impact of noisy information and their interactions with bursts and priorities.
We explore the above using both extensive simulations and analytical arguments. Specifically we show, (i) using simulations, that d-way balanced allocation quickly recovers from bursts and can gracefully handle priorities and noise; and (ii) that analysis of the underlying generative models complements our simulations and provides insight into our simulation results.2026-01-15T22:00:57ZAmer DiwanPrabhakar RaghavanEli Upfalhttp://arxiv.org/abs/2601.10572v1Long-term Monitoring of Kernel and Hardware Events to Understand Latency Variance2026-01-15T16:39:02ZThis paper presents our experience to understand latency variance caused by kernel and hardware events, which are often invisible at the application level. For this purpose, we have built VarMRI, a tool chain to monitor and analyze those events in the long term. To mitigate the "big data" problem caused by long-term monitoring, VarMRI selectively records a subset of events following two principles: it only records events that are affecting the requests recorded by the application; it records coarse-grained information first and records additional information only when necessary. Furthermore, VarMRI introduces an analysis method that is efficient on large amount of data, robust on different data set and against missing data, and informative to the user.
VarMRI has helped us to carry out a 3,000-hour study of six applications and benchmarks on CloudLab. It reveals a wide variety of events causing latency variance, including interrupt preemption, Java GC, pipeline stall, NUMA balancing etc.; simple optimization or tuning can reduce tail latencies by up to 31%. Furthermore, the impacts of some of these events vary significantly across different experiments, which confirms the necessity of long-term monitoring.2026-01-15T16:39:02ZFang ZhouYuyang HuangMiao YuSixiang MaTongping LiuYang Wanghttp://arxiv.org/abs/2601.10041v1Emergency Department Patient Flow Optimization with an Alternative Care Threshold Policy2026-01-15T03:38:24ZEmergency department (ED) overcrowding and patient boarding represent critical systemic challenges that compromise care quality. We propose a threshold-based admission policy that redirects non-urgent patients to alternative care pathways, such as telemedicine, during peak congestion. The ED is modeled as a two-class $M/M/c$ preemptive-priority queuing system, where high-acuity patients are prioritized and low-acuity patients are subject to state-dependent redirection. Analyzed via a level-dependent Quasi-Birth-Death (QBD) process, the model determines the optimal threshold by maximizing a long-run time-averaged objective function comprising redirection-affected revenue and costs associated with patient balking and system occupancy. Numerical analysis using national healthcare data reveals that optimal policies are highly context-dependent. While rural EDs generally optimize at lower redirection thresholds, urban EDs exhibit performance peaks at moderate thresholds. Results indicate that our optimal policy yields significant performance gains of up to $4.84\%$ in rural settings and $5.90\%$ in urban environments. This research provides a mathematically rigorous framework for balancing clinical priority with operational efficiency across diverse ED settings.2026-01-15T03:38:24Z37 pages, 14 figuresSahba BaniasadiPaul M. GriffinPrakash Chakrabortyhttp://arxiv.org/abs/2505.11594v3SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training2026-01-14T21:51:33ZThe efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.2025-05-16T18:01:54ZJintao ZhangJia WeiPengle ZhangXiaoming XuHaofeng HuangHaoxu WangKai JiangJianfei ChenJun Zhuhttp://arxiv.org/abs/2601.09527v1Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs2026-01-14T14:49:07ZSMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while professional on-premise hardware (A100, H100) remains prohibitively expensive. We present a systematic evaluation of NVIDIA's Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production LLM inference, benchmarking four open-weight models (Qwen3-8B, Gemma3-12B, Gemma3-27B, GPT-OSS-20B) across 79 configurations spanning quantization formats (BF16, W4A16, NVFP4, MXFP4), context lengths (8k-64k), and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs. The RTX 5090 delivers 3.5-4.6x higher throughput than the 5060 Ti with 21x lower latency for RAG, but budget GPUs achieve the highest throughput-per-dollar for API workloads with sub-second latency. NVFP4 quantization provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss. Self-hosted inference costs $0.001-0.04 per million tokens (electricity only), which is 40-200x cheaper than budget-tier cloud APIs, with hardware breaking even in under four months at moderate volume (30M tokens/day). Our results show that consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG, where high-end GPUs remain essential. We provide deployment guidance and release all benchmark data for reproducible SME-scale deployments.2026-01-14T14:49:07Z15 pages, 18 tables, 7 figures. Includes link to GitHub repository and Docker image for reproducibilityJonathan KnoopHendrik Holtmannhttp://arxiv.org/abs/2205.12139v2Extending the Network Calculus Algorithmic Toolbox for Ultimately Pseudo-Periodic Functions: Pseudo-Inverse and Composition2026-01-14T12:05:41ZNetwork Calculus (NC) is an algebraic theory that represents traffic and service guarantees as curves in a Cartesian plane, in order to compute performance guarantees for flows traversing a network. NC uses transformation operations, e.g., min-plus convolution of two curves, to model how the traffic profile changes with the traversal of network nodes. Such operations, while mathematically well-defined, can quickly become unmanageable to compute using simple pen and paper for any non-trivial case, hence the need for algorithmic descriptions. Previous work identified the class of piecewise affine functions which are ultimately pseudo-periodic (UPP) as being closed under the main NC operations and able to be described finitely. Algorithms that embody NC operations taking as operands UPP curves have been defined and proved correct, thus enabling software implementations of these operations. However, recent advancements in NC make use of operations, namely the lower pseudo-inverse, upper pseudo-inverse, and composition, that are well defined from an algebraic standpoint, but whose algorithmic aspects have not been addressed yet. In this paper, we introduce algorithms for the above operations when operands are UPP curves, thus extending the available algorithmic toolbox for NC. We discuss the algorithmic properties of these operations, providing formal proofs of correctness.2022-05-24T15:14:57ZPreprint submitted to Journal of Discrete Event Dynamic Systems; Erratum attached as ancillary fileDiscrete Event Dynamic Systems, Volume 33, pages 181-219, 2023Raffaele ZippoPaul NikolausGiovanni Stea10.1007/s10626-022-00373-5http://arxiv.org/abs/2601.09393v1AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems2026-01-14T11:32:07ZThe transition from Cloud-Native to AI-Native architectures is fundamentally reshaping software engineering, replacing deterministic microservices with probabilistic agentic services. However, this shift renders traditional black-box evaluation paradigms insufficient: existing benchmarks measure raw model capabilities while remaining blind to system-level execution dynamics. To bridge this gap, we introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities. Leveraging this benchmark across 21 system variants, we uncover critical engineering realities invisible to traditional metrics: a parameter paradox where lightweight models often surpass flagships in protocol adherence, a pervasive inference dominance that renders protocol overhead secondary, and an expensive failure pattern where self-healing mechanisms paradoxically act as cost multipliers on unviable workflows. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems. To facilitate reproducibility and further research, we have open-sourced the benchmark and dataset.2026-01-14T11:32:07ZZirui WangGuangba YuMichael R. Lyu