http://arxiv.org/abs/2605.26671v1 RT-RkNN: Reverse k Nearest Neighbor Queries as a Graphics Ray Casting Problem 2026-05-26T08:08:51Z

Reverse k nearest neighbor (RkNN) queries are fundamental in spatial databases, location-based analytics, and recommendation systems. Existing state-of-the-art techniques rely on spatial pruning supported by R-trees and their variants. However, their pruning effectiveness degrades significantly in challenging scenarios where the number of facilities is small, the user population is dense, or the value of k is large. To overcome these limitations, this work reformulates the RkNN query problem in two-dimensional geometric spaces as a graphics ray-casting problem, where users are modeled as rays and facilities are represented as geometric primitives. Based on this formulation, the first algorithm and implementation exploiting dedicated hardware ray-tracing cores on modern GPUs are developed. This novel approach preserves strong filtering performance even for large values of k, dense user populations, and highly sparse facility distributions. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art algorithms across diverse settings, particularly in scenarios where traditional pruning strategies become inefficient.
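For readers unfamiliar with the query itself, the following is a minimal brute-force sketch of a bichromatic RkNN query in two dimensions. It is not the ray-casting method described in the abstract; it only illustrates the definition that pruning-based and ray-casting approaches both have to satisfy. The function name `rknn_brute_force` and the sample coordinates are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def rknn_brute_force(users, facilities, query, k):
    """Return the users that have `query` among their k nearest facilities.

    A user qualifies when fewer than k other facilities are strictly
    closer to the user than the query facility is. Cost is
    O(len(users) * len(facilities)); no index or pruning is used.
    """
    result = []
    for u in users:
        d_query = dist(u, query)
        # Count competing facilities that beat the query for this user.
        closer = sum(1 for f in facilities if f != query and dist(u, f) < d_query)
        if closer < k:
            result.append(u)
    return result


# Hypothetical sample data (not from the paper).
users = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
facilities = [(1.0, 0.0), (6.0, 5.0), (10.0, 0.0)]
query = (1.0, 0.0)

for k in (1, 2, 3):
    print(k, rknn_brute_force(users, facilities, query, k))
```

The result set can only grow as k increases, which is one way to see why larger k leaves less to prune.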

2026-05-26T08:08:51Z 12 pages excluding references Zhengyang Bai Peng Chen Mohamed Wahib http://arxiv.org/abs/2503.20507v4 Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning 2026-05-26T07:52:32Z

Modern high-performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and cost requirements of data-intensive applications. The performance of an HSS highly depends on two key data-management policies: (1) data placement, which determines the most suitable storage device to store application data, and (2) data migration, which dynamically reorganizes previously-stored data across storage devices (i.e., prefetching hot data and evicting cold data) to sustain high HSS performance. These policies are tightly interdependent, and thus, improving one without considering the other leads to suboptimal HSS performance. Unfortunately, prior works focus on optimizing only one of the policies. Our goal is to design a holistic data-management technique that optimizes both data-placement and data-migration policies to fully exploit the potential of an HSS. To this end, we propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique. Harmonia employs two lightweight autonomous RL agents, a data-placement agent and a data-migration agent, that adapt their policies for the current workload and HSS configuration while coordinating with each other. We evaluate Harmonia on real HSS configurations with up to four heterogeneous storage devices and 25 data-intensive workloads. On a performance- (cost-) optimized HSS with two heterogeneous storage devices, Harmonia outperforms the best-performing prior approach by 29.3% (44.8%) on average. On an HSS with three (four) devices, Harmonia outperforms the best-performing prior work by 38.9% (39.2%) on average. Harmonia's performance benefits come with low latency (240 ns for inference) and storage (206 KiB in DRAM for both RL agents combined) overheads.

2025-03-26T12:47:52Z Rakesh Nadig Vamanan Arulchelvan Rahul Bera Taha Shahroodi Gagandeep Singh Andreas Kakolyris Ismail Emir Yuksel Mohammad Sadrosadati Jisung Park Onur Mutlu http://arxiv.org/abs/2602.17335v3 Do GPUs Really Need New Tabular File Formats? 2026-05-26T06:48:08Z

Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet files generated with CPU-oriented defaults can severely underutilize GPU parallelism, turning GPU scans into a performance bottleneck. In this work, we systematically study how Parquet configurations affect GPU scan performance. We show that Parquet's poor GPU performance is not inherent to the format itself but rather a consequence of suboptimal configuration choices. By applying GPU-aware configurations, we increase effective read bandwidth up to 125 GB/s without modifying the Parquet specification.

2026-02-19T13:07:38Z DaMoN Camera Ready Jigao Luo Qi Chen Carsten Binnig http://arxiv.org/abs/2605.26604v1 Credibility Trilemma in Polymatroidal Service Markets 2026-05-26T06:39:56Z

Mechanism-mediated service markets with polymatroidal feasibility admit efficient, dominant-strategy incentive-compatible (DSIC) allocation, but these guarantees implicitly assume truthful execution by the marketplace operator. Modelling the operator as a strategic player, we establish a credibility trilemma: for single-parameter agents on a non-modular polymatroid, no static sealed-bid mechanism is simultaneously revenue-optimal, DSIC for agents, and credible for the operator. We introduce the Cost of Non-Credibility (CoNC) as a price-of-anarchy-style welfare-loss measure and obtain tight $\Theta$-bounds across five topology classes (single-edge, series, parallel, tree, series-parallel), plus a matching upper bound $O(|\mathcal{S}|)$ on general DAGs realised by an $\Omega(|\mathcal{S}|)$ witness on the SP-augmented sub-family, turning the trilemma into a structural quantity. Three structurally distinct resolutions follow: public broadcast or deferred-revelation commitment, administrative domain separation under settlement separation and four side conditions, and integrator competition orthogonal to mechanism execution under disjoint actors. An instance-level grounding over the edge-pricing market of Amin et al. confirms the trilemma's robustness on a refereed external setting. The result establishes marketplace neutrality as a first-order design constraint on polymatroidal service markets rather than an implementation detail: where the operator is a strategic player, credibility trades off against revenue optimality and agent incentive compatibility along structurally characterised lines.

2026-05-26T06:39:56Z 75 pages, 3 figures. Prepared for submission to the ACM Transactions on Economics and Computation (TEAC) Lauri Lovén Sujit Gujar Kalle Timperi Hassan Mehmood Praveen Kumar Donta Sasu Tarkoma Schahram Dustdar http://arxiv.org/abs/2605.26599v1 Reducing Internal State in Eigenvalue-Only Divide-and-Conquer Tridiagonal Eigensolvers 2026-05-26T06:32:21Z

Divide and Conquer (D&C) is a widely used algorithmic strategy for symmetric eigenvalue decomposition. Its natural parallelism makes D&C attractive on modern multicore CPUs and GPUs, but existing eigenvalue-only routines often default to QR-based methods because conventional D&C still materializes or replays large transformation matrices during the conquer phase. This paper proposes a boundary-row D&C algorithm for eigenvalue-only computation. The key observation is that the conquer phase only needs selected boundary rows/columns rather than the full accumulated eigenvector matrix. By propagating these boundary rows directly through the recursion, the proposed algorithm reduces the memory requirement from quadratic to linear space while also eliminating unnecessary matrix-vector work in the conventional lazy-replay formulation. We provide the algorithm, its time and space complexity analysis, correctness and stability arguments, optimized CPU and GPU implementations, and an evaluation against QR and D&C routines in standard numerical libraries.

2026-05-26T06:32:21Z 13 pages, 2 figures, 4 tables Ruiyi Zhan Shaoshuai Zhang http://arxiv.org/abs/2605.24217v2 Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks 2026-05-26T05:47:59Z

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
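To make the queueing argument concrete, here is a small sketch of the textbook Pollaczek-Khinchine formula for the mean waiting time in an M/G/1 queue, $W_q = \lambda E[S^2] / (2(1 - \rho))$ with $\rho = \lambda E[S]$. The abstract does not give the paper's exact model, so the function name `mg1_mean_wait` and the per-request client overhead of 1 ms are assumptions chosen only to show how queueing delay grows as the request rate approaches the capacity of a single-process client.

```python
def mg1_mean_wait(arrival_rate, mean_service, second_moment_service):
    """Mean time a request waits in queue (excluding service) in an M/G/1 queue.

    arrival_rate          -- lambda, requests per second (Poisson arrivals)
    mean_service          -- E[S], seconds of client-side work per request
    second_moment_service -- E[S^2], in seconds squared
    Returns infinity when utilization rho = lambda * E[S] reaches 1.
    """
    rho = arrival_rate * mean_service
    if rho >= 1.0:
        return float("inf")
    return arrival_rate * second_moment_service / (2.0 * (1.0 - rho))


# Hypothetical client: 1 ms of serialized work per request,
# exponentially distributed, so E[S^2] = 2 * E[S]^2.
MEAN_S = 0.001
SECOND_MOMENT = 2 * MEAN_S ** 2

for rate in (100, 500, 900, 990):
    wait_ms = mg1_mean_wait(rate, MEAN_S, SECOND_MOMENT) * 1000
    print(f"{rate} req/s -> mean client-side queueing delay {wait_ms:.2f} ms")
```

Under these assumed numbers the delay added by the client is small at low load and grows without bound as utilization nears 1, which is the kind of bias that would be attributed to the serving engine if the client were not accounted for.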

2026-05-22T20:57:26Z Ashok Chandrasekar Jason Kramberger http://arxiv.org/abs/2601.02092v2 SuperSFL: Resource-Heterogeneous Federated Split Learning with Weight-Sharing Super-Networks 2026-05-26T05:24:24Z

SplitFed Learning (SFL) combines federated learning and split learning to enable collaborative training across distributed edge devices; however, it faces significant challenges in heterogeneous environments with diverse computational and communication capabilities. This paper proposes SuperSFL, a federated split learning framework that leverages a weight-sharing super-network to dynamically generate resource-aware client-specific subnetworks, effectively mitigating device heterogeneity. SuperSFL introduces Three-Phase Gradient Fusion (TPGF), an optimization mechanism that coordinates local updates, server-side computation, and gradient fusion to accelerate convergence. In addition, a fault-tolerant client-side classifier and collaborative client-server aggregation enable uninterrupted training under intermittent communication failures. Experimental results on CIFAR-10 and CIFAR-100 with up to 100 heterogeneous clients show that SuperSFL converges $2\times$ to $5\times$ faster in terms of communication rounds than baseline SFL while achieving higher accuracy, resulting in up to $20\times$ lower total communication cost and $13\times$ shorter training time. SuperSFL also demonstrates improved energy efficiency compared to baseline methods, making it a practical solution for federated learning in heterogeneous edge environments.

2026-01-05T13:18:47Z Accepted in 32nd International European Conference on Parallel and Distributed Computing Abdullah Al Asif Sixing Yu Juan Pablo Munoz Arya Mazaheri Ali Jannesari http://arxiv.org/abs/2512.13268v2 SPARS: A Reinforcement Learning-Enabled Simulator for Power Management in HPC Job Scheduling 2026-05-26T05:17:33Z

High-performance computing (HPC) systems consume enormous amounts of energy, with idle nodes as a major source of energy waste. Powering down idle nodes can mitigate this problem, but long boot/shutdown delays can introduce significant queueing penalties if transitions are poorly timed. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power-state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First Come First Serve and EASY Backfilling, along with enhanced variants that employ reinforcement learning agents to dynamically decide when nodes should be powered on or off. Users can configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records comprehensive metrics, including energy usage, wasted power, job waiting times, and node utilization, and provides Gantt chart visualizations to analyze scheduling dynamics and power transitions. Unlike widely used Batsim-based frameworks that rely on heavy inter-process communication, SPARS provides lightweight event handling and consistent simulation results, making experiments easier to reproduce and extend. Its modular design allows new scheduling heuristics or learning algorithms to be integrated with minimal effort. By providing a flexible, reproducible, and extensible platform, SPARS enables researchers and practitioners to systematically evaluate power-aware scheduling strategies, explore the trade-offs between energy efficiency and performance, and accelerate the development of sustainable HPC operations.

2025-12-15T12:28:08Z 10 pages, 5 figures, 4 tables SoftwareX 34 (2026) Muhammad Alfian Amrizal Raka Satya Prasasta Santana Yuda Pradata Kadek Gemilang Santiyuda Reza Pulungan Hiroyuki Takizawa 10.1016/j.softx.2026.102693 http://arxiv.org/abs/2605.26523v1 StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting 2026-05-26T04:11:38Z

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.

2026-05-26T04:11:38Z Accepted at ACM MobiSys 2026 Minh K. Quan Pubudu N. Pathirana http://arxiv.org/abs/2602.02192v5 ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning 2026-05-26T02:43:34Z

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of LLMs ranging from 4B to 32B parameters under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

2026-02-02T14:57:53Z 24 pages, 7 figures Jingwei Song Meng Chen Jie Xiao Qingnan Ren Jiaqi Huang Yangshen Deng Chris Tong Wanyi Chen Suli Wang Zhisheng Chen Ziqian Bi Shuo Lu Yiqun Duan Xu Wang Rymon Yu Lynn Ai Eric Yang Tianyu Shi http://arxiv.org/abs/2605.26461v1 Characterization-Guided GPU Fault Resilience in NVIDIA MPS 2026-05-26T02:18:53Z

NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault resilience: a fault in one process can terminate all co-running processes, limiting its adoption in resilience-critical settings such as multi-tenant GPU clusters. In this work, we design a fault-resilient MPS to solve this problem. Our design is guided by insights from a systematic characterization of GPU faults and a deep analysis of their end-to-end processing pipeline. Based on these insights, we design two complementary mechanisms. The first is a fault isolation mechanism for the dominant memory-related faults, which can be fully isolated by software intervention in the open GPU driver kernel module. For other faults, whose processing occurs within proprietary software, we design a practical second mechanism: fast recovery using virtual-memory-based sharing of GPU-resident state. Our evaluation on different GPUs and workloads shows that these mechanisms handle the corresponding faults effectively with minimal overhead.

2026-05-26T02:18:53Z 16 pages, 9 figures, 5 tables Rixin Liu Xingqi Cui Kaijian Wang Xinheng Ding Zirui Liu Yuke Wang Jiarong Xing http://arxiv.org/abs/2605.26418v1 When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control 2026-05-26T01:07:42Z

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test, so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
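As a reference point for what a rule-based autoscaler looks like, the sketch below implements the core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, with a tolerance band around the target. It is not the calibrated baseline from the paper, whose rules and parameters the abstract does not specify; the function name, the replica bounds, and the example numbers are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Rule-based scaling decision in the style of the Kubernetes HPA.

    Scales replicas in proportion to how far the observed metric is from
    its target, skips changes inside the tolerance band, and clamps the
    result to [min_replicas, max_replicas] (bounds here are hypothetical).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical readings: 4 replicas, target 60% average CPU utilization.
print(desired_replicas(4, 90, 60))  # overloaded: scale out
print(desired_replicas(4, 62, 60))  # within tolerance: hold
print(desired_replicas(4, 30, 60))  # underloaded: scale in
```

Calibrating such a controller amounts to choosing the target, tolerance, and bounds for the workload at hand, which is presumably where the effort on the baseline goes.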

2026-05-26T01:07:42Z Guilin Zhang Chuanyi Sun Kai Zhao Shahryar Sarkani John Fossaceca http://arxiv.org/abs/2403.04976v4 Advancing Environmental Sustainability in Data Centers via Carbon Depreciation Models 2026-05-26T00:40:35Z

Recent improvements in energy efficiency and renewable energy integration have increased the relative importance of embodied carbon in data centers, motivating improved provisioning strategies. Conventional approaches primarily minimize operational energy, but this perspective is increasingly insufficient for sustainability. In this paper, we propose carbon depreciation models to encourage longer hardware lifetimes. Carbon depreciation assigns a larger portion of embodied carbon to newly provisioned servers, discouraging unnecessary deployment of new hardware. As a result, new servers are provisioned mainly for jobs with strict quality-of-service (QoS) constraints, while older servers, whose embodied carbon has largely been recovered, are used for other workloads. We further argue that both embodied carbon and operational carbon from server idle time should be recovered during active jobs, encouraging provisioning strategies that maintain high utilization. We show that prior carbon accounting strategies can be counterproductive: under a greedy scheduler minimizing carbon under QoS constraints, jobs are priced as 25% cheaper on new hardware than on older hardware. In contrast, our approach uses a greedy scheduler that prioritizes older hardware through non-linear carbon depreciation, promoting sustainable provisioning. Experimental results show carbon reductions of 28-57%, depending on server lifetime assumptions.

2024-03-08T01:16:26Z 7 pages, 10 figures Shixin Ji Zhuoping Yang Xingzhen Chen Alex K. Jones Peipei Zhou http://arxiv.org/abs/2605.26404v1 Configuration-Driven Dynamic API Routing for Resilient Service Integrations 2026-05-26T00:26:49Z

Modern online services rely on third-party APIs for authentication, payments, communication, identity verification, fraud detection, observability, and fulfillment. These dependencies are outside the direct operational control of the application owner and may experience regional outages, throttling, latency spikes, quota exhaustion, or behavior changes that surface as user-visible failures. This paper presents configuration-driven dynamic API routing, an architecture for resilient third-party service integration based on pluggable factor lists, real-time telemetry, circuit breakers, bulkhead isolation, and a closed-loop decision engine. A factor list defines operation-specific hard gates and weighted scoring functions that evaluate candidate providers using live metrics, regional policy constraints, quota state, latency, cost, and incident signals. The router separates routing policy from application code, allowing operators to adapt vendor selection at runtime without redeploying applications. We formalize the factor-list model, describe a request-time routing algorithm, present the event pipeline that computes sliding-window provider health metrics, and analyze failover behavior under degraded-provider scenarios. We also describe an anonymized SMS verification case study in which manual vendor switching was replaced by automated routing driven by completion-rate telemetry.
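The factor-list idea of hard gates followed by weighted scoring can be sketched as below. This is an illustrative reading of the abstract rather than the authors' implementation: the `Provider` fields, the gate and factor definitions, the weights, and the vendor names are all hypothetical, and a real router would feed these values from live telemetry and circuit-breaker state.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    region_allowed: bool     # regional policy constraint
    quota_remaining: int     # calls left in the current quota window
    success_rate: float      # sliding-window completion rate, 0..1
    p95_latency_ms: float
    cost_per_call: float


# Hard gates: a provider failing any gate is never selected.
HARD_GATES = [
    lambda p: p.region_allowed,
    lambda p: p.quota_remaining > 0,
]

# Weighted factors: (weight, scoring function returning a value in 0..1).
WEIGHTED_FACTORS = [
    (0.6, lambda p: p.success_rate),
    (0.3, lambda p: 1.0 - min(p.p95_latency_ms / 1000.0, 1.0)),
    (0.1, lambda p: 1.0 - min(p.cost_per_call / 0.05, 1.0)),
]


def score(provider):
    """Weighted sum of the factor scores for one provider."""
    return sum(weight * factor(provider) for weight, factor in WEIGHTED_FACTORS)


def route(providers):
    """Return the highest-scoring provider that passes every hard gate, or None."""
    eligible = [p for p in providers if all(gate(p) for gate in HARD_GATES)]
    return max(eligible, key=score) if eligible else None


providers = [
    Provider("vendor_a", True, 100, 0.99, 400.0, 0.02),
    Provider("vendor_b", True, 100, 0.80, 200.0, 0.01),
    Provider("vendor_c", False, 100, 1.00, 50.0, 0.01),  # blocked by region gate
]
print(route(providers).name)
```

Because the gates and weights live in data rather than in the request path, swapping them at runtime changes vendor selection without touching application code, which is the separation the abstract describes.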

2026-05-26T00:26:49Z 11 pages, 5 figures, 2 tables, anonymized production-inspired case study Nataraj Agaram Sundar Tejas Morabia http://arxiv.org/abs/2605.26384v1 GridPilot: Real-Time Grid-Responsive Control for AI Supercomputers 2026-05-25T23:14:52Z

At global scale, data-center electricity demand is growing faster than the grids that supply it, while system operators increasingly require large flexible loads that can adjust power within seconds to absorb variable wind and solar generation. For multi-megawatt AI/HPC facilities, the key unresolved question is practical and measurable: how quickly can the software stack translate a grid request into a real change in GPU power at the facility meter, where commitments are settled? We answer this on real hardware with GridPilot, a three-tier predictive controller operating across milliseconds, seconds, and hours, augmented by a deterministic safety-island bypass for fast response. On a three-GPU NVIDIA V100 testbed, GridPilot achieves a measured end-to-end trigger-to-target response of 97.2 ms, which is 6.9x faster than the 700 ms requirement of Nordic Fast Frequency Reserve. We further incorporate an instantaneous Power Usage Effectiveness (PUE) correction so dispatched commitments remain robust at meter level rather than only at IT load level. In replay experiments across six representative European grids (from Sweden to Poland), the PUE-aware controller closes 2.5-5.8 percentage points of cooling-overhead drag. GridPilot is released as open source and serves as a proof of concept that MW-scale AI/HPC demand can be engineered as controllable, grid-responsive flexibility by design.
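The distinction between IT-level and meter-level power can be illustrated with the definition of PUE as facility power divided by IT power. The sketch below assumes, as a simplification not stated in the abstract, that cooling and other overhead scale proportionally with IT load over the adjustment interval, so that a meter-level commitment maps to an IT-level change by dividing by the instantaneous PUE. The function name and the numbers are hypothetical.

```python
def it_delta_for_meter_delta(meter_delta_kw, pue):
    """IT power change (kW) needed to realize a given change at the facility meter.

    Uses PUE = facility power / IT power and assumes overhead scales
    proportionally with IT load (a simplifying assumption).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return meter_delta_kw / pue


# Hypothetical request: shed 100 kW at the meter while PUE is 1.25.
committed_kw = 100.0
it_change_kw = it_delta_for_meter_delta(committed_kw, pue=1.25)
print(f"Reduce IT load by {it_change_kw:.1f} kW to deliver {committed_kw:.1f} kW at the meter")
```

A controller that ignored PUE and cut 100 kW of IT load in this example would overshoot the meter-level commitment, while one that assumed a fixed PUE would drift as cooling conditions change.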

2026-05-25T23:14:52Z Denisa-Andreea Constantinescu David Atienza