https://arxiv.org/api/cIW2re+coDFe4tDsHiqrMsIrxCg
2026-03-20T12:38:39Z

http://arxiv.org/abs/2603.15400v1
Multi-Objective Load Balancing for Heterogeneous Edge-Based Object Detection Systems
2026-03-16T15:15:56Z
The rapid proliferation of the Internet of Things (IoT) and smart applications has led to a surge in data generated by distributed sensing devices. Edge computing is a mainstream approach to managing this data by pushing computation closer to the data source, typically onto resource-constrained devices such as single-board computers (SBCs). In such environments, the unavoidable heterogeneity of hardware and software makes effective load balancing particularly challenging. In this paper, we propose a multi-objective load balancing method tailored to heterogeneous, edge-based object detection systems. We study a setting in which multiple device-model pairs expose distinct accuracy, latency, and energy profiles, while both request intensity and scene complexity fluctuate over time. To handle this dynamically varying environment, our approach uses a two-stage decision mechanism: it first performs accuracy-aware filtering to identify suitable device-model candidates that provide accuracy within the acceptable range, and then applies a weighted-sum scoring function over expected latency and energy consumption to select the final execution target. We evaluate the proposed load balancer through extensive experiments on real-world datasets, comparing against widely used baseline strategies. The results indicate that the proposed multi-objective load balancing method halves energy consumption and achieves an 80% reduction in end-to-end latency, while incurring only a modest, up to 10%, decrease in detection accuracy relative to an accuracy-centric baseline.
2026-03-16T15:15:56Z
Daghash K. Alqahtani, Maria A. Rodriguez, Muhammad Aamir Cheema, Adel N. Toosi

http://arxiv.org/abs/2603.15202v1
LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling
2026-03-16T12:43:32Z
High-quality LLM request scheduling requires achieving two key objectives: whether the routed instance has KV$ to accelerate the request execution and whether the workload is balanced across instances. Achieving both objectives is challenging because pursuing one objective may compromise the other. Current approaches adopt various combinators (e.g., linear combinations) to compute a scheduling score combining indicators for the two objectives, which are complex in that they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, and could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators-one for KV$-aware (new prefill tokens if routed to an instance) and one for load balancing-aware (current batch size of the instance)-as the scheduling score can simultaneously achieve both objectives well without any hyperparameter tuning. The key idea is that the multiplied score considers both objectives in a manner similar to a linear combination, with a nice property that the original hyperparameters are canceled out during comparison so we don't need tuning to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics, and our extensive experiments show that this simple approach can reduce TTFT by 92% and 52%, and TPOT by 21% and 20%, compared to vLLM-v1 and a production scheduler on real-world workloads covering chatbots, API calls, and coding agents.
We also mathematically derive the conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand.
2026-03-16T12:43:32Z
Dingyan Zhang, Jinbo Han, Kaixi Zhang, Xingda Wei, Sijie Shen, Chenguang Fang, Wenyuan Yu, Jingren Zhou, Rong Chen

http://arxiv.org/abs/2603.15183v1
Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems
2026-03-16T12:20:06Z
Multi-agent LLM orchestration incurs synchronization costs scaling as O(n x S x |D|) in agents, steps, and artifact size under naive broadcast -- a regime I term broadcast-induced triply-multiplicative overhead. I argue this pathology is a structural residue of full-state rebroadcast, not an inherent property of multi-agent coordination.
The central claim: synchronization cost explosion in LLM multi-agent systems maps with formal precision onto the cache coherence problem in shared-memory multiprocessors, and MESI-protocol invalidation transfers to artifact synchronization under minimal structural modification.
I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n x S x |D|) to O((n + W) x |D|). A TLA+-verified protocol enforces single-writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states.
Simulation across four workload configurations yields token savings of 95.0% +/- 1.3% at V=0.05, 92.3% +/- 1.4% at V=0.10, 88.3% +/- 1.5% at V=0.25, and 84.2% +/- 1.3% at V=0.50 -- each exceeding the theorem's conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold.
Contributions: (1) formal MESI-to-artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+-verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always-read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers.
2026-03-16T12:20:06Z
25 pages. Code and reproduction scripts at https://github.com/hipvlady/agent-coherence
Vladyslav Parakhin

http://arxiv.org/abs/2511.22333v3
PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel
2026-03-16T10:24:27Z
LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention.
This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses and runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and TPOT by 17.0-93.1% under the same configurations against state-of-the-art attention kernels. PAT's source code is publicly available at https://github.com/flashserve/PAT.
2025-11-27T11:10:30Z
Accepted by ASPLOS'26, code available at https://github.com/flashserve/PAT
Jinjun Yi, Zhixin Zhao, Yitao Hu, Ke Yan, Weiwei Sun, Hao Wang, Laiping Zhao, Yuhao Zhang, Wenxin Li, Keqiu Li
10.1145/3779212.3790200

http://arxiv.org/abs/2603.14841v1
Real-Time Driver Safety Scoring Through Inverse Crash Probability Modeling
2026-03-16T05:37:12Z
Road crashes remain a leading cause of preventable fatalities. Existing prediction models predominantly produce binary outcomes, which offer limited actionable insights for real-time driver feedback. These approaches often lack continuous risk quantification, interpretability, and explicit consideration of vulnerable road users (VRUs), such as pedestrians and cyclists. This research introduces SafeDriver-IQ, a framework that transforms binary crash classifiers into continuous 0-100 safety scores by combining national crash statistics with naturalistic driving data from autonomous vehicles. The framework fuses National Highway Traffic Safety Administration (NHTSA) crash records with Waymo Open Motion Dataset scenarios, engineers domain-informed features, and incorporates a calibration layer grounded in transportation safety literature.
Evaluation across 15 complementary analyses indicates that the framework reliably differentiates high-risk from low-risk driving conditions with strong discriminative performance. Findings further reveal that 87% of crashes involve multiple co-occurring risk factors, with non-linear compounding effects that increase the risk to 4.5x baseline. SafeDriver-IQ delivers proactive, explainable safety intelligence relevant to advanced driver-assistance systems (ADAS), fleet management, and urban infrastructure planning. This framework shifts the focus from reactive crash counting to real-time risk prevention.
2026-03-16T05:37:12Z
10 pages, 13 figures, and 14 tables. Submitted in EIT 2026 Conference hosted by The University of Wisconsin-La Crosse and sponsored by IEEE Region 4 (R4)
Joyjit Roy, Samaresh Kumar Singh

http://arxiv.org/abs/2603.14826v1
Protecting Distributed Blockchain with Twin-Field Quantum Key Distribution: A Quantum Resistant Approach
2026-03-16T05:07:31Z
Quantum computing poses multi-layered security challenges to classical blockchain systems, whereas quantum-secured blockchains that rely on quantum key distribution (QKD) to establish secure channels can address this potential threat. This paper presents a scalable quantum-resistant blockchain architecture designed to address the connectivity and distance limitations of QKD-integrated quantum networks. By leveraging the twin-field (TF) QKD protocol within a measurement-device-independent (MDI) topology, the proposed framework can reduce the infrastructure complexity from quadratic to linear scaling. This architecture effectively integrates information-theoretic security with distributed consensus mechanisms, allowing the system to overcome the fundamental rate-loss limits inherent in traditional point-to-point links.
The proposed scheme offers a theoretically sound and feasible solution for deploying large-scale and long-distance consortium.
2026-03-16T05:07:31Z
Xuan Li, Ying Guo

http://arxiv.org/abs/2603.14806v1
Fold-CP: A Context Parallelism Framework for Biomolecular Modeling
2026-03-16T04:20:01Z
Understanding cellular machinery requires atomic-scale reconstruction of large biomolecular assemblies. However, predicting the structures of these systems has been constrained by hardware memory requirements of models like AlphaFold 3, imposing a practical ceiling of a few thousand residues that can be processed on a single GPU. Here we present NVIDIA BioNeMo Fold-CP, a context parallelism framework that overcomes this barrier by distributing the inference and training pipelines of co-folding models across multiple GPUs. We use the Boltz models as open source reference architectures and implement custom multidimensional primitives that efficiently parallelize both the dense triangular updates and the irregular, data-dependent pattern of window-batched local attention. Our approach achieves efficient memory scaling; for an N-token input distributed across P GPUs, per-device memory scales as $O(N^2/P)$, enabling the structure prediction of assemblies exceeding 30,000 residues on 64 NVIDIA B300 GPUs. We demonstrate the scientific utility of this approach through successful developer use cases: Fold-CP enabled the scoring of over 90% of the Comprehensive Resource of Mammalian protein complexes (CORUM) database, as well as folding of the disease-relevant PI4KA lipid kinase complex bound to an intrinsically disordered region without cropping.
By providing a scalable pathway for modeling massive systems with full global context, Fold-CP represents a significant step toward the realization of a virtual cell.
2026-03-16T04:20:01Z
23 pages, 10 figures
Dejun Lin, Simon Chu, Vishanth Iyer, Youhan Lee, John St John, Kevin Boyd, Brian Roland, Xiaowei Ren, Guoqing Zhou, Zhonglin Cao, Polina Binder, Yuliya Zhautouskaya, Jakub Zakrzewski, Maximilian Stadler, Kyle Gion, Yuxing Peng, Xi Chen, Tianjing Zhang, Philipp Junk, Michelle Dimon, Paweł Gniewek, Fabian Ortega, McKinley Polen, Ivan Grubisic, Ali Bashir, Graham Holt, Danny Kovtun, Matthias Grass, Luca Naef, Rui Wang, Jian Peng, Anthony Costa, Saee Paliwal, Eddie Calleja, Timur Rvachov, Neha Tadimeti, Roy Tal, Emine Kucukbenli

http://arxiv.org/abs/2303.06324v2
Comprehensive Deadlock Prevention for GPU Collective Communication
2026-03-16T03:24:54Z
Distributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applications when multiple collectives circularly wait for each other. GPU collective deadlocks pose a significant challenge to the correct functioning and efficiency of distributed deep learning, and no general effective solutions are currently available. Only in specific scenarios can ad hoc methods, which make an application invoke collectives in a consistent order across GPUs, be used to prevent circular collective dependency and deadlocks.
This paper presents DFCCL, a novel GPU collective communication library that provides a comprehensive approach for GPU collective deadlock prevention while maintaining high performance. DFCCL achieves preemption for GPU collectives at the bottom library level, effectively preventing deadlocks even if applications cause circular collective dependency. DFCCL ensures high performance with its execution and scheduling methods for collectives. Experiments show that DFCCL effectively prevents GPU collective deadlocks in various situations. Moreover, extensive evaluations demonstrate that DFCCL delivers performance comparable to or superior to NCCL, the state-of-the-art collective communication library highly optimized for NVIDIA GPUs.
2023-03-11T06:45:47Z
Lichen Pan, Juncheng Liu, Yongquan Fu, Jinhui Yuan, Rongkai Zhang, Pengze Li, Zhen Xiao
10.1145/3689031.3717466

http://arxiv.org/abs/2602.21566v2
Epoch-based Optimistic Concurrency Control in Geo-replicated Databases
2026-03-16T02:58:29Z
Geo-distribution is essential for modern online applications to ensure service reliability and high availability. However, supporting high-performance serializable transactions in geo-replicated databases remains a significant challenge. This difficulty stems from the extensive over-coordination inherent in distributed atomic commitment, concurrency control, and fault-tolerance replication protocols under high network latency.
To address these challenges, we introduce Minerva, a unified distributed concurrency control designed for highly scalable multi-leader replication. Minerva employs a novel epoch-based asynchronous replication protocol that decouples data propagation from the commitment process, enabling continuous transaction replication. Optimistic concurrency control is used to allow any replica to execute transactions concurrently and commit without coordination. Instead of aborting transactions when conflicts are detected, Minerva uses deterministic re-execution to resolve conflicts, ensuring serializability without sacrificing performance. To further enhance concurrency, we construct a conflict graph and use a maximum weight independent set algorithm to select the optimal subset of transactions for commitment, minimizing the number of re-executed transactions. Our evaluation demonstrates that Minerva significantly outperforms state-of-the-art replicated databases, achieving over $3\times$ higher throughput in scalability experiments and $2.8\times$ higher throughput during a high network latency simulation with the TPC-C benchmark.
2026-02-25T04:44:50Z
Yunhao Mao, Harunari Takata, Michail Bachras, Yuqiu Zhang, Shiquan Zhang, Gengrui Zhang, Hans-Arno Jacobsen
10.1145/3802052

http://arxiv.org/abs/2603.14729v1
DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning
2026-03-16T02:02:38Z
Next-generation IoT applications increasingly span autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments.
Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR$_{0.95}$) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline.
2026-03-16T02:02:38Z
Zhiyu Wang, Mohammad Goudarzi, Mingming Gong, Rajkumar Buyya

http://arxiv.org/abs/2603.14690v1
Can you keep a secret? A new protocol for sender-side enforcement of causal message delivery
2026-03-16T00:48:21Z
Protocols for causal message delivery are widely used in distributed systems. Traditionally, causal delivery can be enforced either on the message sender's side or on the receiver's side.
The traditional sender-side approach avoids the message metadata overhead of the receiver-side approach, but is more conservative than necessary. We present Cykas ("Can you keep a secret?"), a new protocol for sender-side enforcement of causal delivery that sidesteps the conservativeness of the traditional sender-side approach by allowing eager sending of messages and constraining the behavior of their recipients. We implemented the Cykas protocol in Rust and checked the safety and liveness of our implementation using the Stateright implementation-level model checker. Our experiments show that for applications involving long-running jobs, Cykas has a performance advantage: Cykas lets long-running jobs start (and end) earlier, leading to shorter overall execution time compared to the traditional sender-side approach.
2026-03-16T00:48:21Z
To be presented at PaPoC 2026
Yan Tong, Nathan Liittschwager, Lindsey Kuper

http://arxiv.org/abs/2603.14630v1
Towards an Adaptive Runtime System for Cloud-Native HPC
2026-03-15T22:03:48Z
The ongoing convergence of HPC and cloud computing presents a fundamental challenge: HPC applications, designed for static and homogeneous supercomputers, are ill-suited for the dynamic, heterogeneous, and volatile nature of the cloud. Traditional parallel programming models like MPI struggle to leverage key cloud advantages, such as resource elasticity and low-cost spot instances, while also failing to address challenges like performance variability and processor heterogeneity. This paper demonstrates how the asynchronous, message-driven paradigm of the Charm++ parallel runtime system can bridge this gap. We present a set of tools and strategies that enable HPC applications to run efficiently and resiliently on dynamic cloud infrastructure across both CPU and GPU resources. Our work makes two key contributions.
First, we demonstrate that rate-aware load balancing in Charm++ improves performance for applications running on heterogeneous CPU and GPU instances on the cloud. We further demonstrate how core Charm++ principles mitigate performance degradation from common cloud challenges like network contention and processor performance variability, which are exacerbated by the tightly coupled, globally synchronized nature of many science and engineering applications. Second, we extend an existing resource management framework to support GPU and CPU spot instances with minimal interruption overhead. Together, these contributions provide a robust framework for adapting HPC applications to achieve efficient, resilient, and cost-effective performance on the cloud.
2026-03-15T22:03:48Z
Aditya Bhosale, Advait Tahilyani, Laxmikant Kale, Sara Kokkila-Schumacher

http://arxiv.org/abs/2603.14583v1
Machine Learning-Driven Intelligent Memory System Design: From On-Chip Caches to Storage
2026-03-15T20:02:05Z
Despite the data-rich environment in which memory systems of modern computing platforms operate, many state-of-the-art architectural policies employed in the memory system rely on static, human-designed heuristics that fail to truly adapt to the workload and system behavior via principled learning methodologies. In this article, we propose a fundamentally different design approach: using lightweight and practical machine learning (ML) methods to enable adaptive, data-driven control throughout the memory hierarchy.
We present three ML-guided architectural policies: (1) Pythia, a reinforcement learning-based data prefetcher for on-chip caches, (2) Hermes, a perceptron learning-based off-chip predictor for multi-level cache hierarchies, and (3) Sibyl, a reinforcement learning-based data placement policy for hybrid storage systems. Our evaluation shows that Pythia, Hermes, and Sibyl significantly outperform the best-prior human-designed policies, while incurring modest hardware overheads. Collectively, this article demonstrates that integrating adaptive learning into memory subsystems can lead to intelligent, self-optimizing architectures that unlock performance and efficiency gains beyond what is possible with traditional human-designed approaches.
2026-03-15T20:02:05Z
Extended version of the IEEE Micro 2026 article
Rahul Bera, Rakesh Nadig, Onur Mutlu
10.1109/MM.2026.3667076

http://arxiv.org/abs/2603.14577v1
Covariance-Guided Resource Adaptive Learning for Efficient Edge Inference
2026-03-15T19:54:08Z
For deep learning inference on edge devices, hardware configurations achieving the same throughput can differ by 2$\times$ in power consumption, yet operators often struggle to find the efficient ones without exhaustive profiling. Existing approaches often rely on inefficient static presets or require expensive offline profiling that must be repeated for each new model or device. To address this problem, we present CORAL, an online optimization method that discovers near-optimal configurations without offline profiling. CORAL leverages distance covariance to statistically capture the non-linear dependencies between hardware settings, e.g., DVFS and concurrency levels, and performance metrics. Unlike prior work, we explicitly formulate the challenge as a throughput-power co-optimization problem to satisfy power budgets and throughput targets simultaneously. We evaluate CORAL on two NVIDIA Jetson devices across three object detection models ranging from lightweight to heavyweight.
In single-target scenarios, CORAL achieves 96%–100% of the optimal performance found by exhaustive search. In strict dual-constraint scenarios where baselines fail or exceed power budgets, CORAL consistently finds proper configurations online with minimal exploration.
2026-03-15T19:54:08Z
8 pages, 10 figures
Ahmad N. L. Nabhaan, Zaki Sukma, Rakandhiya D. Rachmanto, Muhammad Husni Santriaji, Byungjin Cho, Arief Setyanto, In Kee Kim

http://arxiv.org/abs/2603.14445v1
Committee Configuration Optimization for Parallel Byzantine Consensus in a Trusted Execution Environment
2026-03-15T15:42:44Z
Parallel Byzantine Fault Tolerant (BFT) protocols based on committee-based sharding improve scalability but weaken safety since smaller node groups are responsible for consensus. Recent approaches integrate trusted execution environments (TEEs) into parallel BFT frameworks to enhance safety. While the scalability and safety issues are addressed by trusted parallel BFT, existing committee configuration methods often rely on randomized assignment, which can degrade performance. This paper proposes a committee configuration optimization (CCO) model based on mixed integer programming to improve transaction performance for trusted parallel BFT. The model considers communication delays and node failure rates to determine an optimal committee configuration that minimizes transaction latency under both normal operations and scenarios of trusted hardware failures. We integrate CCO into a trusted parallel BFT protocol and evaluate the performance on Microsoft virtual machines. Experimental results demonstrate 15% and 21% improved transaction throughput under normal operations and the fallback process, respectively, highlighting the benefits of optimization-driven committee configuration in trusted parallel BFT systems.
2026-03-15T15:42:44Z
Yifei Xie, Btissam Er-Rahmadi, Xiao Chen, Tiejun Ma, Jane Hillston