http://arxiv.org/abs/2603.07750v1
Structured Gossip: A Partition-Resilient DNS for Internet-Scale Dynamic Networks
2026-03-08T17:54:36Z
Network partitions pose fundamental challenges to distributed name resolution in mobile ad-hoc networks (MANETs) and edge computing. Existing solutions either require active coordination that fails to scale, or use unstructured gossip with excessive overhead. We present \textit{Structured Gossip DNS}, exploiting DHT finger tables to achieve partition resilience through \textbf{passive stabilization}. Our approach reduces message complexity from $O(n)$ to $O(n/\log n)$ while maintaining $O(\log^2 n)$ convergence. Unlike active protocols requiring synchronous agreement, our passive approach guarantees eventual consistency through commutative operations that converge regardless of message ordering. The system handles arbitrary concurrent partitions via version vectors, eliminating global coordination and enabling billion-node deployments.
2026-03-08T17:54:36Z
Rejected from ACM SIGMOD 2026 Demo Track
Priyanka Sinha, Dilys Thomas

http://arxiv.org/abs/2603.07683v1
Mitigating the Memory Bottleneck with Machine Learning-Driven and Data-Aware Microarchitectural Techniques
2026-03-08T15:34:25Z
Modern applications process massive data volumes that overwhelm the storage and retrieval capabilities of memory systems, making memory the primary performance and energy-efficiency bottleneck of computing systems. Although many microarchitectural techniques attempt to hide or tolerate long memory access latency, rapidly growing data footprints continue to outpace technology scaling, requiring more effective solutions. This dissertation shows that modern processors observe large amounts of application and system data during execution, yet many microarchitectural mechanisms make decisions largely independent of this information.
Through four case studies, we demonstrate that such data-agnostic design leads to substantial missed opportunities for improving performance and energy efficiency.
To address this limitation, this dissertation advocates shifting microarchitecture design from data-agnostic to data-informed. We propose mechanisms that (1) learn policies from observed execution behavior (data-driven design) and (2) exploit semantic characteristics of application data (data-aware design). We apply lightweight machine learning techniques and previously underexplored data characteristics across four processor components: a reinforcement learning-based hardware data prefetcher that learns memory access patterns online; a perceptron predictor that identifies memory requests likely to access off-chip memory; a reinforcement learning mechanism that coordinates data prefetching and off-chip prediction; and a mechanism that exploits repeatability in memory addresses and loaded values to eliminate predictable load instructions.
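As an illustration of the perceptron-style prediction mentioned above, the following is a minimal sketch of a hashed perceptron that learns whether a load is likely to go off-chip. It is not the dissertation's design: the class name `OffChipPerceptron`, the two features (program counter and page number), the table size, and the training threshold are all assumptions chosen for readability.

```python
class OffChipPerceptron:
    """Toy hashed perceptron (illustrative only, not the dissertation's predictor)."""

    def __init__(self, num_features, table_size=1024, threshold=8, weight_limit=31):
        # One small weight table per feature; all sizes here are assumed values.
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold
        self.weight_limit = weight_limit

    def _indices(self, features):
        # Mix the feature position into the value so each table hashes differently.
        return [(value ^ (i * 0x9E3779B1)) % self.table_size
                for i, value in enumerate(features)]

    def score(self, features):
        # Sum the weights selected by each feature.
        return sum(table[j] for table, j in zip(self.tables, self._indices(features)))

    def predict(self, features):
        # A non-negative sum means "predicted to go off-chip".
        return self.score(features) >= 0

    def train(self, features, went_off_chip):
        total = self.score(features)
        mispredicted = (total >= 0) != went_off_chip
        # Update on a misprediction or when confidence is below the threshold.
        if mispredicted or abs(total) < self.threshold:
            delta = 1 if went_off_chip else -1
            for table, j in zip(self.tables, self._indices(features)):
                table[j] = max(-self.weight_limit, min(self.weight_limit, table[j] + delta))


# Hypothetical usage: features are (program counter, page number) of a load.
predictor = OffChipPerceptron(num_features=2)
for _ in range(10):
    predictor.train((0x400, 7), went_off_chip=True)
    predictor.train((0x500, 3), went_off_chip=False)
```

Saturating weights and a confidence threshold keep the state small and stop training once a pattern is learned, which is the usual reason such predictors are considered lightweight enough for hardware.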
Our extensive evaluation shows that the proposed techniques significantly improve performance and energy efficiency compared to prior state-of-the-art approaches.
2026-03-08T15:34:25Z
Rahul Bera

http://arxiv.org/abs/2603.07621v1
Performance Evaluation of Automated Multi-Service Deployment in Edge-Cloud Environments with the CODECO Toolkit
2026-03-08T13:13:47Z
Containerized microservices are widely adopted for latency-sensitive and compute-intensive applications, with Kubernetes (K8s) as the dominant orchestration platform. However, automating the deployment and management of multi-service applications remains challenging, particularly in heterogeneous Edge-Cloud environments. This paper evaluates the CODECO toolkit, an open-source framework designed to enhance container orchestration across distributed infrastructures. We compare CODECO with baseline K8s workflows using three key performance indicators: deployment time, level of manual intervention, and runtime performance with resource utilization. Experiments across diverse hardware platforms (ARM, AMD, RPi) and K8s distributions, including lightweight variants such as k3s, demonstrate that CODECO substantially reduces manual effort while maintaining competitive performance and acceptable overhead. These results validate CODECO as an effective solution for Edge-Cloud orchestration and highlight its potential to improve the flexibility and intelligence of K8s-based deployments.
2026-03-08T13:13:47Z
Georgios Koukis, Ioannis Dermentzis, Vassilis Tsaoussidis, Jan Lenke, Fabian Wolk, Daniel Uceda, Guillermo Sanchez, Miguel A. Puentes, Javier Serrano, Panagiotis Karamolegkos, Rute C. Sofia

http://arxiv.org/abs/2603.07607v1
MAS-H2: A Hierarchical Multi-Agent System for Holistic Cloud-Native Autoscaling
2026-03-08T12:39:14Z
Autoscaling in cloud-native platforms like Kubernetes is reactive and metric-driven, leading to a strategic void problem. This comes from the decoupling of higher-level business policies from lower-level resource provisioning.
The strategic void, coupled with a fragmented coordination of pod and node scaling, can lead to significant resource waste and performance degradation under dynamic workloads. In this paper, we present MAS-H2, a new hierarchical multi-agent system that addresses the challenges of autonomic cloud resource management with a complete end-to-end solution. MAS-H2 systematically decomposes the control problem into three layers: a Strategic Agent that formalises business policies (e.g., cost vs. performance) into a global utility function; Planning Agents that produce a joint, proactive scaling plan for pods and nodes with time-series forecasting; and Execution Agents that execute the scaling plan. We built and tested a MAS-H2 prototype as a Kubernetes Operator on Google Kubernetes Engine (GKE) to benchmark it against the native Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler (CA) baselines under two realistic, spiky, and stress-inducing workload scenarios. The results show that the MAS-H2 system maintained application CPU usage under 40% for predictable Heartbeat workloads. This resulted in over 50% less sustained CPU stress than the native HPA baseline, which typically operated above 80%. The MAS-H2 system demonstrated proactive planning in a volatile Chaotic Flash Sale scenario by filtering transient noise and deploying more replicas compared to HPA. It reduced peak CPU load by 55% without under-provisioning. Beyond performance, MAS-H2 seamlessly performed a zero-downtime strategic migration between two cost- and performance-optimised infrastructures.
2026-03-08T12:39:14Z
Hamed Hamzeh, Parisa Vahdatian

http://arxiv.org/abs/2603.07456v1
Agentic AI-Driven UAV Network Deployment: A LLM-Enhanced Exact Potential Game Approach
2026-03-08T04:13:09Z
Unmanned Aerial Vehicular Networks (UAVNs) are envisioned to provide flexible connectivity, wide-area coverage, and low-latency services in dynamic environments.
From an agentic artificial intelligence (Agentic AI) perspective, UAVNs naturally operate as multi-agent systems, where autonomous UAVs act as intelligent agents that coordinate deployment and networking decisions to achieve global performance objectives. However, the strong coupling between discrete link decisions and continuous deployment parameters makes UAVN topology optimization a mixed-integer nonconvex problem, resulting in challenges in scalability, efficiency, and solution consistency under dynamic network conditions. This paper proposes a dual spatial-scale UAVN topology optimization framework based on exact potential games (EPGs), enhanced by Agentic AI. At the large spatial scale, a log-linear learning based EPG (L3-EPG) algorithm is developed to optimize inter-UAV link configurations, enabling sparse yet connected network topologies while reducing redundant links and interference. At the small spatial scale, an approximate gradient based EPG (AG-EPG) algorithm jointly optimizes UAV deployment, transmission power allocation, and ground user (GU) association to improve network throughput and latency. To further enhance adaptability across heterogeneous scenarios, a large language model (LLM) is incorporated as a knowledge-driven decision enhancer to automatically generate utility weights according to network characteristics, alleviating reliance on manual parameter tuning. Simulation results demonstrate that the proposed framework consistently outperforms baseline methods in terms of energy consumption, end-to-end latency, and system throughput.
2026-03-08T04:13:09Z
13 pages, 8 figures
Xin Tang, Qian Chen, Binhan Liao, Yaqi Zhang, Jianxin Chen, Changyuan Zhao, Junchuan Fan, Junxi Tian, Xiaohuan Li

http://arxiv.org/abs/2603.07391v1
Link Wars: The Semantic Crisis. Is the debate over or is it just beginning?
2026-03-08T00:36:23Z
For fifty years, networking has fragmented whenever new workloads exposed hidden assumptions about time, ordering, failure, and trust.
This paper argues that the current interconnect landscape -- NVLink, UALink, Ultra Ethernet, AELink/Aethernet, TTPoE, and classical RDMA -- suffers from a semantic crisis: vendor-specific divergence disguised as optimization. We trace this crisis to the Forward-In-Time-Only (FITO) category mistake embedded in every major fabric stack, and show how each pathology -- aspirational RDMA completion, fire-and-forget GPU semantics, opaque proprietary stacks, incompatible multi-cloud ordering, universal fencing -- arises from the same failure to define explicit, testable link semantics from APIs to bits on the wire. We conjecture that RDMA achieves reliability through universal fencing that collapses concurrency into serialized checkpoints, and that precise minimal semantics can maintain correctness without global barriers, as superscalar architectures separated execution from retirement. We describe how Open Atomic Ethernet (OAE) under the Open Compute Project addresses the crisis through bilateral transaction primitives with explicit ordering, completion, and failure visibility. Drawing on Helland's analysis of scalable OLTP isolation (the "BIG DEAL"), we show the crisis pervades the entire stack. We assess whether convergence on a single open standard is still possible or whether fragmentation is now structural.
2026-03-08T00:36:23Z
17 pages, 34 references
Paul Borrill

http://arxiv.org/abs/2602.19433v3
Why iCloud Fails: The Category Mistake of Cloud Synchronization
2026-03-07T21:42:23Z
iCloud Drive presents a filesystem interface but implements cloud synchronization semantics that diverge from POSIX in fundamental ways. This divergence is not an implementation bug; it is a Category Mistake -- the same one that pervades distributed computing wherever Forward-In-Time-Only (FITO) assumptions are embedded into protocol design. Parker et al. showed in 1983 that network partitioning destroys mutual consistency; iCloud adds a user interface that conceals this impossibility behind a facade of seamlessness. This document presents a unified analysis of why iCloud fails when composed with Time Machine, git, automated toolchains, and general-purpose developer workflows, supported by direct evidence including documented corruption events and a case study involving 366 GB of divergent state accumulated through normal use. We show that the failures arise from five interlocking incompatibilities rooted in a single structural error: the projection of a distributed causal graph onto a linear temporal chain. We then show how the same Category Mistake, when it occurs in network fabrics as link flapping, destroys topology knowledge through epistemic collapse. Finally, we argue that Open Atomic Ethernet (OAE) transactional semantics -- bilateral, reversible, and conservation-preserving -- provide the structural foundation for resolving these failures, not by defeating physics, but by aligning protocol behavior with physical reality.
2026-02-23T02:03:03Z
28 pages, 7 figures, 36 references
Paul Borrill

http://arxiv.org/abs/2603.07345v1
Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure
2026-03-07T21:13:09Z
Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model--each service provisioned to handle global traffic independently across two regions--leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state.
During rare "full-peak" failovers, non-critical services are selectively preempted and rapidly restored, with differentiated Service-Level Agreements (SLAs) using on-demand capacity. Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from ~20% to ~30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.
2026-03-07T21:13:09Z
Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty

http://arxiv.org/abs/2512.22364v2
Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL
2026-03-07T21:01:20Z
While Text-to-SQL systems achieve high accuracy, existing efficiency metrics like the Valid Efficiency Score prioritize execution time, a metric we show is fundamentally decoupled from consumption-based cloud billing. This paper evaluates cloud query execution cost trade-offs between reasoning and non-reasoning Large Language Models by performing 180 Text-to-SQL query executions across six LLMs on Google BigQuery using the 230 GB StackOverflow dataset. Our analysis reveals that reasoning models process 44.5% fewer bytes than non-reasoning counterparts while maintaining equivalent correctness at 96.7% to 100%, and that execution time correlates weakly with query cost at $r=0.16$, indicating that speed optimization does not imply cost efficiency.
Non-reasoning models also exhibit extreme cost variance of up to 3.4$\times$, producing outliers exceeding 36 GB per query, over 20$\times$ the best model's 1.8 GB average, due to missing partition filters and inefficient joins. We identify these prevalent inefficiency patterns and provide deployment guidelines to mitigate financial risks in cost-sensitive enterprise environments.
2025-12-26T19:51:35Z
Saurabh Deochake, Debajyoti Mukhopadhyay

http://arxiv.org/abs/2602.04652v2
Six Times to Spare: Characterizing GPU-Accelerated 5G LDPC Decoding for Edge-RSU Communications
2026-03-07T17:58:18Z
Ultra-reliable low-latency vehicular communications (URLLC) require sufficient physical-layer (PHY) compute headroom at the network edge, where roadside units (RSUs) and compact next-generation base stations (gNBs) must meet strict timing constraints while co-hosting higher-layer services. In 5G New Radio (5G NR), low-density parity-check code (LDPC) decoding is a latency-sensitive iterative PHY workload whose cost scales with both workload parallelism and decoder iteration budget, making it a potential bottleneck on general-purpose central processing units (CPUs). This paper presents a reproducible, telemetry-backed microbenchmark derived from the Sionna LDPC5G baseline to characterize the compute headroom obtained through graphics processing unit (GPU) offload on compact heterogeneous edge platforms. We evaluate decoder behavior across multiple processor architectures and a wide range of batch sizes and iteration counts, with emphasis on dense operating regimes relevant to edge provisioning. Results show that GPU acceleration substantially increases LDPC throughput, reduces amortized decode service time, and shifts compute pressure away from the CPU, thereby improving the feasibility of meeting edge-RSU timing budgets under heavy parallel workloads.
These findings indicate that GPU offload can provide substantial spare PHY compute margin for compact vehicular edge platforms, making dense decode workloads more practical within realistic edge power and timing constraints.
2026-02-04T15:28:45Z
7 pages, 3 figures, 2 tables
Ryan Barker, Julia Boone, Tolunay Seyfi, Alireza Ebrahimi Dorcheh, Fatemeh Afghah, Joseph Boccuzzi

http://arxiv.org/abs/2603.07041v1
AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling
2026-03-07T05:25:53Z
Failures in clusters running large-scale AI workloads can result in decreased utilization. Because the cost of a failure in such AI workloads is high (as it requires restarting the entire job from a previous checkpoint), there are many mechanisms in place to ensure that the failures are mitigated, and the impact of a failure is minimized. However, these mechanisms have many knobs and parameters, all of which must be carefully tuned based on the system and cluster's characteristics. We built AIReSim, a discrete event simulator to evaluate the different design choices during the failure, recovery, scheduling and repair processes for a cluster running a large-scale AI workload.
AIReSim allows the system designer to systematically evaluate the effects of the different knobs and parameters on the overall end-to-end reliability of the system. Further, AIReSim can be used to identify which knobs or parameters are important in order to prioritize the investment of effort in improving the system. AIReSim also allows tuning the knobs to achieve different tradeoffs in the system and exploring various ``what-if'' scenarios. We present a case study of applying AIReSim for capacity planning for large-scale clusters running AI workloads.
2026-03-07T05:25:53Z
under submission; submitted version
Karthik Pattabiraman, Mihir Patel, Fred Lin

http://arxiv.org/abs/2603.06980v1
Configurable Runtime Orchestration for Dynamic Data Retrieval in Distributed Systems
2026-03-07T01:45:18Z
Modern enterprise platforms increasingly depend on distributed microservices, analytical data platforms, and external APIs to construct composite responses for applications. Orchestrating data retrieval across these heterogeneous systems is challenging because many workflow platforms rely on predefined workflows or state-machine definitions. Systems such as Apache Airflow, AWS Step Functions, and Temporal provide powerful orchestration capabilities but typically assume workflows are defined prior to execution. This paper presents a configuration-driven runtime orchestration framework for dynamic data retrieval in distributed systems. The framework generates execution graphs dynamically from configuration at request time, enabling low-latency orchestration without redeploying workflow code when integrations evolve. The execution planner performs dependency-aware scheduling and parallel execution of independent tasks, allowing efficient aggregation across distributed services. The paper describes the architecture, execution model, and operational tradeoffs of this framework, and presents a representative enterprise case study for Customer 360 retrieval.
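The dependency-aware scheduling described in the abstract above can be sketched, under assumptions, with Python's standard `graphlib` module. This is not the paper's framework: the configuration shape (`depends_on`), the task names, and the helpers `plan_stages` and `execute` are hypothetical and only illustrate building an execution graph from configuration at request time and running independent tasks in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def plan_stages(config):
    """Group task names into stages; every task in a stage has all dependencies met."""
    graph = {name: spec.get("depends_on", []) for name, spec in config.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the configuration has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


def execute(config, fetchers):
    """Run each stage's tasks in parallel, passing earlier results to later tasks."""
    results = {}
    with ThreadPoolExecutor() as pool:
        for stage in plan_stages(config):
            futures = {name: pool.submit(fetchers[name], dict(results)) for name in stage}
            stage_results = {name: future.result() for name, future in futures.items()}
            results.update(stage_results)
    return results


# Hypothetical request-time configuration for a customer profile lookup.
config = {
    "profile": {},
    "orders": {"depends_on": ["profile"]},
    "tickets": {"depends_on": ["profile"]},
    "summary": {"depends_on": ["orders", "tickets"]},
}
```

Because the graph is rebuilt from `config` on every call, adding an integration in this sketch means editing configuration rather than redeploying workflow code, which is the property the abstract emphasizes.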
The approach demonstrates how runtime configuration can enable flexible and scalable orchestration in rapidly evolving integration environments.
2026-03-07T01:45:18Z
11 pages, 1 architecture diagram, execution pseudocode, and system comparison table
Abhiram Kandiraju

http://arxiv.org/abs/2510.05109v5
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
2026-03-06T20:25:12Z
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware--software co-design inference framework for Large Multimodal Models (LMMs) that breaks large models into modular ``bricks'' (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3\% and GPU memory usage by 11.2\%.
This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.
2025-09-25T22:28:44Z
Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee

http://arxiv.org/abs/2603.06798v1
NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning
2026-03-06T19:02:31Z
The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute, limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure.
The source code of NEST is available at: https://github.com/scai-tech/Nest
2026-03-06T19:02:31Z
Accepted to MLSys 2026
Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan

http://arxiv.org/abs/2505.09764v3
FAST: An Efficient Scheduler for All-to-All GPU Communication
2026-03-06T16:48:51Z
All-to-All(v) communication is a critical primitive in modern machine learning workloads, particularly mixture-of-experts (MoE) models. Unfortunately, efficient scheduling is challenging due to workload skew, heterogeneous two-tier fabrics, and incast congestion, compounded by the dynamic nature of MoE workloads, where traffic shifts every few hundred milliseconds. Existing schedulers are hardly scalable, incurring seconds to hours of synthesis time, making them impractical. We present FAST, an efficient All-to-All(v) scheduler. FAST addresses skew through intra-server rebalancing and enforces balanced, one-to-one scale-out transfers that avoid incast. Evaluated extensively on both NVIDIA H200 and AMD MI300X clusters, FAST consistently outperforms state-of-the-art solutions on skewed workloads while reducing synthesis time by orders of magnitude.
2025-05-14T19:51:53Z
Accepted to 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 2026)
Yiran Lei, Dongjoo Lee, Liangyu Zhao, Daniar Kurniawan, Chanmyeong Kim, Heetaek Jeong, Changsu Kim, Hyeonseong Choi, Liangcheng Yu, Arvind Krishnamurthy, Justine Sherry, Eriko Nurvitadhi
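
To make the idea of one-to-one transfers that avoid incast (from the FAST abstract above) concrete, here is a minimal sketch of the classic rotation schedule for all-to-all exchange. It is not FAST's scheduler, which according to the abstract also rebalances skewed traffic within servers; the function name `rotation_schedule` and the assumption of uniform traffic between servers are illustrative only.

```python
def rotation_schedule(num_servers):
    """Return rounds of (sender, receiver) pairs for an all-to-all exchange.

    In round `shift`, server s sends to server (s + shift) % num_servers, so
    every server sends to exactly one peer and receives from exactly one peer.
    """
    return [
        [(src, (src + shift) % num_servers) for src in range(num_servers)]
        for shift in range(1, num_servers)
    ]


# Hypothetical usage: print the rounds for a four-server cluster.
for round_number, pairs in enumerate(rotation_schedule(4), start=1):
    print(round_number, pairs)
```

Each round is a permutation of the servers, so no receiver is targeted by more than one sender at a time, and after `num_servers - 1` rounds every ordered pair of distinct servers has exchanged data exactly once.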