http://arxiv.org/abs/2603.11571v1
Subtime: Reversible Information Exchange and the Emergence of Classical Time
2026-03-12T05:50:47Z
We formalize the concept of subtime -- a reversible mode of information interchange within entangled systems -- and show how classical time emerges as an asymptotic limit through decoherence. Building on the photon clock model, in which a single photon confined between two ideal mirrors creates an alternating causality regime, we develop a process-theoretic formalization using the Oreshkov--Costa--Brukner framework extended with an explicit time-reversal duality condition. We introduce Perfect Information Feedback (PIF) as the information-theoretic realization of this reversibility, demonstrating that mutual information is conserved in any closed causal loop and that entropy quantifies the degree of unreflected causality. We define the Reversible Causal Principle (RCP): every causal relation possesses a conjugate dual, and entropy, energy dissipation, and the classical arrow of time appear only when these alternating components decohere or fail to reflect perfectly. The framework unifies Wheeler--Feynman absorber theory, Bennett's reversible computation, Shannon's communication theory, and the process matrix formalism under a single symmetry principle, and identifies experimentally accessible signatures in reversible digital links and quantum switch experiments. The arrow of time, in this picture, records the universe's imperfect causal echo.
2026-03-12T05:50:47Z
15 pages, 33 references
Paul L. Borrill

http://arxiv.org/abs/2505.09258v3
Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration
2026-03-12T03:55:13Z
Graph embeddings map graph nodes to continuous vectors and are foundational to community detection, recommendation, and many scientific applications.
At billion-scale, however, existing graph embedding systems face a trade-off: they either rely on large in-memory footprints across many GPUs (limited scalability) or repeatedly stream data from disk (incurring severe I/O overhead and low GPU utilization). In this paper, we propose Legend, a lightweight heterogeneous system for graph embedding that systematically redesigns data management across CPU, GPU, and NVMe SSD resources. Legend combines three practical ideas: (1) a prefetch-friendly embedding-loading order that lets GPUs efficiently prefetch necessary embeddings directly from NVMe SSD with low I/O amplification; (2) a high-throughput GPU-SSD direct-access driver tuned for the access patterns of embedding training; and (3) a customized parallel execution strategy that maximizes GPU utilization. Together, these components let Legend store and stream vast embedding data without overprovisioning GPU memory or suffering I/O stalls. Extensive experiments on billion-scale graphs demonstrate that Legend speeds up end-to-end workloads by up to 4.8x versus state-of-the-art systems, and matches their performance on the largest workloads while using only one quarter of the GPUs.
2025-05-14T10:13:40Z
Accepted by The VLDB Journal 2026
Zhonggen Li, Xiangyu Ke, Yifan Zhu, Yunjun Gao, Feifei Li

http://arxiv.org/abs/2603.07949v2
RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA Models
2026-03-12T03:21:07Z
Vision Language Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) inference offers an effective fix by easing edge-device computing pressure to meet real-time needs.
However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) Mainstream environment-oriented edge-cloud partitioning methods are susceptible to interference from visual noise; (2) Existing edge-cloud partitioning methods overlook the step-wise redundancy unique to embodied tasks, thereby disrupting the physical continuity of motion. To address these issues, we propose a novel ECC inference framework, termed RAPID. Specifically, we developed an implementation tailored to the proposed framework. Experiments demonstrate this achieves a speedup of up to 1.73x with only 5%-7% overhead.
2026-03-09T04:30:57Z
Zihao Zheng, Sicheng Tian, Hangyu Cao, Chenyue Li, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Guojie Luo, Xiang Chen

http://arxiv.org/abs/2603.11438v1
NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication
2026-03-12T02:03:55Z
NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself. NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating downtime previously required for policy updates.
Evaluations on 8x NVIDIA B300 GPUs connected via NVLink demonstrate that NCCLbpf imposes just 80-130 ns overhead per tuner decision (less than 0.03% of collective latency), prevents all tested unsafe plugin behaviors at load-time, and enables a message-size-aware eBPF policy that improves AllReduce throughput by up to 27% over NCCL's default in the 4-128 MiB range.
2026-03-12T02:03:55Z
Yusheng Zheng

http://arxiv.org/abs/2603.10902v1
Data Augmentation and Convolutional Network Architecture Influence on Distributed Learning
2026-03-11T15:46:53Z
Convolutional Neural Networks (CNNs) have proven to be highly effective in solving a broad spectrum of computer vision tasks, such as classification, identification, and segmentation. These methods can be deployed in both centralized and distributed environments, depending on the computational demands of the task. While much of the literature has focused on the explainability of CNNs, which is essential for building trust and confidence in their predictions, there remains a gap in understanding their impact on computational resources, particularly in distributed training contexts. In this study, we analyze how CNN architectures primarily influence model accuracy and investigate additional factors that affect computational efficiency in distributed systems. Our findings contribute valuable insights for optimizing the deployment of CNNs in resource-intensive scenarios, paving the way for further exploration of variables critical to distributed learning.
2026-03-11T15:46:53Z
Victor Forattini Jansen, Emanuel Teixeira Martins, Yasmin Souza Lima, Flavio de Oliveira Silva, Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira
10.22456/2175-2745.143508

http://arxiv.org/abs/2603.10850v1
Topological Analysis for Identifying Anomalies in Serverless Platforms
2026-03-11T15:00:02Z
The information flows in serverless platforms are complex and non-conservative.
This is a direct result of how independently deployed functions interact under the platform's coarse-grained control mechanisms. To manage this complexity, we introduce a topological model for serverless services. Using Hodge decomposition, we can separate observed operational flows into two distinct categories: components that can be corrected locally and harmonic modes that persist at any scale. Our analysis reveals that these harmonic flows emerge naturally from different types of inter-function interactions. They should be understood as structural properties of serverless systems, not as configuration errors. Building on this insight, we present an iterative method for analyzing inter-function flows. This method helps derive practical remediation strategies. One such strategy is the introduction of "dumping effects" to contain harmonic inefficiencies, offering an alternative to completely restructuring the service's topological model. Our experimental results confirm that this approach can uncover latent architectural structures.
2026-03-11T15:00:02Z
Submitted for journal publication
Gianluca Reali, Mauro Femminella

http://arxiv.org/abs/2603.10768v1
Aceso: Carbon-Aware and Cost-Effective Microservice Placement for Small and Medium-sized Enterprises
2026-03-11T13:45:58Z
Microservices are a dominant architecture in cloud computing, offering scalability and modularity, but also posing complex deployment challenges. As data centers contribute significantly to global carbon emissions, carbon-aware scheduling has emerged as a promising mitigation strategy. However, most existing solutions target batch, high-performance, or serverless workloads and assume access to global-scale infrastructure. Such an assumption does not hold for many national or regional small to medium-sized enterprises (SMEs) with microservice applications, which represent the real-world majority.
In this paper, we present Aceso, an Adaptive Carbon- and Efficiency-aware placement for microservices that considers carbon, cost, and latency constraints. Aceso dynamically places microservices across geographically constrained regions using a scalable optimization strategy that leverages insight-based search space pruning techniques. Evaluation on a real-world deployment shows that Aceso quickly adapts to real-time changes in workload and carbon intensity and reduces carbon emissions by 37.4% and operational cost by 3.6%, on average, compared to a static deployment within a single country, while consistently meeting SLOs. In this way, Aceso enables carbon- and cost-aware microservice deployment for latency-sensitive applications in regionally limited infrastructures for SMEs.
2026-03-11T13:45:58Z
Georgia Christofidi, Francisco Álvarez-Terribas, Ioannis Roumpos, Nicolas Kourtellis, Jesus Omaña Iglesias, Thaleia Dimitra Doudali

http://arxiv.org/abs/2511.14664v2
Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance
2026-03-11T13:37:26Z
As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and a requirement for algorithm development and validation, as well as hardware design. GPU-acceleration has become standard practice for simulation, and due to the exponential scaling inherent in classical methods, multi-GPU simulation can be required to achieve representative system sizes. In this case, inter-GPU communications can bottleneck performance. In this work, we present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review the advances in interconnect technology and the APIs for multi-GPU communication.
We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture that represents the first product to expand high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact with over 16X performance improvements in time to solution for multi-GPU simulations.
2025-11-18T17:04:28Z
15 Pages, 5 Figures, In press at Computer Physics Communications
W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira

http://arxiv.org/abs/2603.10726v1
CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems
2026-03-11T12:59:12Z
Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents CacheSolidarity, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. CacheSolidarity monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary.
Evaluation shows that CacheSolidarity enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. CacheSolidarity's lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
2026-03-11T12:59:12Z
Panagiotis Georgios Pennas, Konstantinos Papaioannou, Marco Guarnieri, Thaleia Dimitra Doudali

http://arxiv.org/abs/2312.09877v2
Optimal Transport Aggregation for Distributed Mixture-of-Experts
2026-03-11T12:36:20Z
Mixture-of-experts (MoE) models provide a flexible statistical framework for modeling heterogeneity and nonlinear relationships. In many modern applications, however, datasets are naturally distributed across multiple machines due to storage, computational, or governance constraints. We consider a distributed model aggregation setting in which local MoE models are trained independently on decentralized datasets and subsequently combined into a global estimator. Aggregating MoE models is challenging because standard averaging produces models that do not preserve the MoE structure, and therefore do not yield estimates of the global model parameters. To address this issue, we propose a principled aggregation framework based on optimal transport that constructs a reduced global MoE estimator by minimizing a transportation divergence between the collection of local estimators and the aggregated model. An efficient majorization--minimization (MM) algorithm is derived to solve the resulting optimization problem. The method requires only a single communication step from local machines to a central server, making it a frugal distributed learning approach particularly attractive for large-scale settings where communication costs are a major bottleneck. We further establish statistical guarantees for the aggregated estimator, including consistency under standard assumptions on the local estimators.
Experiments on synthetic and real datasets demonstrate that the approach achieves performance comparable to centralized training while significantly reducing computation time. The source code is publicly available on GitHub.
2023-12-15T15:26:13Z
Faïcel Chamroukhi, Nhat Thien Pham

http://arxiv.org/abs/2603.10634v1
Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
2026-03-11T10:49:01Z
In high-performance computing (HPC) applications, FP64 arithmetic remains indispensable for ensuring numerical accuracy and stability. However, in recent hardware generations, improvements in FP64 arithmetic performance have been relatively modest. Consequently, achieving sustained performance gains for FP64 computations necessitates the effective utilization of high-throughput low-precision arithmetic, such as INT8 and FP8. In several recent architectures, such as NVIDIA Blackwell Ultra and NVIDIA Rubin, INT8 performance has been significantly reduced, making reliance on INT8 alone insufficient. The use of FP8 arithmetic is thus increasingly important. In this paper, we propose a method for emulating double-precision (FP64) general matrix--matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many HPC applications, using FP8 matrix multiply-accumulate (MMA) units. The Ozaki-I and Ozaki-II schemes are well established as foundational approaches for emulating DGEMM via low-precision arithmetic. For DGEMM emulation via the Ozaki-I scheme, implementations using INT8, FP8, and FP16 MMA units have been proposed, all of which can be realized based on the same underlying algorithmic structure. In contrast, although implementations of DGEMM emulation via the Ozaki-II scheme using INT8 MMA units have been reported, the original algorithm cannot be directly adapted to exploit FP8 MMA units.
In this work, we introduce a novel technique to overcome this limitation and demonstrate FP64 matrix multiplication emulation based on the Ozaki-II scheme that operates on FP8 MMA units. Compared to FP8-based emulation via the Ozaki-I scheme, our method significantly reduces the number of required FP8 matrix multiplications and enables efficient FP64 emulation on emerging GPU architectures.
2026-03-11T10:49:01Z
11 pages, 8 figures
Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

http://arxiv.org/abs/2503.22452v3
On the Solvability of Byzantine-tolerant Reliable Communication in Dynamic Networks
2026-03-11T09:49:07Z
A reliable communication primitive guarantees the delivery, integrity, and authorship of messages exchanged between correct processes of a distributed system. We investigate the necessary and sufficient conditions for reliable communication in dynamic networks, where the network topology evolves over time despite the presence of a limited number of Byzantine faulty processes that may behave arbitrarily (i.e., in the globally bounded Byzantine failure model). We identify classes of dynamic networks where such conditions are satisfied, and extend our analysis to message losses, local computation with unbounded finite delay, and authenticated messages.
2025-03-28T14:05:33Z
Silvia Bonomi (DIAG UNIROMA), Giovanni Farina (UNICUSANO), Sébastien Tixeuil (NPA)

http://arxiv.org/abs/2603.10555v1
CD-Raft: Reducing the Latency of Distributed Consensus in Cross-Domain Sites
2026-03-11T09:04:41Z
Today's massive AI computation loads push heavy data synchronization across sites, i.e., nodes in data centers. Any reduction in such consensus latency can significantly improve the overall performance of desired systems. This consensus challenge is most acute at cross-domain sites. In this paper, we propose CD-Raft, an optimized Raft protocol for strong consistency in cross-domain sites, to address the cross-domain latency challenge.
CD-Raft can significantly reduce consensus latency by optimizing cross-domain round-trip time (RTT) for reads and writes, as well as carefully positioning the leader node. We verified the correctness of CD-Raft with a formal TLA+ specification, guaranteeing strong consistency across sites. We have prototyped CD-Raft and evaluated it using the YCSB benchmark. Empirical results show that, compared to classic Raft, CD-Raft reduces the average latency by 32.90% and the (99th percentile) tail latency by 49.24% for well-known traces across multiple sites.
2026-03-11T09:04:41Z
Yangyang Wang, Ziqian Cheng, Yucong Dong, Zichen Xu

http://arxiv.org/abs/2603.10514v1
Estimating the condition number of Chebyshev filtered vectors with application to the ChASE library
2026-03-11T08:10:31Z
Chebyshev filtered subspace iteration is a well-known algorithm for the solution of (symmetric/Hermitian) algebraic eigenproblems which has been implemented in several application codes~\cite{Kronik:2006ff, abinit:2020} or in stand-alone libraries~\cite{ChASE}. An essential part of the algorithm is the QR-factorization of the array of vectors spanning the active subspace that have been filtered by the Chebyshev filter. Typically such an array has an a priori unknown high condition number that directly influences the choice of QR-factorization algorithm. In this work we show how this condition number can be bounded from above with precise and inexpensive estimates. We then proceed to use these estimates to implement a mechanism for the choice of QR-factorization in the ChASE library. We show how this mechanism enhances the performance of the library without compromising on its accuracy.
2026-03-11T08:10:31Z
20 pages, 3 figures.
Journal paper to be submitted to SIAM SIMAX
Edoardo Di Napoli, Xinzhe Wu

http://arxiv.org/abs/2603.10436v1
COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints
2026-03-11T05:38:00Z
Large deep neural networks (DNNs), especially transformer-based and multimodal architectures, are computationally demanding and challenging to deploy on resource-constrained edge platforms like field robots. These challenges intensify in mission-critical scenarios (e.g., disaster response), where robots must collaborate under tight constraints on bandwidth, latency, and battery life, often without infrastructure or server support. To address these limitations, we present COHORT, a collaborative DNN inference and task-execution framework for multi-robot systems built on the Robot Operating System (ROS). COHORT employs a hybrid offline-online reinforcement learning (RL) strategy to dynamically schedule and distribute DNN module execution across robots. Our key contributions are threefold: (a) Offline RL policy learning combined with Advantage-Weighted Regression (AWR), trained on auction-based task allocation data from heterogeneous DNN workloads across distributed robots, (b) Online policy adaptation via Multi-Agent PPO (MAPPO), initialized from the offline policy and fine-tuned in real time, and (c) Comprehensive evaluation of COHORT on vision-language model (VLM) inference tasks such as CLIP and SAM, analyzing scalability with increasing robot/workload and robustness under . We benchmark COHORT against genetic algorithms and multiple RL baselines.
Experimental results demonstrate that COHORT reduces battery consumption by 15.4% and increases GPU utilization by 51.67%, while satisfying frame-rate and deadline constraints 2.55 times of the time.
2026-03-11T05:38:00Z
Recently accepted at 27th IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (IEEE WoWMoM 2026)
Mohammad Saeid Anwar, Anuradha Ravi, Indrajeet Ghosh, Gaurav Shinde, Carl Busart, Nirmalya Roy