http://arxiv.org/abs/2603.08163v2 Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet 2026-03-10T09:34:33Z

Recently, there has been increased interest in globally distributed training, which has the promise to both reduce training costs and democratize participation in building large-scale foundation models. However, existing models trained in a globally distributed manner are relatively small in scale and have only been trained with whitelisted participants. Therefore, they do not yet realize the full promise of democratized participation. In this report, we describe Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run (in terms of both compute and model scale), which simultaneously allowed open, permissionless participation supported by a live blockchain protocol. We utilized a state-of-the-art communication-efficient optimizer, SparseLoCo, supporting dynamic participation with peers joining and leaving freely. Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run.

2026-03-09T09:44:13Z 26 pages, 6 figures, 4 tables; minor update, no content changes Joel Lidin Amir Sarfi Erfan Miahi Quentin Anthony Shivam Chauhan Evangelos Pappas Benjamin Thérien Eugene Belilovsky Samuel Dare

http://arxiv.org/abs/2603.07685v2 Scalable Training of Mixture-of-Experts Models with Megatron Core 2026-03-10T06:23:58Z

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
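
The sparsity argument above can be illustrated with a minimal sketch. It assumes a softmax-over-top-k router, which is one common choice and not necessarily what Megatron Core uses; the names `top_k_route` and `ffn_param_counts` are hypothetical and exist only for this illustration.

```python
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts for one token and return
    (expert_ids, weights), with weights softmax-normalized over the
    selected experts only (an assumed routing rule)."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)                 # subtract max for stability
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    return top, [x / z for x in exps]

def ffn_param_counts(num_experts, k, params_per_expert):
    """Total expert parameters stored vs. expert parameters one token touches."""
    total = num_experts * params_per_expert
    active = k * params_per_expert
    return total, active

experts, weights = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
total, active = ffn_param_counts(num_experts=64, k=2, params_per_expert=1_000_000)
```

With 64 experts and k = 2, the stored expert parameters grow 32 times faster than the parameters a single token uses, which is the gap between memory and per-token compute that the abstract describes.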

2026-03-08T15:42:43Z Technical Report. 88 pages. 42 figures Zijie Yan NVIDIA Hongxiao Bai NVIDIA Xin Yao NVIDIA Dennis Liu NVIDIA Tong Liu NVIDIA Hongbin Liu NVIDIA Pingtian Li NVIDIA Evan Wu NVIDIA Shiqing Fan NVIDIA Li Tao NVIDIA Robin Zhang NVIDIA Yuzhong Wang NVIDIA Shifang Xu NVIDIA Jack Chang NVIDIA Xuwen Chen NVIDIA Kunlun Li NVIDIA Yan Bai NVIDIA Gao Deng NVIDIA Nan Zheng NVIDIA Vijay Anand Korthikanti NVIDIA Abhinav Khattar NVIDIA Ethan He NVIDIA Soham Govande NVIDIA Sangkug Lym NVIDIA Zhongbo Zhu NVIDIA Qi Zhang NVIDIA Haochen Yuan NVIDIA Xiaowei Ren NVIDIA Deyu Fu NVIDIA Tailai Ma NVIDIA Shunkang Zhang NVIDIA Jiang Shao NVIDIA Ray Wang NVIDIA Vasudevan Rengasamy NVIDIA Rachit Garg NVIDIA Santosh Bhavani NVIDIA Xipeng Li NVIDIA Chandler Zhou NVIDIA David Wu NVIDIA Yingcan Wei NVIDIA Ashwath Aithal NVIDIA Michael Andersch NVIDIA Mohammad Shoeybi NVIDIA Jiajie Yao NVIDIA June Yang NVIDIA

http://arxiv.org/abs/2603.09229v1 Flash-KMeans: Fast and Memory-Efficient Exact K-Means 2026-03-10T05:54:52Z

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.
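
The two ideas named above can be sketched on the CPU in plain Python. This is a minimal sketch and not the paper's GPU kernels: `assign_online` keeps only a running minimum per point instead of storing all $N \times K$ distances, and `update_sorted` sorts point indices by label so that each centroid is reduced over one contiguous segment instead of through scattered accumulations. Both function names and the `tile` parameter are hypothetical.

```python
from itertools import groupby

def assign_online(points, centroids, tile=2):
    """Nearest-centroid assignment with a running (best_dist, best_idx)
    per point; no N x K distance table is ever stored."""
    labels = []
    for p in points:
        best_d, best_k = float("inf"), -1
        for start in range(0, len(centroids), tile):      # one centroid tile at a time
            for k in range(start, min(start + tile, len(centroids))):
                d = sum((a - b) ** 2 for a, b in zip(p, centroids[k]))
                if d < best_d:                            # online argmin update
                    best_d, best_k = d, k
        labels.append(best_k)
    return labels

def update_sorted(points, labels, centroids):
    """Centroid update by sorting point ids by label, then averaging each
    contiguous segment. Empty clusters keep their previous centroid."""
    order = sorted(range(len(points)), key=labels.__getitem__)  # cluster -> point ids
    new = [list(c) for c in centroids]
    dim = len(points[0])
    for k, seg in groupby(order, key=labels.__getitem__):
        seg = list(seg)
        new[k] = [sum(points[i][j] for i in seg) / len(seg) for j in range(dim)]
    return new

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = [(0, 1), (10, 11), (100, 100)]
labels = assign_online(points, centroids)
new_centroids = update_sorted(points, labels, centroids)
```

The tiling here only mirrors the access pattern; on a GPU the benefit would come from keeping each tile and the running minimum in on-chip memory.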

2026-03-10T05:54:52Z Shuo Yang Haocheng Xi Yilong Zhao Muyang Li Xiaoze Fan Jintao Zhang Han Cai Yujun Lin Xiuyu Li Kurt Keutzer Song Han Chenfeng Xu Ion Stoica

http://arxiv.org/abs/2603.09216v1 PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies 2026-03-10T05:39:03Z

On-device deployments of large language models (LLMs) are rapidly proliferating across mobile and edge platforms. LLM inference comprises a compute-intensive prefill phase and a memory bandwidth-intensive decode phase, and the decode phase has been widely recognized as well-suited to processing-in-memory (PIM) in both academia and industry. However, practical PIM-enabled systems face two obstacles between these phases: a memory attribute inconsistency, in which prefill favors placing weights in a cacheable region for reuse whereas decode requires weights in a non-cacheable region to reliably trigger PIM, and a weight layout inconsistency between host-friendly and PIM-aware layouts. To address these problems, we introduce \textit{PIM-SHERPA}, a software-only method for efficient on-device LLM inference that resolves PIM memory attribute and layout inconsistencies. PIM-SHERPA provides two approaches: DRAM double buffering (DDB), which keeps a single copy of the PIM-aware weights in the non-cacheable region while prefetching the swizzled weights of the next layer into small cacheable buffers, and online weight rearrangement with swizzled memory copy (OWR), which performs an on-demand swizzled memory copy immediately before GEMM. Compared to a baseline PIM emulation system, PIM-SHERPA achieves approximately 47.8-49.7\% memory capacity savings while maintaining performance comparable to the theoretical maximum on the Llama 3.2 model. To the best of our knowledge, this is the first work to identify the memory attribute inconsistency and propose effective solutions on product-level PIM-enabled systems.

2026-03-10T05:39:03Z 13 pages, 13 figures Sunjung Lee Sanghoon Cha Hyeonsu Kim Seungwoo Seo Yuhwan Ro Sukhan Lee Byeongho Kim Yongjun Park Kyomin Sohn Seungwon Lee Jaehoon Yu

http://arxiv.org/abs/2603.09191v1 Hierarchical Observe-Orient-Decide-Act Enabled UAV Swarms in Uncertain Environments: Frameworks, Potentials, and Challenges 2026-03-10T05:03:06Z

Unmanned aerial vehicle (UAV) swarms are increasingly explored for their potential in applications such as surveillance, disaster response, and military operations. However, UAV swarms face significant challenges in making effective and rapid decisions in dynamic and uncertain environments. Traditional decision-making frameworks, which mainly rely on centralized control and rigid architectures, are limited in adaptability and scalability, especially in complex environments. To overcome these challenges, we propose a hierarchical Observe-Orient-Decide-Act (H-OODA) loop based framework for UAV swarm operation in uncertain environments. The framework embeds the classical OODA loop across the cloud-edge-terminal layers and leverages network function virtualization (NFV) technology to provide flexible and scalable decision-making functions. In addition, based on the proposed H-OODA framework, we combine autonomous decision-making and cooperative control to enhance the adaptability and efficiency of UAV swarms. Furthermore, we present some typical case studies to verify the improvement and efficiency of the proposed framework. Finally, we analyze the potential challenges and possible directions to provide insights for future H-OODA enabled UAV swarms.

2026-03-10T05:03:06Z Ziye Jia Yao Wu Qihui Wu Lijun He Qiuming Zhu Fuhui Zhou Zhu Han

http://arxiv.org/abs/2603.09122v1 Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration 2026-03-10T02:55:37Z

Distributed key-value stores are widely adopted to support elastic big data applications, leveraging purpose-built consensus algorithms like Raft to ensure data consistency. However, through systematic analysis, we reveal a critical performance issue in such consistent stores: overlapping persistence operations between consensus protocols and underlying storage engines result in significant I/O overhead. To address this issue, we present Nezha, a prototype distributed storage system that integrates key-value separation with Raft to provide scalable throughput under a strong consistency guarantee. Nezha redesigns the persistence strategy at the operation level and incorporates leveled garbage collection, significantly improving read and write performance while preserving Raft's safety properties. Experimental results demonstrate that, on average, Nezha achieves throughput improvements of 460.2%, 12.5%, and 72.6% for put, get, and scan operations, respectively.
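
Key-value separation in general can be sketched as follows. This is a generic, in-memory illustration and not Nezha's design: each value is appended once to a log, and the index holds only a small pointer per key. How that log is shared with the Raft log, and how leveled garbage collection works, is not reproduced here. The class names `ValueLog` and `SeparatedStore` are hypothetical.

```python
class ValueLog:
    """Append-only value storage; returns (offset, length) pointers."""
    def __init__(self):
        self.buf = bytearray()

    def append(self, value: bytes):
        off = len(self.buf)
        self.buf += value
        return off, len(value)

    def read(self, off, length):
        return bytes(self.buf[off:off + length])


class SeparatedStore:
    """Keys map to pointers into the value log. The dict stands in for the
    sorted on-disk index a real store would use."""
    def __init__(self):
        self.vlog = ValueLog()
        self.index = {}                      # key -> (offset, length)

    def put(self, key, value: bytes):
        self.index[key] = self.vlog.append(value)   # value bytes written once

    def get(self, key):
        ptr = self.index.get(key)
        return None if ptr is None else self.vlog.read(*ptr)

    def garbage_bytes(self):
        """Bytes in the log no longer referenced by any key; this is the
        space a garbage collector would reclaim."""
        live = sum(length for _, length in self.index.values())
        return len(self.vlog.buf) - live


store = SeparatedStore()
store.put("k", b"old-value")
store.put("k", b"new")
```

Overwriting a key leaves the old bytes in the log, which is why stores built this way need a garbage collection pass.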

2026-03-10T02:55:37Z Accepted to ICDE 2026 (main research track). The main paper is 12 pages excluding references Yangyang Wang Yucong Dong Ziqian Cheng Zichen Xu

http://arxiv.org/abs/2407.00011v2 Enhancing Computational Efficiency in Multiscale Systems Using Deep Learning of Coordinates and Flow Maps 2026-03-10T00:44:02Z

Complex systems often show macroscopic coherent behavior due to the interactions of microscopic agents like molecules, cells, or individuals in a population with their environment. However, simulating such systems poses several computational challenges, as the underlying dynamics vary and span wide spatiotemporal scales of interest. Capturing the fast-evolving features requires fine time steps, while the simulation must also run long enough to capture the slow-scale behavior, which makes the analyses computationally unmanageable. This paper showcases how deep learning techniques can be used to develop a precise time-stepping approach for multiscale systems using the joint discovery of coordinates and flow maps. While the former allows us to represent the multiscale dynamics on a representative basis, the latter enables the iterative time-stepping estimation of the reduced variables. The resulting framework achieves state-of-the-art predictive accuracy while incurring lower computational cost. We demonstrate this ability of the proposed scheme on the large-scale FitzHugh-Nagumo neuron model and the 1D Kuramoto-Sivashinsky equation in the chaotic regime.

2024-04-28T14:05:13Z The submission needs revision Asif Hamid Danish Rafiq Shahkar Ahmad Nahvi Mohammad Abid Bazaz

http://arxiv.org/abs/2603.09038v1 Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores 2026-03-10T00:12:47Z

Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.

2026-03-10T00:12:47Z Jiqun Tu Ian Karlin John Camier Veselin Dobrev Tzanio Kolev Stefan Henneking Omar Ghattas

http://arxiv.org/abs/2603.09032v1 Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning 2026-03-10T00:07:38Z

Scientific machine learning (SciML) is increasingly applied to in-field processing, controlling, and monitoring; however, wide-area sensing, real-time demands, and strict energy and reliability constraints make centralized SciML implementation impractical. Most SciML models assume raw data aggregation at a central node, incurring prohibitively high communication latency and energy costs; yet, distributing models developed for general-purpose ML often breaks essential physical principles, resulting in degraded performance. To address these challenges, we introduce EPIC, a hardware- and physics-co-guided distributed SciML framework, using full-waveform inversion (FWI) as a representative task. EPIC performs lightweight local encoding on end devices and physics-aware decoding at a central node. By transmitting compact latent features rather than high-volume raw data and by using cross-attention to capture inter-receiver wavefield coupling, EPIC significantly reduces communication cost while preserving physical fidelity. Evaluated on a distributed testbed with five end devices and one central node, and across 10 datasets from OpenFWI, EPIC reduces latency by 8.9$\times$ and communication energy by 33.8$\times$, while even improving reconstruction fidelity on 8 out of 10 datasets.

2026-03-10T00:07:38Z 7 pages, 9 figures. Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC 2026), Long Beach, CA, July 2026 Yuchen Yuan Junhuan Yang Hao Wan Yipei Liu Hanhan Wu Youzuo Lin Lei Yang

http://arxiv.org/abs/2603.09025v1 Lockbox -- A Zero Trust Architecture for Secure Processing of Sensitive Cloud Workloads 2026-03-09T23:45:00Z

Enterprises increasingly rely on cloud-based applications to process highly sensitive data artifacts. Although cloud adoption improves agility and scalability, it also introduces new security challenges such as expanded attack surfaces, a wider radius of attack from credential compromise, and difficulty maintaining strict access controls across users, services, and workflows. These challenges are especially acute for applications that handle privileged data and execute security-critical analysis, where traditional trust boundaries and ad hoc safeguards are insufficient. This paper presents Lockbox, a Zero Trust architecture designed for secure processing of sensitive cloud workloads under strict enterprise security and governance requirements. Lockbox applies explicit trust verification, strong isolation, least-privilege access, and policy-driven enforcement throughout the entire application lifecycle, from user authentication and document ingestion to analysis execution and result storage. The system incorporates modern cloud security primitives, including role-based access control, centralized key management, encryption in transit and at rest, and controlled integration with cloud-based data processing services, ensuring that sensitive artifacts remain protected and accessible only to authorized users. We discuss the usage of Lockbox in processing highly sensitive cybersecurity reports and demonstrate how this architecture enables organizations to safely adopt advanced capabilities, including AI-assisted processing, without weakening their security posture.

2026-03-09T23:45:00Z Vamshi Krishna Thotempudi Mahima Agarwal Raghav Batta Anjali Mangal

http://arxiv.org/abs/2603.08960v1 The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference 2026-03-09T21:48:04Z

Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE performance. Our evaluation across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, this results in a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable. Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.

2026-03-09T21:48:04Z 10 pages, 6 tables Vignesh Adhinarayanan Nuwan Jayasena

http://arxiv.org/abs/2603.08954v1 A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations 2026-03-09T21:40:17Z

The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLMs and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning on curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
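
The compare-and-resolve step can be pictured with a simple stand-in. In the sketch below a field-wise majority vote replaces the consensus LLM described above; this is a substitution made purely for illustration and not the Guardian design. The function name `field_consensus`, the `min_agree` threshold, and the field names are hypothetical.

```python
from collections import Counter

def field_consensus(outputs, min_agree=2):
    """Merge structured extractions from several models.
    A field is accepted when at least `min_agree` models give the same value;
    otherwise it is listed as disputed so it can be reviewed."""
    merged, disputed = {}, []
    fields = sorted({f for o in outputs for f in o})
    for f in fields:
        votes = Counter(o[f] for o in outputs if f in o)
        value, count = votes.most_common(1)[0]
        if count >= min_agree:
            merged[f] = value
        else:
            disputed.append(f)          # no agreement: leave for review
    return merged, disputed

outputs = [
    {"age": "9", "last_seen": "A"},
    {"age": "9", "last_seen": "B"},
    {"age": "8", "last_seen": "C"},
]
merged, disputed = field_consensus(outputs)
```

Returning the disputed fields separately, rather than guessing, keeps the result auditable in the sense the abstract emphasizes.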

2026-03-09T21:40:17Z Accepted to CAC: Applied Computing & Automation Conferences 2026. 16 pages, 6 figures Joshua Castillo Ravi Mukkamala

http://arxiv.org/abs/2603.08911v1 FedLECC: Cluster- and Loss-Guided Client Selection for Federated Learning under Non-IID Data 2026-03-09T20:28:17Z

Federated Learning (FL) enables distributed Artificial Intelligence (AI) across cloud-edge environments by allowing collaborative model training without centralizing data. In cross-device deployments, FL systems face strict communication and participation constraints, as well as strongly non-independent and identically distributed (non-IID) data that degrades convergence and model quality. Since only a subset of devices (i.e., clients) can participate per training round, intelligent client selection becomes a key systems challenge. This paper proposes FedLECC (Federated Learning with Enhanced Cluster Choice), a lightweight, cluster-aware, and loss-guided client selection strategy for cross-device FL. FedLECC groups clients by label-distribution similarity and prioritizes clusters and clients with higher local loss, enabling the selection of a small yet informative and diverse set of clients. Experimental results under severe label skew show that FedLECC improves test accuracy by up to 12%, while reducing communication rounds by approximately 22% and overall communication overhead by up to 50% compared to strong baselines. These results demonstrate that informed client selection improves the efficiency and scalability of FL workloads in cloud-edge systems.
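
One possible reading of cluster- and loss-guided selection is sketched below. It assumes the clustering by label distribution has already been done and is passed in as `cluster_of`. The ranking rule (clusters by mean local loss, then clients by loss within each chosen cluster) is an assumption for illustration and not necessarily FedLECC's exact procedure; `select_clients` and its parameters are hypothetical names.

```python
from collections import defaultdict

def select_clients(cluster_of, loss_of, num_clusters, per_cluster):
    """Pick clients for one round.
    cluster_of: client -> cluster id (assumed precomputed)
    loss_of:    client -> most recent local loss"""
    groups = defaultdict(list)
    for client, c in cluster_of.items():
        groups[c].append(client)

    def mean_loss(c):
        return sum(loss_of[i] for i in groups[c]) / len(groups[c])

    # Highest mean-loss clusters first, then highest-loss clients inside each.
    ranked = sorted(groups, key=mean_loss, reverse=True)
    chosen = []
    for c in ranked[:num_clusters]:
        members = sorted(groups[c], key=lambda i: loss_of[i], reverse=True)
        chosen.extend(members[:per_cluster])
    return chosen

cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
loss_of = {"a": 0.2, "b": 0.4, "c": 1.0, "d": 0.8, "e": 0.1}
chosen = select_clients(cluster_of, loss_of, num_clusters=2, per_cluster=1)
```

Drawing from several clusters keeps the selected set diverse in label distribution, while the loss ordering favors clients the current model fits worst.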

2026-03-09T20:28:17Z Accepted to the IEEE International Workshop on Intelligent Cloud Computing and Networking (ICCN) from the IEEE International Conference on Computer Communications (INFOCOM) 2026 Daniel M. Jimenez-Gutierrez Giovanni Giunta Mehrdad Hassanzadeh Aris Anagnostopoulos Ioannis Chatzigiannakis Andrea Vitaletti

http://arxiv.org/abs/2603.08854v1 DeZent: Decentralized z-Anonymity with Privacy-Preserving Coordination 2026-03-09T19:14:23Z

Analyzing large volumes of sensor network data, such as electricity consumption measurements from smart meters, is essential for modern applications but raises significant privacy concerns. Privacy-enhancing technologies like z-anonymity offer efficient anonymization for continuous data streams by suppressing rare values that could lead to re-identification, which makes z-anonymity particularly suited to resource-constrained environments. Originally designed for centralized architectures, z-anonymity assumes a trusted central entity. In this paper, we introduce deZent, a decentralized implementation of z-anonymity that minimizes trust in the central entity by realizing local z-anonymity with lightweight coordination. We develop deZent using a stochastic counting structure and secure sum to coordinate private anonymization across the network. Our results show that deZent achieves comparable performance to centralized z-anonymity in terms of publication ratio, while reducing the communication overhead towards the central entity. Thus, deZent presents a promising approach for enhancing privacy in sensor networks while preserving system efficiency.
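
The centralized baseline can be sketched as a stream filter. The rule used here, publish a value only if at least `z` distinct users reported it within the last `window` time units, is a simplified reading of z-anonymity and is stated as an assumption. deZent's stochastic counting structure and secure sum are not reproduced. The class name `ZAnonymizer` is hypothetical.

```python
from collections import defaultdict

class ZAnonymizer:
    """Centralized z-anonymity filter over a stream of (time, user, value)."""
    def __init__(self, z, window):
        self.z = z
        self.window = window
        self.last_seen = defaultdict(dict)    # value -> {user: last timestamp}

    def observe(self, t, user, value):
        """Record the report and return True if it may be published."""
        seen = self.last_seen[value]
        seen[user] = t
        # Forget users whose latest report of this value is outside the window.
        for u in [u for u, ts in seen.items() if t - ts > self.window]:
            del seen[u]
        return len(seen) >= self.z            # rare values are suppressed

anon = ZAnonymizer(z=2, window=10)
first = anon.observe(0, "u1", "x")     # only one user so far
second = anon.observe(5, "u2", "x")    # two distinct users within the window
third = anon.observe(20, "u1", "x")    # u2's report has expired
```

Note that this filter needs to see every user's raw reports, which is the trust assumption in the central entity that the decentralized design aims to reduce.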

2026-03-09T19:14:23Z 8 pages + appendix, 5 figures Carolin Brunn Florian Tschorsch

http://arxiv.org/abs/2603.08797v1 Serving Compound Inference Systems on Datacenter GPUs 2026-03-09T18:01:10Z

Applications in emerging domains such as XR are being built as compound inference systems, where multiple ML models are composed in the form of a task graph to service each request. Serving these compound systems efficiently raises two questions: how to apportion end-to-end latency and accuracy budgets between different tasks in a compound inference system, and how to allocate resources effectively for different models with varying resource requirements. We present JigsawServe, the first serving framework that jointly optimizes for latency, accuracy, and cost in terms of GPU resources by adaptively choosing model variants and performing fine-grained resource allocation by spatially partitioning the GPUs for each task of a compound inference system. Analytical evaluation of a system with a large number of GPUs shows that JigsawServe can increase the maximum serviceable demand (in requests per second) by 11.3x when compared to the closest prior work. Our empirical evaluation shows that for a large range of scenarios, JigsawServe consumes only 43.3% of the available GPU resources while meeting accuracy SLOs with less than 0.6% latency SLO violations. All of the features in JigsawServe contribute to this high efficiency -- sacrificing any one feature of accuracy scaling, GPU spatial partitioning, or task-graph-informed resource budgeting significantly reduces efficiency.

2026-03-09T18:01:10Z Extended version of work that will be presented at the 5th HCDS workshop (co-located with ASPLOS 2026) Sriram Devata Rahul Singh Sarita Adve