https://arxiv.org/api/QXCXC55q2WpwE5Jqab4aytS+Ev0
2026-04-10T18:56:16Z
http://arxiv.org/abs/2603.08163v2
Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet
2026-03-10T09:34:33Z
Recently, there has been increased interest in globally distributed training, which has the promise to both reduce training costs and democratize participation in building large-scale foundation models. However, existing models trained in a globally distributed manner are relatively small in scale and have only been trained with whitelisted participants. Therefore, they do not yet realize the full promise of democratized participation. In this report, we describe Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run (in terms of both compute and model scale), which simultaneously allowed open, permissionless participation supported by a live blockchain protocol. We utilized a state-of-the-art communication-efficient optimizer, SparseLoCo, supporting dynamic participation with peers joining and leaving freely. Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run.
2026-03-09T09:44:13Z
26 pages, 6 figures, 4 tables; minor update, no content changes
Joel Lidin
Amir Sarfi
Erfan Miahi
Quentin Anthony
Shivam Chauhan
Evangelos Pappas
Benjamin Thérien
Eugene Belilovsky
Samuel Dare
http://arxiv.org/abs/2603.07685v2
Scalable Training of Mixture-of-Experts Models with Megatron Core
2026-03-10T06:23:58Z
Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack.
We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs.
This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
2026-03-08T15:42:43Z
Technical Report. 88 pages. 42 figures
Zijie Yan
NVIDIA
Hongxiao Bai
NVIDIA
Xin Yao
NVIDIA
Dennis Liu
NVIDIA
Tong Liu
NVIDIA
Hongbin Liu
NVIDIA
Pingtian Li
NVIDIA
Evan Wu
NVIDIA
Shiqing Fan
NVIDIA
Li Tao
NVIDIA
Robin Zhang
NVIDIA
Yuzhong Wang
NVIDIA
Shifang Xu
NVIDIA
Jack Chang
NVIDIA
Xuwen Chen
NVIDIA
Kunlun Li
NVIDIA
Yan Bai
NVIDIA
Gao Deng
NVIDIA
Nan Zheng
NVIDIA
Vijay Anand Korthikanti
NVIDIA
Abhinav Khattar
NVIDIA
Ethan He
NVIDIA
Soham Govande
NVIDIA
Sangkug Lym
NVIDIA
Zhongbo Zhu
NVIDIA
Qi Zhang
NVIDIA
Haochen Yuan
NVIDIA
Xiaowei Ren
NVIDIA
Deyu Fu
NVIDIA
Tailai Ma
NVIDIA
Shunkang Zhang
NVIDIA
Jiang Shao
NVIDIA
Ray Wang
NVIDIA
Vasudevan Rengasamy
NVIDIA
Rachit Garg
NVIDIA
Santosh Bhavani
NVIDIA
Xipeng Li
NVIDIA
Chandler Zhou
NVIDIA
David Wu
NVIDIA
Yingcan Wei
NVIDIA
Ashwath Aithal
NVIDIA
Michael Andersch
NVIDIA
Mohammad Shoeybi
NVIDIA
Jiajie Yao
NVIDIA
June Yang
NVIDIA
http://arxiv.org/abs/2603.09229v1
Flash-KMeans: Fast and Memory-Efficient Exact K-Means
2026-03-10T05:54:52Z
$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.
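The two ideas named in the abstract can be illustrated on the CPU with NumPy. This is only a sketch of the general techniques (a tiled distance computation with a running argmin, and a sort-then-segment centroid update), not the authors' GPU kernels; the function names `assign_streaming` and `update_sorted` and the `tile` parameter are hypothetical.

```python
import numpy as np

def assign_streaming(points, centroids, tile=256):
    """Nearest-centroid assignment without building the full N x K matrix.

    Centroids are processed in tiles; only an (N, tile) block of squared
    distances exists at any time, and a running minimum is kept per point.
    """
    n = points.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    p_sq = (points ** 2).sum(axis=1)
    for start in range(0, centroids.shape[0], tile):
        c = centroids[start:start + tile]
        # Squared Euclidean distances for this tile only.
        d = p_sq[:, None] - 2.0 * (points @ c.T) + (c ** 2).sum(axis=1)[None, :]
        local = d.argmin(axis=1)
        local_dist = d[np.arange(n), local]
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = start + local[better]
    return best_idx

def update_sorted(points, assign, k):
    """Centroid update via sorting instead of scattered accumulation.

    Points are reordered by cluster id so that each cluster occupies one
    contiguous segment, which is then reduced locally. Empty clusters are
    left at zero in this sketch.
    """
    order = np.argsort(assign, kind="stable")
    sorted_assign = assign[order]
    sorted_points = points[order]
    ids = np.arange(k)
    starts = np.searchsorted(sorted_assign, ids, side="left")
    ends = np.searchsorted(sorted_assign, ids, side="right")
    counts = ends - starts
    centroids = np.zeros((k, points.shape[1]))
    for j in range(k):
        if counts[j] > 0:
            centroids[j] = sorted_points[starts[j]:ends[j]].mean(axis=0)
    return centroids, counts
```

One Lloyd iteration under these assumptions is `idx = assign_streaming(X, C)` followed by `C, counts = update_sorted(X, idx, len(C))`.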
2026-03-10T05:54:52Z
Shuo Yang
Haocheng Xi
Yilong Zhao
Muyang Li
Xiaoze Fan
Jintao Zhang
Han Cai
Yujun Lin
Xiuyu Li
Kurt Keutzer
Song Han
Chenfeng Xu
Ion Stoica
http://arxiv.org/abs/2603.09216v1
PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies
2026-03-10T05:39:03Z
On-device deployments of large language models (LLMs) are rapidly proliferating across mobile and edge platforms. LLM inference comprises a compute-intensive prefill phase and a memory bandwidth-intensive decode phase, and the decode phase has been widely recognized as well-suited to processing-in-memory (PIM) in both academia and industry. However, practical PIM-enabled systems face two obstacles between these phases: a memory attribute inconsistency, in which prefill favors placing weights in a cacheable region for reuse whereas decode requires weights in a non-cacheable region to reliably trigger PIM, and a weight layout inconsistency between host-friendly and PIM-aware layouts. To address these problems, we introduce PIM-SHERPA, a software-only method for efficient on-device LLM inference that resolves PIM memory attribute and layout inconsistencies. PIM-SHERPA provides two approaches: DRAM double buffering (DDB), which keeps a single copy of the PIM-aware weights in the non-cacheable region while prefetching the swizzled weights of the next layer into small cacheable buffers, and online weight rearrangement with swizzled memory copy (OWR), which performs the on-demand swizzled memory copy immediately before GEMM. Compared to a baseline PIM emulation system, PIM-SHERPA achieves approximately 47.8-49.7% memory capacity savings while maintaining performance comparable to the theoretical maximum on the Llama 3.2 model. To the best of our knowledge, this is the first work to identify the memory attribute inconsistency and propose effective solutions on product-level PIM-enabled systems.
2026-03-10T05:39:03Z
13 pages, 13 figures
Sunjung Lee
Sanghoon Cha
Hyeonsu Kim
Seungwoo Seo
Yuhwan Ro
Sukhan Lee
Byeongho Kim
Yongjun Park
Kyomin Sohn
Seungwon Lee
Jaehoon Yu
http://arxiv.org/abs/2603.09191v1
Hierarchical Observe-Orient-Decide-Act Enabled UAV Swarms in Uncertain Environments: Frameworks, Potentials, and Challenges
2026-03-10T05:03:06Z
Unmanned aerial vehicle (UAV) swarms are increasingly explored for their potential in applications such as surveillance, disaster response, and military operations. However, UAV swarms face significant challenges in making effective and rapid decisions in dynamic and uncertain environments. Traditional decision-making frameworks, which rely mainly on centralized control and rigid architectures, are limited in adaptability and scalability, especially in complex environments. To overcome these challenges, we propose a hierarchical Observe-Orient-Decide-Act (H-OODA) loop based framework for UAV swarm operation in uncertain environments, which is implemented by embedding the classical OODA loop across the cloud-edge-terminal layers and leveraging network function virtualization (NFV) technology to provide flexible and scalable decision-making functions. In addition, based on the proposed H-OODA framework, we combine autonomous decision-making and cooperative control to enhance the adaptability and efficiency of UAV swarms. Furthermore, we present typical case studies to verify the improvement and efficiency of the proposed framework. Finally, potential challenges and possible directions are analyzed to provide insights for future H-OODA enabled UAV swarms.
2026-03-10T05:03:06Z
Ziye Jia
Yao Wu
Qihui Wu
Lijun He
Qiuming Zhu
Fuhui Zhou
Zhu Han
http://arxiv.org/abs/2603.09122v1
Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration
2026-03-10T02:55:37Z
Distributed key-value stores are widely adopted to support elastic big data applications, leveraging purpose-built consensus algorithms like Raft to ensure data consistency. However, through systematic analysis, we reveal a critical performance issue in such consistent stores: overlapping persistence operations between consensus protocols and underlying storage engines result in significant I/O overhead. To address this issue, we present Nezha, a prototype distributed storage system that integrates key-value separation with Raft to provide scalable throughput under a strong consistency guarantee. Nezha redesigns the persistence strategy at the operation level and incorporates leveled garbage collection, significantly improving read and write performance while preserving Raft's safety properties. Experimental results demonstrate that, on average, Nezha achieves throughput improvements of 460.2%, 12.5%, and 72.6% for put, get, and scan operations, respectively.
2026-03-10T02:55:37Z
Accepted to ICDE 2026 (main research track). The main paper is 12 pages excluding references
Yangyang Wang
Yucong Dong
Ziqian Cheng
Zichen Xu
http://arxiv.org/abs/2407.00011v2
Enhancing Computational Efficiency in Multiscale Systems Using Deep Learning of Coordinates and Flow Maps
2026-03-10T00:44:02Z
Complex systems often show macroscopic coherent behavior due to the interactions of microscopic agents like molecules, cells, or individuals in a population with their environment. However, simulating such systems poses several computational challenges, as the underlying dynamics vary and span wide spatiotemporal scales of interest. Capturing the fast-evolving features requires fine time steps, while the simulation must also run long enough to capture the slow-scale behavior, making the analyses computationally unmanageable. This paper showcases how deep learning techniques can be used to develop a precise time-stepping approach for multiscale systems using the joint discovery of coordinates and flow maps. While the former allows us to represent the multiscale dynamics on a representative basis, the latter enables the iterative time-stepping estimation of the reduced variables. The resulting framework achieves state-of-the-art predictive accuracy while incurring lower computational costs. We demonstrate this ability of the proposed scheme on the large-scale FitzHugh-Nagumo neuron model and the 1D Kuramoto-Sivashinsky equation in the chaotic regime.
2024-04-28T14:05:13Z
The submission needs revision
Asif Hamid
Danish Rafiq
Shahkar Ahmad Nahvi
Mohammad Abid Bazaz
http://arxiv.org/abs/2603.09038v1
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores
2026-03-10T00:12:47Z
Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.
2026-03-10T00:12:47Z
Jiqun Tu
Ian Karlin
John Camier
Veselin Dobrev
Tzanio Kolev
Stefan Henneking
Omar Ghattas
http://arxiv.org/abs/2603.09032v1
Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning
2026-03-10T00:07:38Z
Scientific machine learning (SciML) is increasingly applied to in-field processing, controlling, and monitoring; however, wide-area sensing, real-time demands, and strict energy and reliability constraints make centralized SciML implementation impractical. Most SciML models assume raw data aggregation at a central node, incurring prohibitively high communication latency and energy costs; yet, distributing models developed for general-purpose ML often breaks essential physical principles, resulting in degraded performance. To address these challenges, we introduce EPIC, a hardware- and physics-co-guided distributed SciML framework, using full-waveform inversion (FWI) as a representative task. EPIC performs lightweight local encoding on end devices and physics-aware decoding at a central node. By transmitting compact latent features rather than high-volume raw data and by using cross-attention to capture inter-receiver wavefield coupling, EPIC significantly reduces communication cost while preserving physical fidelity. Evaluated on a distributed testbed with five end devices and one central node, and across 10 datasets from OpenFWI, EPIC reduces latency by 8.9$\times$ and communication energy by 33.8$\times$, while even improving reconstruction fidelity on 8 out of 10 datasets.
2026-03-10T00:07:38Z
7 pages, 9 figures. Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC 2026), Long Beach, CA, July 2026
Yuchen Yuan
Junhuan Yang
Hao Wan
Yipei Liu
Hanhan Wu
Youzuo Lin
Lei Yang
http://arxiv.org/abs/2603.09025v1
Lockbox -- A Zero Trust Architecture for Secure Processing of Sensitive Cloud Workloads
2026-03-09T23:45:00Z
Enterprises increasingly rely on cloud-based applications to process highly sensitive data artifacts. Although cloud adoption improves agility and scalability, it also introduces new security challenges such as expanded attack surfaces, a wider radius of attack from credential compromise, and difficulty maintaining strict access controls across users, services, and workflows. These challenges are especially acute for applications that handle privileged data and execute security-critical analysis, where traditional trust boundaries and ad hoc safeguards are insufficient. This paper presents Lockbox, a Zero Trust architecture designed for secure processing of sensitive cloud workloads under strict enterprise security and governance requirements. Lockbox applies explicit trust verification, strong isolation, least-privilege access, and policy-driven enforcement throughout the entire application lifecycle, from user authentication and document ingestion to analysis execution and result storage. The system incorporates modern cloud security primitives, including role-based access control, centralized key management, encryption in transit and at rest, and controlled integration with cloud-based data processing services, ensuring that sensitive artifacts remain protected and accessible only to authorized users. We discuss the usage of Lockbox in processing highly sensitive cybersecurity reports and demonstrate how this architecture enables organizations to safely adopt advanced capabilities, including AI-assisted processing, without weakening their security posture.
2026-03-09T23:45:00Z
Vamshi Krishna Thotempudi
Mahima Agarwal
Raghav Batta
Anjali Mangal
http://arxiv.org/abs/2603.08960v1
The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference
2026-03-09T21:48:04Z
Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths.
We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE performance. Our evaluation across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, this results in a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable.
Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
2026-03-09T21:48:04Z
10 pages, 6 tables
Vignesh Adhinarayanan
Nuwan Jayasena
http://arxiv.org/abs/2603.08954v1
A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations
2026-03-09T21:40:17Z
The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLMs and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning on curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
2026-03-09T21:40:17Z
Accepted to CAC: Applied Computing & Automation Conferences 2026. 16 pages, 6 figures
Joshua Castillo
Ravi Mukkamala
http://arxiv.org/abs/2603.08911v1
FedLECC: Cluster- and Loss-Guided Client Selection for Federated Learning under Non-IID Data
2026-03-09T20:28:17Z
Federated Learning (FL) enables distributed Artificial Intelligence (AI) across cloud-edge environments by allowing collaborative model training without centralizing data. In cross-device deployments, FL systems face strict communication and participation constraints, as well as strongly non-independent and identically distributed (non-IID) data that degrades convergence and model quality. Since only a subset of devices (a.k.a. clients) can participate per training round, intelligent client selection becomes a key systems challenge. This paper proposes FedLECC (Federated Learning with Enhanced Cluster Choice), a lightweight, cluster-aware, and loss-guided client selection strategy for cross-device FL. FedLECC groups clients by label-distribution similarity and prioritizes clusters and clients with higher local loss, enabling the selection of a small yet informative and diverse set of clients. Experimental results under severe label skew show that FedLECC improves test accuracy by up to 12%, while reducing communication rounds by approximately 22% and overall communication overhead by up to 50% compared to strong baselines. These results demonstrate that informed client selection improves the efficiency and scalability of FL workloads in cloud-edge systems.
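The selection rule described in the abstract (prefer clusters, then clients, with higher local loss) can be sketched as follows. This is a hypothetical illustration, not the FedLECC algorithm itself: it assumes cluster membership has already been computed from label distributions by some clustering step, and the names `select_clients`, `cluster_of`, `loss_of`, `num_clusters`, and `per_cluster` are invented for the example.

```python
from collections import defaultdict

def select_clients(cluster_of, loss_of, num_clusters, per_cluster):
    """Pick high-loss clients from the highest-loss clusters.

    cluster_of: dict mapping client id -> cluster id (assumed precomputed
                from label-distribution similarity).
    loss_of:    dict mapping client id -> latest local loss.
    """
    members = defaultdict(list)
    for client, cluster in cluster_of.items():
        members[cluster].append(client)

    # Rank clusters by the mean local loss of their members.
    mean_loss = {
        c: sum(loss_of[m] for m in ms) / len(ms) for c, ms in members.items()
    }
    top_clusters = sorted(mean_loss, key=mean_loss.get, reverse=True)[:num_clusters]

    # Within each chosen cluster, take the clients with the highest loss.
    selected = []
    for c in top_clusters:
        ranked = sorted(members[c], key=lambda m: loss_of[m], reverse=True)
        selected.extend(ranked[:per_cluster])
    return selected
```

Drawing from several clusters keeps the selected set diverse in label distribution, while the loss ranking favors clients the current model fits poorly.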
2026-03-09T20:28:17Z
Accepted to the IEEE International Workshop on Intelligent Cloud Computing and Networking (ICCN) from the IEEE International Conference on Computer Communications (INFOCOM) 2026
Daniel M. Jimenez-Gutierrez
Giovanni Giunta
Mehrdad Hassanzadeh
Aris Anagnostopoulos
Ioannis Chatzigiannakis
Andrea Vitaletti
http://arxiv.org/abs/2603.08854v1
DeZent: Decentralized z-Anonymity with Privacy-Preserving Coordination
2026-03-09T19:14:23Z
Analyzing large volumes of sensor network data, such as electricity consumption measurements from smart meters, is essential for modern applications but raises significant privacy concerns. The privacy-enhancing technology z-anonymity offers efficient anonymization for continuous data streams by suppressing rare values that could lead to re-identification, making it particularly suited for resource-constrained environments. Originally designed for centralized architectures, z-anonymity assumes a trusted central entity. In this paper, we introduce deZent, a decentralized implementation of z-anonymity that minimizes trust in the central entity by realizing local z-anonymity with lightweight coordination. We develop deZent using a stochastic counting structure and secure sum to coordinate private anonymization across the network. Our results show that deZent achieves comparable performance to centralized z-anonymity in terms of publication ratio, while reducing the communication overhead towards the central entity. Thus, deZent presents a promising approach for enhancing privacy in sensor networks while preserving system efficiency.
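For orientation, the centralized baseline can be sketched in a few lines. The sketch assumes the usual formulation of z-anonymity, in which a value is published only if at least z distinct users reported it within a sliding time window; the abstract does not state this definition, and the class name `ZAnonymizer` and its parameters are hypothetical. It does not model deZent's stochastic counting structure or secure sum.

```python
from collections import defaultdict

class ZAnonymizer:
    """Centralized sliding-window suppression of rare values (sketch)."""

    def __init__(self, z, window):
        self.z = z            # minimum number of distinct users per value
        self.window = window  # window length, in the same units as timestamps
        self.last_seen = defaultdict(dict)  # value -> {user: last timestamp}

    def observe(self, timestamp, user, value):
        """Record one report; return True if it may be published."""
        seen = self.last_seen[value]
        seen[user] = timestamp
        # Forget users whose latest report of this value left the window.
        expired = [u for u, t in seen.items() if timestamp - t > self.window]
        for u in expired:
            del seen[u]
        return len(seen) >= self.z
```

With `z=2` and `window=10`, a value reported by a single user is suppressed, and it becomes publishable once a second user reports the same value within ten time units.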
2026-03-09T19:14:23Z
8 pages + appendix, 5 figures
Carolin Brunn
Florian Tschorsch
http://arxiv.org/abs/2603.08797v1
Serving Compound Inference Systems on Datacenter GPUs
2026-03-09T18:01:10Z
Applications in emerging domains such as XR are being built as compound inference systems, where multiple ML models are composed in the form of a task graph to service each request. Serving these compound systems efficiently raises two questions: how to apportion end-to-end latency and accuracy budgets between different tasks in a compound inference system, and how to allocate resources effectively for different models with varying resource requirements. We present JigsawServe, the first serving framework that jointly optimizes for latency, accuracy, and cost in terms of GPU resources by adaptively choosing model variants and performing fine-grained resource allocation by spatially partitioning the GPUs for each task of a compound inference system. Analytical evaluation of a system with a large number of GPUs shows that JigsawServe can increase the maximum serviceable demand (in requests per second) by 11.3x when compared to the closest prior work. Our empirical evaluation shows that for a large range of scenarios, JigsawServe consumes only 43.3% of the available GPU resources while meeting accuracy SLOs with less than 0.6% latency SLO violations. All of the features in JigsawServe contribute to this high efficiency -- sacrificing any one feature of accuracy scaling, GPU spatial partitioning, or task-graph-informed resource budgeting significantly reduces efficiency.
2026-03-09T18:01:10Z
Extended version of work that will be presented at the 5th HCDS workshop (co-located with ASPLOS 2026)
Sriram Devata
Rahul Singh
Sarita Adve