https://arxiv.org/api/7JjEFXjspePBtXult69V3OIkzkM 2026-06-10T19:56:14Z 28838 330 15 http://arxiv.org/abs/2503.11367v4 Efficient Distributed MLLM Training with Cornstarch 2026-05-24T17:29:55Z

Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. While a few works have attempted to address the heterogeneity in MLLM training, their approaches consider the characteristics of MLLMs only superficially. In this paper, we present Cornstarch, an efficient distributed MLLM training framework that accounts for the unique characteristics of MLLMs in both model and data parallelization. Cornstarch introduces frozen-aware pipeline parallelism and token workload-balanced context parallelism to improve MLLM training throughput. Our extensive evaluation shows that Cornstarch outperforms state-of-the-art solutions by $2.26\times$ on average in terms of MLLM training throughput. Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.

2025-03-14T13:07:45Z ICML'26 Insu Jang Runyu Lu Nikhil Bansal Ang Chen Mosharaf Chowdhury http://arxiv.org/abs/2505.07417v2 LA-IMR: Latency-Aware, Predictive In-Memory Routing and Proactive Autoscaling for Tail-Latency-Sensitive Cloud Robotics 2026-05-24T14:55:51Z

Hybrid cloud-edge infrastructures now support latency-critical workloads ranging from autonomous vehicles and surgical robotics to immersive AR/VR. However, they continue to experience crippling long-tail latency spikes whenever bursty request streams exceed the capacity of heterogeneous edge and cloud tiers. To address these long-tail latency issues, we present Latency-Aware, Predictive In-Memory Routing and Proactive Autoscaling (LA-IMR). This control layer integrates a closed-form, utilization-driven latency model with event-driven scheduling, replica autoscaling, and edge-to-cloud offloading to mitigate 99th-percentile (P99) delays. Our analytic model decomposes end-to-end latency into processing, network, and queuing components, expressing inference latency as an affine power-law function of instance utilization. Once calibrated, it produces two complementary functions that drive: (i) millisecond-scale routing decisions for traffic offloading, and (ii) capacity planning that jointly determines replica pool sizes. LA-IMR enacts these decisions through a quality-differentiated, multi-queue scheduler and a custom-metric Kubernetes autoscaler that scales replicas proactively -- before queues build up -- rather than reactively based on lagging CPU metrics. Across representative vision workloads (YOLOv5m and EfficientDet) and bursty arrival traces, LA-IMR reduces P99 latency by up to 20.7 percent compared to traditional latency-only autoscaling, laying a principled foundation for next-generation, tail-tolerant cloud-edge inference services.
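
The latency decomposition in the abstract above can be sketched as follows. This is a minimal illustration, assuming the affine power-law takes the form `a + b * utilization**c`; the constants `a`, `b`, `c`, the function names, and the SLO-based offloading rule are hypothetical and are not taken from the paper.

```python
def inference_latency_ms(utilization: float, a: float, b: float, c: float) -> float:
    """Assumed affine power-law model: a + b * utilization**c."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return a + b * utilization ** c


def should_offload(utilization: float, network_ms: float, queue_ms: float,
                   slo_ms: float, a: float = 20.0, b: float = 180.0,
                   c: float = 3.0) -> bool:
    """Offload when predicted end-to-end latency exceeds the SLO.

    End-to-end latency is modeled as processing + network + queuing,
    mirroring the three components named in the abstract. The default
    constants are illustrative placeholders, not calibrated values.
    """
    predicted = inference_latency_ms(utilization, a, b, c) + network_ms + queue_ms
    return predicted > slo_ms
```

With these placeholder constants, a replica at 50% utilization stays local under a 100 ms SLO, while one at 90% utilization would be offloaded. In practice the constants would be fitted per model and per tier.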

2025-05-12T10:12:24Z v2: Bibliography audited after citation-verification review; unverifiable references removed or replaced with traceable sources; related-work contextual discussion revised accordingly. Core method, algorithms, experiments, and reported results unchanged Eunil Seo Chanh Nguyen Erik Elmroth http://arxiv.org/abs/2605.10718v2 An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum 2026-05-24T14:23:35Z

Grey failures in the computing continuum produce ambiguous, overlapping symptoms that existing approaches fail to diagnose reliably, either because they lack causal awareness or because they act under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate while maintaining 62.0% repair accuracy and a 3 ms mean time to repair.
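
The dual-gated execution rule described above can be illustrated with a small sketch. The `Diagnosis` type, the `decide` function, and both threshold values are hypothetical and chosen only for illustration; the abstract does not specify how either gate is computed.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    root_cause: str
    causal_confidence: float      # assumed to lie in [0, 1]
    epistemic_uncertainty: float  # predicted uncertainty, lower is better


def decide(d: Diagnosis, min_confidence: float = 0.9,
           max_uncertainty: float = 0.2) -> str:
    """Return "remediate" only when both gates pass, else "escalate".

    Gate 1: causal confidence is high enough.
    Gate 2: predicted epistemic uncertainty is bounded.
    Thresholds are illustrative assumptions.
    """
    confident = d.causal_confidence >= min_confidence
    bounded = d.epistemic_uncertainty <= max_uncertainty
    return "remediate" if confident and bounded else "escalate"
```

Requiring both gates means a confident but highly uncertain diagnosis is still escalated to the fog tier rather than acted on locally.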

2026-05-11T15:28:37Z Suvi De Silva Alfreds Lapkovskis Alaa Saleh Sasu Tarkoma Praveen Kumar Donta http://arxiv.org/abs/2507.15233v3 A Multi-Armed Bandit-Based Participant Selection Method for Federated Recommendation Systems 2026-05-24T11:43:51Z

Federated Recommendation Systems (FRS) enable privacy-preserving model training by keeping user data on edge devices. However, the practical deployment of FRS in Edge-Cloud environments faces significant challenges due to system and statistical heterogeneity. Existing FRS participant selection strategies struggle to dynamically balance the trade-off between model convergence speed and recommendation quality in such volatile environments. To address this, we formulate the FRS participant selection problem as a normalized utility cost that accounts for model quality and system efficiency. Next, we propose a dynamic participant selection framework incorporating a Multi-Armed Bandit (MAB)-based solver for multimodal FRS. We design a client-utility function that jointly evaluates historical Client Performance Reputation, data quality, and real-time system latency. By leveraging an Upper Confidence Bound strategy, our framework effectively balances the exploration of under-sampled clients with the exploitation of high-performing ones. We validate the proposed approach on a realistic edge-cloud testbed implementation using a multimodal movie-recommendation task. Experimental results demonstrate that our MAB-driven approach outperforms other baselines across eight different data-skew scenarios. Specifically, it improves training efficiency by 32-50% while improving model quality metrics such as Recall@50 by up to about 5%.
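
To make the exploration/exploitation balance concrete, the sketch below uses the textbook UCB1 score, which may differ from the variant used in the paper. Here `mean_utility` stands in for the client-utility function described above; the function name and the `exploration` constant are hypothetical.

```python
import math


def ucb_select(mean_utility: dict, pulls: dict, round_index: int, k: int,
               exploration: float = 2.0) -> list:
    """Pick the k clients with the highest UCB1-style score.

    score = mean utility + sqrt(exploration * ln(round_index) / pulls).
    Clients that were never selected get an infinite score, so they are
    explored first. round_index must be at least 1.
    """
    scores = {}
    for client, mean in mean_utility.items():
        n = pulls.get(client, 0)
        if n == 0:
            scores[client] = math.inf
        else:
            bonus = math.sqrt(exploration * math.log(round_index) / n)
            scores[client] = mean + bonus
    # Sort by descending score; ties are broken by client name.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]
```

A client with few past selections receives a large bonus and can outrank a client with a higher observed mean, which is how under-sampled clients get explored.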

2025-07-21T04:28:55Z Accepted in IEEE/ACM CCGRID 2026 Jintao Liu Mohammad Goudarzi Adel Nadjaran Toosi http://arxiv.org/abs/2605.24832v1 Optimus: Elastic Decoding for Efficient Diffusion LLM Serving 2026-05-24T02:56:46Z

Large language model (LLM) serving is fundamentally limited by inefficient hardware utilization. Autoregressive (AR) decoding underutilizes GPUs due to its strictly sequential execution, while diffusion LLMs (DLLMs) improve throughput by decoding multiple tokens per iteration. However, fixed block-size diffusion decoding exhibits strong load sensitivity: large blocks exploit idle GPU resources under low load, but saturate early and incur substantial redundant computation under high load. As a result, throughput gains vanish beyond saturation, and no single decoding granularity performs well across dynamic serving workloads. We present Optimus, a serving system that enables elastic decoding for diffusion LLMs by dynamically adapting decoding granularity to runtime load. The key idea is to treat decoding granularity as a runtime control variable, balancing GPU utilization and token efficiency. Optimus combines chunked decoding, which enables fine-grained execution without retraining, with saturation-aware scheduling, a closed-loop mechanism that selects chunk sizes based on runtime conditions. Together with system-level optimizations and customized attention kernels, Optimus achieves significant performance improvements while preserving model accuracy. Experiments show that Optimus delivers up to 6.1x throughput improvement over AR decoding and 4.3x improvement over fixed-block diffusion LLM, while maintaining stable performance across diverse load regimes and improving end-to-end serving capacity under latency constraints. The source code is available at https://github.com/dubcyfor3/Optimus.

2026-05-24T02:56:46Z Chiyue Wei Cong Guo Bowen Duan Junyao Zhang Haoxuan Shan Yifei Wang Yangjie Zhou Hai "Helen" Li Danyang Zhuo Yiran Chen http://arxiv.org/abs/2602.21247v2 PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing 2026-05-24T00:23:16Z

The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have long construction times because they rely on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids the "search bottleneck" that existing graph-based methods suffer from. PiPNN's core innovation is HashPrune, a novel online pruning algorithm that dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction, which permits PiPNN to build higher-quality indices without the use of extra intermediate memory. PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is significantly more scalable than recent algorithms for fast graph construction. PiPNN builds indexes at least 19.1x faster than MIRAGE and 17.3x faster than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine.

2026-02-17T02:18:17Z To appear at KDD'26 Tobias Rubel Richard Wen Laxman Dhulipala Lars Gottesbüren Rajesh Jayaram Jakub Łącki http://arxiv.org/abs/2605.27446v1 Context-aware Simopt-Power: Using structural data with simulation metadata to optimise FPGA designs 2026-05-23T22:03:00Z

Pre-implementation behavioural simulation routinely validates functional correctness, yet it also produces rich switching-activity traces that are typically discarded by FPGA computer-aided design (CAD) flows. Prior simulation-guided and power-aware FPGA optimisations demonstrate the promise of exploiting this metadata, but many rely on fixed thresholds, narrow decision heuristics, or limited design awareness, often incurring substantial area overhead. This paper presents Context-aware Simopt-Power, a simulator-guided optimisation framework that combines activity metadata with lightweight structural features (sequential proximity, logic-depth proxies, and fan-out estimates) to more precisely target high-impact regions of the netlist. We additionally remove empirically tuned constants, replacing them with architecture-aware parameters such as LUT size and mapping constraints, and evaluate trade-offs using power, delay, and two more useful metrics, area-delay product (AD) and power-delay product (PD). Implemented in an open-source Yosys/ABC flow and evaluated on the complex Koios deep-learning accelerator benchmarks, Context-aware Simopt-Power achieves an average 6.8% dynamic-power reduction while limiting LUT overhead to 11.2%, thus enabling a holistic design optimisation.

2026-05-23T22:03:00Z SMACD 2026 IEEE conference Eashan Wadhwa Georgios Floros Shanker Shreejith http://arxiv.org/abs/2605.24569v1 Energy-Aware Computing in the Year 2026 2026-05-23T13:04:23Z

High-Performance Computing (HPC) has recently entered the Exascale era, and considerable efforts are being made to fully harness this potential power for large-scale applications, such as cutting-edge generative AI (training and exploitation). The corresponding energy consumption is very high, and forecasts are alarming, making this metric a critical systemic bottleneck. Addressing this issue presents a genuine challenge for the entire cloud-edge-HPC continuum at all scales, from low-power IoT microcontrollers to multi-megawatt data centers. Beyond financial costs, green computing is driven by considerations related to climate change and environmental concerns such as carbon footprint ($CO_2e$), as well as constraints on energy production and supply, leading to a real need to regulate information and communication technology (ICT) activities. This article presents a comprehensive overview of energy-efficient computing, taking into account the most recent and significant contributions. Based on this exploration of the state of the art, we design and describe a holistic taxonomy of the aforementioned publications, structured around various perspectives, including hardware and software aspects, measurement instrumentation, software optimizations, dynamic task scheduling, voltage scaling, workload consolidation, federated learning, and cooling. Particular emphasis is placed on large-scale AI, which receives significant attention due to its considerable resource requirements. We conclude with an analysis of a forward-looking roadmap that considers the main perspectives of sustainable computing.

2026-05-23T13:04:23Z 26 pages Roblex Nana Tchakoute Claude Tadonki http://arxiv.org/abs/2605.24326v1 ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training 2026-05-23T01:11:19Z

The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to this paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct an in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over the production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.

2026-05-23T01:11:19Z 28 pages, 27 figures Minghao Li Alicia Golden Samuel Hsia Michael Kuchnik Adi Gangidi Xu Zhang Ashmitha Jeevaraj Shetty Zachary DeVito Weiwei Chu Dong He Haoci Zhang Yuchen Hao Ruoming Pang James Hongyi Zeng Ying Zhang Minlan Yu Carole-Jean Wu http://arxiv.org/abs/2602.01086v2 MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI 2026-05-22T22:49:28Z

Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Clinical Agents" remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a "Context Mismatch": AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable "Beads"--nodes in a Merkle Directed Acyclic Graph (DAG)--cryptographically referencing causal predecessors. This "write-once, read-many" architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the "Context Mismatch" by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for "Trustworthy Medical AI." It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient "AI-native language." We release MedBeads as open-source software to accelerate agent-native data standards.
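
The two mechanisms named in the abstract, hash-linked immutable events and breadth-first context retrieval, can be sketched in a few lines. This is an illustrative toy, not the MedBeads Go engine: the function names, the in-memory `store` dictionary, and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json
from collections import deque


def bead_id(payload: dict, parents: list) -> str:
    """Content address: hash of the payload plus the sorted parent IDs."""
    body = json.dumps({"payload": payload, "parents": sorted(parents)},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def add_bead(store: dict, payload: dict, parents=()) -> str:
    """Append an event that references its causal predecessors by hash."""
    parents = list(parents)
    bid = bead_id(payload, parents)
    store[bid] = {"payload": payload, "parents": parents}
    return bid


def verify(store: dict) -> bool:
    """True if every stored bead still hashes to its own ID."""
    return all(bead_id(b["payload"], b["parents"]) == bid
               for bid, b in store.items())


def context(store: dict, start: str) -> list:
    """Breadth-first walk over causal predecessors, O(V + E)."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        bid = queue.popleft()
        order.append(store[bid]["payload"])
        for parent in store[bid]["parents"]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order
```

Because each ID covers both the payload and the parent IDs, editing any stored event makes `verify` fail for that bead, and re-hashing it would change the ID its descendants reference, so tampering is detectable along the whole chain.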

2026-02-01T08:03:20Z 19 pages, 5 figures. Code available at https://github.com/medbeads/medbeads Takahito Nakajima http://arxiv.org/abs/2605.24259v1 Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure 2026-05-22T22:25:31Z

KV-cache reuse mechanisms increasingly expose priority, duration, offload, routing hints, scheduler modes, and event streams. These mechanisms help preserve reusable prefixes, but they do not by themselves define a portable contract for accepted future-reuse state when resident KV and active live KV cannot both fit. We introduce resident KV claims, a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes, a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool. Write no-admit prevents the active request from becoming future reusable state, but it still allows active allocation to evict residents from the shared pool. A minimal vLLM prototype shows that hard protected resident claims convert this failure mode into scheduler-visible active refusal with direct blocking-claim attribution. The result is not a production speedup or a new cache-replacement algorithm. It is a runtime contract that turns unreported resident loss into reconstructable active/resident arbitration. A companion MicroRuntime and vLLM litmus suite distinguish ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction.
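
The arithmetic of the allocator probe above can be worked through with a toy model. This sketch is not vLLM's allocator and uses no vLLM API; the function name and the single-request admission rule are assumptions made purely to show why 60 resident blocks plus 70 active blocks conflict in an 80-block pool.

```python
def admit_active(pool_blocks: int, resident_blocks: int,
                 active_blocks: int, hard_claim: bool) -> tuple:
    """Return (admitted, resident_blocks_evicted) for one active request.

    Toy model: residents occupy part of a shared pool; an active request
    may evict them unless they are protected by a hard claim.
    """
    free = pool_blocks - resident_blocks
    if active_blocks <= free:
        return True, 0
    if hard_claim:
        # Protected residents cannot be evicted, so the request is refused.
        return False, 0
    shortfall = active_blocks - free
    if shortfall > resident_blocks:
        # The request cannot fit even if every resident block is evicted.
        return False, 0
    return True, shortfall


# 60-block resident claim, 70-block active prefill, 80-block usable pool.
print(admit_active(80, 60, 70, hard_claim=False))  # (True, 50)
print(admit_active(80, 60, 70, hard_claim=True))   # (False, 0)
```

Without protection, the active request is admitted by silently evicting 50 resident blocks; with a hard claim, the same request is refused, which is the scheduler-visible outcome the abstract describes.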

2026-05-22T22:25:31Z 20 pages, 4 figures; reproducibility artifacts linked in Appendix A Lukas Stepanek http://arxiv.org/abs/2506.09199v2 FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models 2026-05-22T21:44:16Z

Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise; require transmitting large stacked local adapters, leading to poor communication efficiency; or necessitate reconstructing a memory-dense global weight-update matrix and performing a computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.

2025-06-10T19:36:36Z 21 pages, 12 figures Ninth Conference on Machine Learning and Systems (MLSys 2026) Hariharan Ramesh Jyotikrishna Dass http://arxiv.org/abs/2505.15988v2 An Ecosystem of Services for FAIR Computational Workflows 2026-05-22T21:40:35Z

Computational workflows represent major investments of effort and expertise. As first-class, publishable research objects of their own, they are key to sharing methodological know-how for reuse, reproducibility, and transparency. Thus, the application of the FAIR Principles to workflows is inevitable to enable them to be Findable, Accessible, Interoperable, and Reusable. Making workflows FAIR reduces duplication of effort, assists in the reuse of best practice approaches and community-supported standards, and ensures that workflows as digital objects can support reproducible, robust science. FAIR workflows draw from both FAIR data and software principles, and they help ensure and support data FAIRification. The FAIR Principles emphasize the association of persistent identifiers and machine-actionable metadata with workflows. Implementing the Principles requires a framework with appropriate programmatic protocols and an accompanying ecosystem of services, tools, policies, and best practices, as well as the buy-in of existing workflow systems. The European EOSC-Life Workflow Collaboratory is an example of such a digital infrastructure for the Biosciences. It includes a metadata standards framework for describing workflows that is managed and used by dedicated new FAIR workflow services and programmatic APIs for interoperability and metadata access. It includes the WorkflowHub registry and LifeMonitor workflow testing service, and it incorporates existing workflow systems and packaging solutions. Here, we introduce the FAIR Principles for workflows and connect FAIR workflows with the FAIR ecosystems they inhabit, with the EOSC-Life Collaboratory as a concrete example. We also introduce other community efforts that are easing the ways that workflows are shared and reused by others, and we discuss how the variations in different workflow settings impact their FAIR perspectives.

2025-05-21T20:11:58Z Chapter 4 in "Workflow Systems for Large-Scale Scientific Data Analysis", eds. Ulf Leser, Marcus Hilbrich, Sean R. Wilkinson, Rafael Ferreira da Silva Sean R. Wilkinson Johan Gustafsson Finn Bacall Khalid Belhajjame Salvador Capella Jose Maria Fernandez Gonzalez Jacob Fosso Tande Luiz Gadelha Daniel Garijo Patricia Grubel Bjorn Grüning Farah Zaib Khan Sehrish Kanwal Simone Leo Stuart Owen Luca Pireddu Line Pouchard Laura Rodríguez-Navas Beatriz Serrano-Solano Stian Soiland-Reyes Baiba Vilne Alan Williams Merridee Ann Wouters Frederik Coppens Carole Goble 10.14279/depositonce-25818 http://arxiv.org/abs/2605.24220v1 Polar: Agentic RL on Any Harness at Scale 2026-05-22T21:06:12Z

Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use, and multi-agent orchestration. However, porting these harnesses into RL environment interfaces remains difficult and often loses important training signals. We bridge this gap with Polar, a rollout framework for scalable asynchronous RL over arbitrary agent harnesses. Polar treats the agent harness as a black box: it proxies LLM API calls, records token-level model interactions, and reconstructs token-faithful trajectories for training. Each rollout node efficiently manages runtime prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, exposing asynchronous service endpoints that can be consumed by independent trainers at scale. This decoupled design makes Polar agnostic to agent harnesses, training infrastructure, and RL algorithms while improving compute utilization for long-running agent workloads. We validate Polar by training agents on software-engineering tasks with popular coding harnesses. Using simple GRPO, Polar improves Qwen3.5-4B by 22.6, 4.8, 0.6, and 6.2 points on SWE-Bench Verified with the Codex, Claude Code, Qwen Code, and Pi harnesses, respectively. We further demonstrate Polar for offline data generation over custom harnesses and ablate trajectory reconstruction strategies. Polar rewrites its preceding work, Prorl Agent, and has been registered as one of the NeMo Gym environments.

2026-05-22T21:06:12Z 17 pages, 6 figures, 2 tables Binfeng Xu Hao Zhang Shaokun Zhang Songyang Han Mingjie Liu Jian Hu Shizhe Diao Zhenghui Jin Yunheng Zou Michael Demoret Jan Kautz Yi Dong http://arxiv.org/abs/2601.20273v2 SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs 2026-05-22T19:27:59Z

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).

2026-01-28T05:42:07Z Jiacheng Yang Jun Wu Yaoyao Ding Zhiying Xu Yida Wang Gennady Pekhimenko