https://arxiv.org/api/7JjEFXjspePBtXult69V3OIkzkM
Feed updated: 2026-06-10T19:56:14Z

http://arxiv.org/abs/2503.11367v4
Efficient Distributed MLLM Training with Cornstarch
Updated: 2026-05-24T17:29:55Z
Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. While there are a few works that have attempted to address the heterogeneity in MLLM training, their approaches are limited to only superficially considering the characteristics of MLLMs.
In this paper, we present Cornstarch, an efficient distributed MLLM training framework that accounts for the unique characteristics of MLLMs in both model and data parallelization. Cornstarch introduces frozen-aware pipeline parallelism and token workload-balanced context parallelism to improve MLLM training throughput. Our extensive evaluation shows that Cornstarch outperforms state-of-the-art solutions by $2.26\times$ on average in terms of MLLM training throughput.
Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.
Published: 2025-03-14T13:07:45Z
Comments: ICML'26
Authors: Insu Jang, Runyu Lu, Nikhil Bansal, Ang Chen, Mosharaf Chowdhury

http://arxiv.org/abs/2505.07417v2
LA-IMR: Latency-Aware, Predictive In-Memory Routing and Proactive Autoscaling for Tail-Latency-Sensitive Cloud Robotics
Updated: 2026-05-24T14:55:51Z
Hybrid cloud-edge infrastructures now support latency-critical workloads ranging from autonomous vehicles and surgical robotics to immersive AR/VR. However, they continue to experience crippling long-tail latency spikes whenever bursty request streams exceed the capacity of heterogeneous edge and cloud tiers. To address these long-tail latency issues, we present Latency-Aware, Predictive In-Memory Routing and Proactive Autoscaling (LA-IMR). This control layer integrates a closed-form, utilization-driven latency model with event-driven scheduling, replica autoscaling, and edge-to-cloud offloading to mitigate 99th-percentile (P99) delays. Our analytic model decomposes end-to-end latency into processing, network, and queuing components, expressing inference latency as an affine power-law function of instance utilization. Once calibrated, it produces two complementary functions that drive: (i) millisecond-scale routing decisions for traffic offloading, and (ii) capacity planning that jointly determines replica pool sizes. LA-IMR enacts these decisions through a quality-differentiated, multi-queue scheduler and a custom-metric Kubernetes autoscaler that scales replicas proactively -- before queues build up -- rather than reactively based on lagging CPU metrics.
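The affine power-law latency model described above can be illustrated with a small sketch. The abstract does not give the calibrated form, so the expression `base_ms + scale_ms * utilization ** exponent`, all parameter values, the SLO, and both function names are assumptions chosen only to show how such a model could drive an offloading decision.

```python
def predicted_latency_ms(utilization, base_ms=20.0, scale_ms=200.0, exponent=3.0):
    """Affine power-law latency: a constant term plus a scaled power of utilization.

    The parameter values are hypothetical, not calibrated values from the paper.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return base_ms + scale_ms * utilization ** exponent


def should_offload(utilization, slo_ms=100.0):
    """Assumed routing rule: offload when predicted latency exceeds the SLO."""
    return predicted_latency_ms(utilization) > slo_ms


if __name__ == "__main__":
    for u in (0.2, 0.5, 0.9):
        print(u, round(predicted_latency_ms(u), 1), should_offload(u))
```

With these illustrative parameters the predicted latency stays well under the 100 ms SLO at 50% utilization and crosses it at 90%, which is the kind of threshold a utilization-driven router could act on before queues build up.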
Across representative vision workloads (YOLOv5m and EfficientDet) and bursty arrival traces, LA-IMR reduces P99 latency by up to 20.7 percent compared to traditional latency-only autoscaling, laying a principled foundation for next-generation, tail-tolerant cloud-edge inference services.
Published: 2025-05-12T10:12:24Z
Comments: v2: Bibliography audited after citation-verification review; unverifiable references removed or replaced with traceable sources; related-work contextual discussion revised accordingly. Core method, algorithms, experiments, and reported results unchanged
Authors: Eunil Seo, Chanh Nguyen, Erik Elmroth

http://arxiv.org/abs/2605.10718v2
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
Updated: 2026-05-24T14:23:35Z
Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either because they lack causal awareness or because they act under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier.
Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate while maintaining 62.0% repair accuracy and a 3 ms mean time to repair.
Published: 2026-05-11T15:28:37Z
Authors: Suvi De Silva, Alfreds Lapkovskis, Alaa Saleh, Sasu Tarkoma, Praveen Kumar Donta

http://arxiv.org/abs/2507.15233v3
A Multi-Armed Bandit-Based Participant Selection Method for Federated Recommendation Systems
Updated: 2026-05-24T11:43:51Z
Federated Recommendation Systems (FRS) enable privacy-preserving model training by keeping user data on edge devices. However, the practical deployment of FRS in Edge-Cloud environments faces significant challenges due to system and statistical heterogeneity. Existing FRS participant selection strategies struggle to dynamically balance the trade-off between model convergence speed and recommendation quality in such volatile environments. To address this, we formulate the FRS participant selection problem as a normalized utility cost that accounts for model quality and system efficiency. Next, we propose a dynamic participant selection framework incorporating a Multi-Armed Bandit (MAB)-based solver for multimodal FRS. We design a client-utility function that jointly evaluates historical Client Performance Reputation, data quality, and real-time system latency. By leveraging an Upper Confidence Bound strategy, our framework effectively balances the exploration of under-sampled clients with the exploitation of high-performing ones. We validate the proposed approach on a realistic edge-cloud testbed implementation using a multimodal movie-recommendation task. Experimental results demonstrate that our MAB-driven approach outperforms other baselines across eight different data-skew scenarios.
Specifically, it improves training efficiency by 32-50% while improving model quality metrics such as Recall@50 by up to around 5%.
Published: 2025-07-21T04:28:55Z
Comments: Accepted in IEEE/ACM CCGRID 2026
Authors: Jintao Liu, Mohammad Goudarzi, Adel Nadjaran Toosi

http://arxiv.org/abs/2605.24832v1
Optimus: Elastic Decoding for Efficient Diffusion LLM Serving
Updated: 2026-05-24T02:56:46Z
Large language model (LLM) serving is fundamentally limited by inefficient hardware utilization. Autoregressive (AR) decoding underutilizes GPUs due to its strictly sequential execution, while diffusion LLMs (DLLMs) improve throughput by decoding multiple tokens per iteration. However, fixed block-size diffusion decoding exhibits strong load sensitivity: large blocks exploit idle GPU resources under low load, but saturate early and incur substantial redundant computation under high load. As a result, throughput gains vanish beyond saturation, and no single decoding granularity performs well across dynamic serving workloads.
We present Optimus, a serving system that enables elastic decoding for diffusion LLMs by dynamically adapting decoding granularity to runtime load. The key idea is to treat decoding granularity as a runtime control variable, balancing GPU utilization and token efficiency. Optimus combines chunked decoding, which enables fine-grained execution without retraining, with saturation-aware scheduling, a closed-loop mechanism that selects chunk sizes based on runtime conditions. Together with system-level optimizations and customized attention kernels, Optimus achieves significant performance improvements while preserving model accuracy. Experiments show that Optimus delivers up to 6.1x throughput improvement over AR decoding and 4.3x improvement over fixed-block diffusion LLM, while maintaining stable performance across diverse load regimes and improving end-to-end serving capacity under latency constraints. The source code is available at https://github.com/dubcyfor3/Optimus.
Published: 2026-05-24T02:56:46Z
Authors: Chiyue Wei, Cong Guo, Bowen Duan, Junyao Zhang, Haoxuan Shan, Yifei Wang, Yangjie Zhou, Hai "Helen" Li, Danyang Zhuo, Yiran Chen

http://arxiv.org/abs/2602.21247v2
PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing
Updated: 2026-05-24T00:23:16Z
The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have large construction times due to relying on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids this "search bottleneck" that existing graph-based methods suffer from.
PiPNN's core innovation is HashPrune, a novel online pruning algorithm which dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction which permits PiPNN to build higher quality indices without the use of extra intermediate memory.
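Why bulk distance comparisons map onto dense matrix multiplication can be seen from a standard identity. The abstract does not name the distance metric, so squared Euclidean distance is assumed here, and the notation (points of one sub-problem as rows $x_i$ of a matrix $X$, squared norms $r_i$) is introduced only for illustration:

$$
\|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2\,x_i^\top x_j
\quad\Longrightarrow\quad
D = r\,\mathbf{1}^\top + \mathbf{1}\,r^\top - 2\,XX^\top,
\qquad r_i = \|x_i\|^2 .
$$

Under this assumption, the only term with quadratic cost in the partition size is the product $XX^\top$, which is a single dense matrix multiplication per sub-problem rather than a sequence of random-access lookups.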
PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is significantly more scalable than recent algorithms for fast graph construction. PiPNN builds indexes at least 19.1x faster than MIRAGE and 17.3x faster than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine.
Published: 2026-02-17T02:18:17Z
Comments: To appear at KDD'26
Authors: Tobias Rubel, Richard Wen, Laxman Dhulipala, Lars Gottesbüren, Rajesh Jayaram, Jakub Łącki

http://arxiv.org/abs/2605.27446v1
Context-aware Simopt-Power: Using structural data with simulation metadata to optimise FPGA designs
Updated: 2026-05-23T22:03:00Z
Pre-implementation behavioural simulation routinely validates functional correctness, yet it also produces rich switching-activity traces that are typically discarded by FPGA computer-aided design (CAD) flows. Prior simulation-guided and power-aware FPGA optimisations demonstrate the promise of exploiting this metadata, but many rely on fixed thresholds, narrow decision heuristics, or limited design awareness, often incurring substantial area overhead. This paper presents Context-aware Simopt-Power, a simulator-guided optimisation framework that combines activity metadata with lightweight structural features (sequential proximity, logic-depth proxies, and fan-out estimates) to more precisely target high-impact regions of the netlist. We additionally remove empirically tuned constants, replacing them with architecture-aware parameters such as LUT size and mapping constraints, and evaluate trade-offs using power, delay, and two more useful metrics: area-delay product (AD) and power-delay product (PD).
Implemented in an open-source Yosys/ABC flow and evaluated on the complex Koios deep-learning accelerator benchmarks, Context-aware Simopt-Power achieves an average 6.8% dynamic-power reduction while limiting LUT overhead to 11.2%, thus enabling a holistic design optimisation.
Published: 2026-05-23T22:03:00Z
Comments: SMACD 2026 IEEE conference
Authors: Eashan Wadhwa, Georgios Floros, Shanker Shreejith

http://arxiv.org/abs/2605.24569v1
Energy-Aware Computing in the Year 2026
Updated: 2026-05-23T13:04:23Z
High-Performance Computing (HPC) has recently entered the Exascale era, and considerable efforts are being made to fully harness this potential power for large-scale applications, such as cutting-edge generative AI (training and exploitation). The corresponding energy consumption is very high, and forecasts are alarming, making this metric a critical systemic bottleneck. Addressing this issue presents a genuine challenge for the entire cloud-edge-HPC continuum at all scales, from low-power IoT microcontrollers to multi-megawatt data centers. Beyond financial costs, green computing is driven by considerations related to climate change and environmental concerns such as carbon footprint ($CO_2e$), as well as constraints on energy production and supply, leading to a real need to regulate information and communication technology (ICT) activities. This article presents a comprehensive overview of energy-efficient computing, taking into account the most recent and significant contributions. Based on this exploration of the state of the art, we design and describe a holistic taxonomy of the aforementioned publications, structured around various perspectives, including hardware and software aspects, measurement instrumentation, software optimizations, dynamic task scheduling, voltage scaling, workload consolidation, federated learning, and cooling. Particular emphasis is placed on large-scale AI, which receives significant attention due to its considerable resource requirements.
We conclude with an analysis of a forward-looking roadmap that considers the main perspectives of sustainable computing.
Published: 2026-05-23T13:04:23Z
Comments: 26 pages
Authors: Roblex Nana Tchakoute, Claude Tadonki

http://arxiv.org/abs/2605.24326v1
ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training
Updated: 2026-05-23T01:11:19Z
The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to this paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over the production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.
Published: 2026-05-23T01:11:19Z
Comments: 28 pages, 27 figures
Authors: Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, Carole-Jean Wu

http://arxiv.org/abs/2602.01086v2
MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI
Updated: 2026-05-22T22:49:28Z
Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge.
However, deploying them as autonomous "Clinical Agents" remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a "Context Mismatch": AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable "Beads"--nodes in a Merkle Directed Acyclic Graph (DAG)--cryptographically referencing causal predecessors. This "write-once, read-many" architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the "Context Mismatch" by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for "Trustworthy Medical AI." It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient "AI-native language." We release MedBeads as open-source software to accelerate agent-native data standards.
Published: 2026-02-01T08:03:20Z
Comments: 19 pages, 5 figures.
Code available at https://github.com/medbeads/medbeads
Authors: Takahito Nakajima

http://arxiv.org/abs/2605.24259v1
Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure
Updated: 2026-05-22T22:25:31Z
KV-cache reuse mechanisms increasingly expose priority, duration, offload, routing hints, scheduler modes, and event streams. These mechanisms help preserve reusable prefixes, but they do not by themselves define a portable contract for accepted future-reuse state when resident KV and active live KV cannot both fit. We introduce resident KV claims, a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes, a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool. Write no-admit prevents the active request from becoming future reusable state, but it still allows active allocation to evict residents from the shared pool. A minimal vLLM prototype shows that hard protected resident claims convert this failure mode into scheduler-visible active refusal with direct blocking-claim attribution. The result is not a production speedup or a new cache-replacement algorithm. It is a runtime contract that turns unreported resident loss into reconstructable active/resident arbitration.
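The block counts in the allocator probe make the conflict easy to check by hand: 60 + 70 = 130 blocks are requested against 80 usable. The sketch below is a toy model of that arbitration, not vLLM code. The `KVPool` class, its method names, and the `hard` flag are hypothetical and only illustrate the difference between silently evicting a resident claim and refusing the active request with attribution.

```python
from dataclasses import dataclass, field


@dataclass
class KVPool:
    """Toy KV block pool (hypothetical; not the vLLM allocator)."""
    usable_blocks: int
    resident_claims: dict = field(default_factory=dict)  # claim id -> (blocks, hard)

    def add_claim(self, claim_id, blocks, hard):
        self.resident_claims[claim_id] = (blocks, hard)

    def admit_active(self, blocks_needed):
        """Return (admitted, evicted_claim_ids, blocking_claim_ids)."""
        resident = sum(blocks for blocks, _ in self.resident_claims.values())
        free = self.usable_blocks - resident
        evicted = []
        # Soft claims may be evicted to make room; hard claims may not.
        for claim_id, (blocks, hard) in self.resident_claims.items():
            if blocks_needed <= free:
                break
            if not hard:
                evicted.append(claim_id)
                free += blocks
        if blocks_needed <= free:
            for claim_id in evicted:
                del self.resident_claims[claim_id]
            return True, evicted, []
        # Refuse the active request and report which hard claims block it.
        blocking = [cid for cid, (_, hard) in self.resident_claims.items() if hard]
        return False, [], blocking


if __name__ == "__main__":
    soft = KVPool(usable_blocks=80)
    soft.add_claim("prefix-A", 60, hard=False)
    print(soft.admit_active(70))  # (True, ['prefix-A'], [])

    hard = KVPool(usable_blocks=80)
    hard.add_claim("prefix-A", 60, hard=True)
    print(hard.admit_active(70))  # (False, [], ['prefix-A'])
```

In the soft case the resident claim disappears as a side effect of admission. In the hard case the same request is refused and the caller learns which claim blocked it, which is the scheduler-visible outcome the abstract describes.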
A companion MicroRuntime and vLLM litmus suite distinguish ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction.
Published: 2026-05-22T22:25:31Z
Comments: 20 pages, 4 figures; reproducibility artifacts linked in Appendix A
Authors: Lukas Stepanek

http://arxiv.org/abs/2506.09199v2
FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models
Updated: 2026-05-22T21:44:16Z
Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise, require transmitting large stacked local adapters, leading to poor communication efficiency, or necessitate reconstructing a memory-dense global weight-update matrix and performing a computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients.
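One common way to turn a tunable threshold into a rank choice is to keep the smallest rank whose singular values retain a target fraction of the total energy. The abstract does not state the exact rule FLoRIST uses, so the energy criterion, the `tau` parameter, and the `select_rank` function below are assumptions used purely to illustrate the idea of threshold-driven rank selection.

```python
def select_rank(singular_values, tau=0.9):
    """Smallest rank whose squared singular values retain a fraction tau of the energy.

    Hypothetical rule for illustration; not taken from the FLoRIST paper.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    # Work on a descending copy so the caller's ordering does not matter.
    ordered = sorted(singular_values, reverse=True)
    total = sum(s * s for s in ordered)
    if total == 0.0:
        return 0
    retained = 0.0
    for rank, s in enumerate(ordered, start=1):
        retained += s * s
        if retained >= tau * total:
            return rank
    return len(ordered)


if __name__ == "__main__":
    spectrum = [4.0, 2.0, 1.0, 0.5]
    for tau in (0.75, 0.9, 0.99):
        print(tau, select_rank(spectrum, tau))
```

Raising `tau` keeps more singular directions and therefore a larger shared adapter, while lowering it trades accuracy for communication cost, which is the kind of balance the abstract refers to.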
Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.
Published: 2025-06-10T19:36:36Z
Comments: 21 pages, 12 figures
Journal reference: Ninth Conference on Machine Learning and Systems (MLSys 2026)
Authors: Hariharan Ramesh, Jyotikrishna Dass

http://arxiv.org/abs/2505.15988v2
An Ecosystem of Services for FAIR Computational Workflows
Updated: 2026-05-22T21:40:35Z
Computational workflows represent major investments of effort and expertise. As first-class, publishable research objects of their own, they are key to sharing methodological know-how for reuse, reproducibility, and transparency. Thus, the application of the FAIR Principles to workflows is inevitable to enable them to be Findable, Accessible, Interoperable, and Reusable. Making workflows FAIR reduces duplication of effort, assists in the reuse of best practice approaches and community-supported standards, and ensures that workflows as digital objects can support reproducible, robust science. FAIR workflows draw from both FAIR data and software principles, and they help ensure and support data FAIRification.
The FAIR Principles emphasize the association of persistent identifiers and machine-actionable metadata with workflows. Implementing the Principles requires a framework with appropriate programmatic protocols and an accompanying ecosystem of services, tools, policies, and best practices, as well as the buy-in of existing workflow systems. The European EOSC-Life Workflow Collaboratory is an example of such a digital infrastructure for the Biosciences. It includes a metadata standards framework for describing workflows that is managed and used by dedicated new FAIR workflow services and programmatic APIs for interoperability and metadata access. It includes the WorkflowHub registry and LifeMonitor workflow testing service, and it incorporates existing workflow systems and packaging solutions.
Here, we introduce the FAIR Principles for workflows and connect FAIR workflows with the FAIR ecosystems they inhabit, with the EOSC-Life Collaboratory as a concrete example. We also introduce other community efforts that are easing the ways that workflows are shared and reused by others, and we discuss how the variations in different workflow settings impact their FAIR perspectives.
Published: 2025-05-21T20:11:58Z
Comments: Chapter 4 in "Workflow Systems for Large-Scale Scientific Data Analysis", eds. Ulf Leser, Marcus Hilbrich, Sean R. Wilkinson, Rafael Ferreira da Silva
Authors: Sean R. Wilkinson, Johan Gustafsson, Finn Bacall, Khalid Belhajjame, Salvador Capella, Jose Maria Fernandez Gonzalez, Jacob Fosso Tande, Luiz Gadelha, Daniel Garijo, Patricia Grubel, Bjorn Grüning, Farah Zaib Khan, Sehrish Kanwal, Simone Leo, Stuart Owen, Luca Pireddu, Line Pouchard, Laura Rodríguez-Navas, Beatriz Serrano-Solano, Stian Soiland-Reyes, Baiba Vilne, Alan Williams, Merridee Ann Wouters, Frederik Coppens, Carole Goble
DOI: 10.14279/depositonce-25818

http://arxiv.org/abs/2605.24220v1
Polar: Agentic RL on Any Harness at Scale
Updated: 2026-05-22T21:06:12Z
Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. However, porting these harnesses into RL environment interfaces remains difficult and often loses important training signals. We bridge this gap with polar, a rollout framework for scalable asynchronous RL over arbitrary agent harnesses. Polar treats the agent harness as a black box: it proxies LLM API calls, records token-level model interactions, and reconstructs token-faithful trajectories for training. Each rollout node efficiently manages runtime prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, exposing asynchronous service endpoints that can be consumed by independent trainers at scale.
This decoupled design makes Polar agnostic to agent harnesses, training infrastructure, and RL algorithms while improving compute utilization for long-running agent workloads. We validate polar by training agents on software-engineering tasks with popular coding harnesses. Using simple GRPO, polar improves Qwen3.5-4B by 22.6, 4.8, 0.6 and 6.2 points on SWE-Bench Verified with the Codex, Claude Code, Qwen Code and Pi harnesses, respectively. We further demonstrate Polar for offline data generation over custom harnesses and ablate trajectory reconstruction strategies. Polar rewrites its preceding work, Prorl Agent, and has been registered as one of the NeMo Gym environments.
Published: 2026-05-22T21:06:12Z
Comments: 17 pages, 6 figures, 2 tables
Authors: Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz, Yi Dong

http://arxiv.org/abs/2601.20273v2
SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs
Updated: 2026-05-22T19:27:59Z
Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine.
StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).
Published: 2026-01-28T05:42:07Z
Authors: Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko