http://arxiv.org/abs/2603.12440v1
KernelFoundry: Hardware-aware evolutionary GPU kernel optimization
2026-03-12T20:40:04Z
Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance profiling outputs. Most existing LLM-based approaches to kernel generation rely on simple prompting and feedback loops, incorporating hardware awareness only indirectly through profiling feedback. We introduce KernelFoundry, an evolutionary framework that efficiently explores the GPU kernel design space through three key mechanisms: (1) MAP-Elites quality-diversity search with kernel-specific behavioral dimensions to sustain exploration across diverse optimization strategies; (2) meta-prompt evolution, which co-evolves prompts with kernels to uncover task-specific optimization strategies; and (3) template-based parameter optimization to tune kernels to inputs and hardware. We evaluate this framework on KernelBench, robust-kbench, and custom tasks, generating SYCL kernels as a cross-platform GPU programming model and CUDA kernels for comparison to prior work. Our approach consistently outperforms the baseline methods, achieving an average speedup of 2.3x on KernelBench for SYCL.
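As an illustration of mechanism (1), the following is a minimal, generic MAP-Elites loop in Python. It is a sketch under stated assumptions, not KernelFoundry's implementation: the two-parameter toy "kernel", the fitness proxy, the behavior descriptor, and all function names are hypothetical.

```python
import random


def map_elites(evaluate, describe, mutate, seed_solution, bins, iterations, rng):
    """Generic MAP-Elites: keep the best solution found in each behavior cell."""
    archive = {}  # cell (tuple of bin indices) -> (fitness, solution)

    def try_insert(solution):
        # Discretize each descriptor value in [0, 1] into one of `bins` buckets.
        cell = tuple(min(int(d * bins), bins - 1) for d in describe(solution))
        fitness = evaluate(solution)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, solution)

    try_insert(seed_solution)
    for _ in range(iterations):
        # Pick a random elite as parent, mutate it, offer the child to the archive.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Hypothetical stand-in for a kernel: two tunable parameters in [0, 1].
def evaluate(s):
    # Hypothetical "speedup" proxy, maximal at (0.7, 0.3).
    return -((s[0] - 0.7) ** 2 + (s[1] - 0.3) ** 2)


def describe(s):
    # Assumption: the behavior descriptor is the parameter pair itself.
    return s


def mutate(s, rng):
    # Gaussian perturbation, clipped back into [0, 1].
    return tuple(min(1.0, max(0.0, x + rng.gauss(0, 0.1))) for x in s)


archive = map_elites(evaluate, describe, mutate, (0.5, 0.5),
                     bins=4, iterations=500, rng=random.Random(0))
best_fitness, best_solution = max(archive.values())
```

The archive holds one elite per cell, so a strategy that is not globally best still survives as long as it is the best of its kind; that is the property quality-diversity search relies on to sustain exploration.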
Moreover, KernelFoundry is implemented as a distributed framework with remote access to diverse hardware, enabling rapid benchmarking and featuring a flexible user input layer that supports kernel generation for a wide range of real-world use cases beyond benchmarking.
2026-03-12T20:40:04Z
Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, Benjamin Ummenhofer

http://arxiv.org/abs/2602.09604v3
High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors
2026-03-12T19:07:19Z
ARM SVE and RISC-V RVV are emerging vector architectures in high-end processors that support vectorization of flexible vector length. In this work, we leverage an important workload for quantum computing, quantum state-vector simulations, to understand whether high-performance portability can be achieved in a vector-length agnostic (VLA) design. We propose a VLA design and optimization techniques critical for achieving high performance, including VLEN-adaptive memory layout adjustment, load buffering, fine-grained loop control, and gate fusion-based arithmetic intensity adaptation. We provide an implementation in Google's Qsim and evaluate five quantum circuits of up to 36 qubits on three ARM processors, including NVIDIA Grace, AWS Graviton3, and Fujitsu A64FX. By defining new metrics and PMU events to quantify vectorization activities, we draw generic insights for future VLA designs. Our single-source implementation of VLA quantum simulations achieves up to 4.5x speedup on A64FX, 2.5x speedup on Grace, and 1.5x speedup on Graviton.
2026-02-10T09:55:12Z
To be published in IPDPS2026
Ruimin Shi, Gabin Schieffer, Pei-Hung Lin, Maya Gokhale, Andreas Herten, Ivy Peng

http://arxiv.org/abs/2603.12381v1
OpenDC-STEAM: Realistic Modeling and Systematic Exploration of Composable Techniques for Sustainable Datacenters
2026-03-12T18:59:30Z
The need to reduce datacenter carbon footprint is urgent.
While many sustainability techniques have been proposed, they are often evaluated in isolation, using limited setups or analytical models that overlook real-world dynamics and interactions between methods. This makes it challenging for researchers and operators to understand the effectiveness and trade-offs of combining such techniques. We design OpenDC-STEAM, an open-source customizable datacenter simulator, to investigate the individual and combined impact of sustainability techniques on datacenter operational and embodied carbon emissions, and their trade-off with performance. Using STEAM, we systematically explore three representative techniques - horizontal scaling, leveraging batteries, and temporal shifting - with diverse representative workloads, datacenter configurations, and carbon-intensity traces. Our analysis highlights that datacenter dynamics can influence their effectiveness and that combining strategies can significantly lower emissions, but introduces complex cost-emissions-performance trade-offs that STEAM can help navigate. STEAM supports the integration of new models and techniques, making it a foundation framework for holistic, quantitative, and reproducible research in sustainable computing. Following open-science principles, STEAM is available as FOSS: https://github.com/atlarge-research/OpenDC-STEAM.
2026-03-12T18:59:30Z
This is an extended version of a paper published at CCGRID 2026
Dante Niewenhuis, Sacheendra Talluri, Alexandru Iosup, Tiziano de Matteis

http://arxiv.org/abs/2504.21620v2
Deterministic Distributed DFS and Other Problems via Cycle Separators in Planar Graphs
2026-03-12T17:35:07Z
One of the most basic techniques in algorithm design consists of breaking a problem into subproblems and then proceeding recursively. In the case of graph algorithms, one way to implement this approach is through separator sets.
Given a graph $G=(V,E)$, a subset of nodes $S \subseteq V$ is called a separator set of $G$ if the size of each connected component of $G-S$ is at most $2/3 \cdot |V|$. The most useful separator sets are those that satisfy certain restrictions of cardinality or structure. For over 40 years, various efficient algorithms have been developed for computing separators of different kinds, particularly in planar graphs. Separator sets, combined with a divide and conquer approach, have been fundamental in the design of efficient algorithms in various settings.
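The separator definition above can be checked directly. The following Python sketch tests whether a node set $S$ satisfies the $2/3 \cdot |V|$ bound on every component of $G-S$; the adjacency-dict representation and the function name `is_balanced_separator` are assumptions made for illustration, and the check is centralized (it says nothing about the distributed CONGEST setting).

```python
from collections import deque


def is_balanced_separator(adj, S):
    """Return True if every connected component of G - S has at most 2/3 * |V| nodes.

    adj maps each node to an iterable of its neighbors (undirected graph).
    """
    n = len(adj)
    removed = set(S)
    seen = set()
    for start in adj:
        if start in removed or start in seen:
            continue
        # Breadth-first search over the component of `start` in G - S.
        size = 0
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    queue.append(v)
        # Integer form of size > (2/3) * n, avoiding floating point.
        if 3 * size > 2 * n:
            return False
    return True


# Path graph 0-1-2-3-4-5-6: removing the middle node leaves two components of size 3.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
```

On this 7-node path, `{3}` qualifies because both remaining components have 3 nodes, while `{0}` does not because it leaves a single component of 6 nodes.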
In this work, we present the first deterministic algorithm in the distributed CONGEST model that recursively computes a cycle separator in planar graphs in $\tilde{\mathcal{O}}(D)$ rounds. This result, as in the centralized setting, has significant implications for distributed planar algorithms. In fact, from this result, we can construct a deterministic algorithm that computes a DFS tree in $\tilde{\mathcal{O}}(D)$ rounds. This matches both the best-known randomized algorithm of Ghaffari and Parter (DISC'17) and, up to polylogarithmic factors, the trivial lower bound of $\Omega(D)$ rounds.
Besides DFS, our deterministic cycle separator algorithm can be used to derandomize several planar-graph algorithms whose only randomized ingredient is the computation of a cycle separator, such as maximum flow (Abd-Elhaleem, Dory, Parter and Weimann, PODC'25), single-source shortest path (Li and Parter, STOC'19), and reachability (Parter, DISC'20).
2025-04-30T13:21:12Z
PODC 2025
Benjamin Jauregui, Pedro Montealegre, Ivan Rapaport

http://arxiv.org/abs/2603.10970v2
Reference Architecture of a Quantum-Centric Supercomputer
2026-03-12T17:16:15Z
Quantum computers have demonstrated utility in simulating quantum systems beyond brute-force classical approaches. As the community builds on these demonstrations to explore using quantum computing for applied research, algorithms and workflows have emerged that require leveraging both quantum computers and classical high-performance computing (HPC) systems to scale applications, especially in chemistry and materials, beyond what either system can simulate alone. Today, these disparate systems operate in isolation, forcing users to manually orchestrate workloads, coordinate job scheduling, and transfer data between systems -- a cumbersome process that hinders productivity and severely limits rapid algorithmic exploration. These challenges motivate the need for flexible and high-performance Quantum-Centric Supercomputing (QCSC) systems that integrate Quantum Processing Units (QPUs), Graphics Processing Units (GPUs), and Central Processing Units (CPUs) to accelerate discovery of such algorithms across applications. These systems will be co-designed across quantum and classical HPC infrastructure, middleware, and application layers to accelerate the adoption of quantum computing for solving critical computational problems.
We envision QCSC evolution through three distinct phases: (1) quantum systems as specialized compute offload engines within existing HPC complexes; (2) heterogeneous quantum and classical HPC systems coupled through advanced middleware, enabling seamless execution of hybrid quantum-classical algorithms; and (3) fully co-designed heterogeneous quantum-HPC systems for hybrid computational workflows. This article presents a reference architecture and roadmap for these QCSC systems.
2026-03-11T16:55:21Z
20 pages, 5 figures, minor fixes
Seetharami Seelam, Jerry M. Chow, Antonio Córcoles, Sarah Sheldon, Tushar Mittal, Abhinav Kandala, Sean Dague, Ian Hincks, Hiroshi Horii, Blake Johnson, Michael Le, Hani Jamjoom, Jay M. Gambetta

http://arxiv.org/abs/2603.12118v1
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
2026-03-12T16:20:35Z
Any-to-Any models are an emerging class of multimodal models that accept combinations of multimodal data (e.g., text, image, video, audio) as input and generate them as output. Serving these models is challenging; different requests with different input and output modalities traverse different paths through the model computation graph, and each component of the model has different scaling characteristics.
We present Cornserve, a distributed serving system for generic Any-to-Any models. Cornserve provides a flexible task abstraction for expressing Any-to-Any model computation graphs, enabling component disaggregation and independent scaling. The distributed runtime dispatches compute to the data plane via an efficient record-and-replay execution model that keeps track of data dependencies, and forwards tensor data between components directly from the producer to the consumer. Built on Kubernetes with approximately 23K new lines of Python, Cornserve supports diverse Any-to-Any models and delivers up to 3.81$\times$ higher throughput and 5.79$\times$ lower tail latency. Cornserve is open-source, and the demo video is available on YouTube.
2026-03-12T16:20:35Z
Open source https://github.com/cornserve-ai/cornserve / Demo video https://www.youtube.com/watch?v=nb8R-vztLRg
Jae-Won Chung, Jeff J. Ma, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury

http://arxiv.org/abs/2405.21025v4
On Reduction and Synthesis of Petri's Cycloids
2026-03-12T16:00:19Z
Cycloids are particular Petri nets for modelling processes of actions and events, belonging to the fundaments of Petri's general systems theory. Defined by four parameters, they provide an algebraic formalism to describe strongly synchronized sequential processes. To further investigate their structure, reduction systems of cycloids are defined in the style of rewriting systems and properties of irreducible cycloids are proved. In particular, the synthesis of cycloid parameters from their Petri net structure is derived, leading to an efficient method for a decision procedure for cycloid isomorphism.
2024-05-31T17:13:44Z
Rüdiger Valk, Daniel Moldt

http://arxiv.org/abs/2603.12044v1
HPC Containers for EBRAINS: Towards Portable Cross-Domain Software Environment
2026-03-12T15:19:45Z
Deploying complex, distributed scientific workflows across diverse HPC sites is often hindered by site-specific dependencies and complex build environments.
This paper investigates the design and performance of portable HPC container images capable of encapsulating MPI- and CUDA-enabled software stacks without sacrificing bare-metal performance. This study is part of recent work within the EBRAINS Research Infrastructure to evaluate the implementation of portable HPC (Apptainer-based) container images targeting the EBRAINS Software Distribution (ESD) -- a Spack-based software ecosystem comprising approximately 80 top-level packages (and 800 dependencies). We evaluate a hybrid, PMIx-based containerization strategy using Apptainer that seamlessly bypasses the need for site-specific builds by dynamically leveraging host-level specialized hardware, such as network interfaces and GPUs, on two production HPC clusters: Karolina and Jureca-DC. We demonstrate the feasibility of building portable, MPI- and CUDA-enabled scientific software into container images that correctly leverage site-installed drivers and hardware to reproduce bare-metal communication behavior. Using communication microbenchmarks (e.g., OSU and NCCL) alongside performance metrics of applications from neuroscience, we measure and verify their performance against bare-metal deployments. Crucially, our verification approach extends beyond top-level runtime measurements; we highlight the analysis of underlying debug logs to actively detect misbehavior and misconfigurations, such as suboptimal transport pathways.
Ultimately, this investigation demonstrates the feasibility of a simple and reproducible methodology for decoupling software environments from underlying infrastructures, paving the way for automated pipelines that ensure optimized, performance-verified execution across varied HPC architectures.
2026-03-12T15:19:45Z
Krishna Kant Singh, Eric Müller, Eleni Mathioulaki, Wouter Klijn, Lena Oden

http://arxiv.org/abs/2603.12031v1
AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
2026-03-12T15:09:48Z
State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS). AGMARL-DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi-agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context-aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent.
This represents an improvement over methods that rely solely on local observations. Finally, to make trade-offs between these objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL-DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission-critical workloads.
2026-03-12T15:09:48Z
Hamed Hamzeh

http://arxiv.org/abs/2603.12001v1
Decentralized Orchestration Architecture for Fluid Computing: A Secure Distributed AI Use Case
2026-03-12T14:49:12Z
Distributed AI and IoT applications increasingly execute across heterogeneous resources spanning end devices, edge/fog infrastructure, and cloud platforms, often under different administrative domains. Fluid Computing has emerged as a promising paradigm for enhancing massive resource management across the computing continuum by treating such resources as a unified fabric, enabling optimal service-agnostic deployments driven by application requirements. However, existing solutions remain largely centralized and often do not explicitly address multi-domain considerations. This paper proposes an agnostic multi-domain orchestration architecture for fluid computing environments. The orchestration plane enables decentralized coordination among domains that maintain local autonomy while jointly realizing intent-based deployment requests from tenants, ensuring end-to-end placement and execution. To this end, the architecture elevates domain-side control services as first-class capabilities to support application-level enhancement at runtime. As a representative use case, we consider a multi-domain Decentralized Federated Learning (DFL) deployment under Byzantine threats.
We leverage domain-side capabilities to enhance Byzantine security by introducing FU-HST, an SDN-enabled multi-domain anomaly detection mechanism that complements Byzantine-robust aggregation. We validate the approach via simulation in single- and multi-domain settings, evaluating anomaly detection, DFL performance, and computation/communication overhead.
2026-03-12T14:49:12Z
19 pages, 9 figures and 1 table. Under peer review
Diego Cajaraville-Aboy, Ana Fernández-Vilas, Rebeca P. Díaz-Redondo, Manuel Fernández-Veiga, Pablo Picallo-López

http://arxiv.org/abs/2603.11850v1
Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning
2026-03-12T12:17:17Z
Impaction of the mandibular third molar in proximity to the mandibular canal increases the risk of inferior alveolar nerve injury. Panoramic radiography is routinely used to assess this relationship. Automated classification of molar-canal overlap could support clinical triage and reduce unnecessary CBCT referrals, while federated learning (FL) enables multi-center collaboration without sharing patient data. We compared Local Learning (LL), FL, and Centralized Learning (CL) for binary overlap/no-overlap classification on cropped panoramic radiographs partitioned across eight independent labelers. A pretrained ResNet-34 was trained under each paradigm and evaluated using per-client metrics with locally optimized thresholds and pooled test performance with a global threshold. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-based metrics, alongside training dynamics, Grad-CAM visualizations, and server-side aggregate monitoring signals. On the test set, CL achieved the highest performance (AUC 0.831; accuracy = 0.782), FL showed intermediate performance (AUC 0.757; accuracy = 0.703), and LL generalized poorly across clients (AUC range = 0.619-0.734; mean = 0.672).
Training curves suggested overfitting, particularly in LL models, and Grad-CAM indicated more anatomically focused attention in CL and FL. Overall, centralized training provided the strongest performance, while FL offers a privacy-preserving alternative that outperforms LL.
2026-03-12T12:17:17Z
Johan Andreas Balle Rubak, Sara Haghighat, Sanyam Jain, Mostafa Aldesoki, Akhilanand Chaurasia, Sarah Sadat Ehsani, Faezeh Dehghan Ghanatkaman, Ahmad Badruddin Ghazali, Julien Issa, Basel Khalil, Rishi Ramani, Ruben Pauwels

http://arxiv.org/abs/2603.11797v1
The Carnot Bound: Limits and Possibilities for Bandwidth-Efficient Consensus
2026-03-12T10:59:35Z
In leader-based protocols for State Machine Replication (SMR), the leader's outgoing bandwidth is a natural throughput bottleneck. Erasure coding can alleviate this by allowing the leader to send each processor a single fragment of each block, rather than a full copy. The \emph{data expansion rate}, the ratio of total data sent to payload size, determines how close throughput can get to the underlying network bandwidth. We investigate the fundamental limits and possibilities for bandwidth-efficient leader-based consensus. On the negative side, we prove that protocols with 2-round finality (one round of voting) cannot achieve a data expansion rate below approximately 2.5, a bound that is matched by existing protocols. On the positive side, we show that protocols with 3-round finality (two rounds of voting) can push the data expansion rate arbitrarily close to 1. The key insight is that the second voting round provides a recovery mechanism: leaders can attempt aggressive erasure codes and safely fall back to more conservative ones when reconstruction fails, without compromising consistency.
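To make the data expansion rate concrete, the sketch below computes it for a generic $(k, n)$ erasure code in which any $k$ of the $n$ fragments reconstruct the block: each fragment has size payload$/k$, and the leader sends one fragment to each of $n$ processors, giving a rate of $n/k$. The choice $k = n - f$ in `rate_with_threshold` is an assumption made purely for illustration; the abstract does not state which code parameters the protocols use, and the function names are hypothetical.

```python
from fractions import Fraction


def data_expansion_rate(n, k):
    """Total data sent by the leader divided by payload size for a (k, n) erasure code.

    The leader sends each of n processors one fragment of size payload / k,
    so it sends n * payload / k in total.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return Fraction(n, k)


def rate_with_threshold(f, n):
    # Assumption (not from the abstract): reconstruction must succeed
    # from any n - f fragments, so the code uses k = n - f.
    return data_expansion_rate(n, n - f)


f = 100
rate_4f1 = float(rate_with_threshold(f, 4 * f + 1))  # n = 4f + 1 -> (4f+1)/(3f+1)
rate_3f1 = float(rate_with_threshold(f, 3 * f + 1))  # n = 3f + 1 -> (3f+1)/(2f+1)
```

Under this assumption the rate tends to $4/3$ for $n = 4f+1$ and to $3/2$ for $n = 3f+1$ as $f$ grows, and it approaches 1 only as $k$ approaches $n$, which is why a more aggressive code leaves less room for missing fragments.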
We present two protocols with 3-round finality realising this approach. Carnot 1 assumes $n \geq 4f+1$ processors (of which at most $f$ may be Byzantine) and achieves a clean design requiring no additional fragment dissemination beyond the initial protocol messages. Carnot 2 operates under the optimal resilience assumption $n \geq 3f+1$, at the cost of additional fragment dissemination when Byzantine processors interfere. Both protocols can incorporate stable leaders and optimistic proposals to maximise throughput and minimise latency. Under favourable conditions, with correct leaders and few actual faults, both protocols allow leaders to use data expansion rates approaching 1; under adversarial conditions, leaders can revert to safe expansion rates of approximately $1.33$ and $1.5$, respectively.
2026-03-12T10:59:35Z
Andrew Lewis-Pye, Patrick O'Grady

http://arxiv.org/abs/2601.01209v2
OrchestrRL: Dynamic Compute and Network Orchestration for Disaggregated RL
2026-03-12T09:31:22Z
Disaggregating the generation and training stages in RL is widely adopted to scale LLM post-training. There are two critical challenges here. First, the generation stage often becomes a bottleneck due to dynamic workload shifts and severe execution imbalances. Second, the decoupled stages result in diverse and dynamic network traffic patterns that strain the conventional static fabric.
We build OrchestrRL to dynamically orchestrate both compute and network in disaggregated RL. OrchestrRL employs an adaptive compute scheduler that adjusts parallelism configuration to match changing workload characteristics within and across generation steps. OrchestrRL adopts a reconfigurable optical-electrical fabric called RFabric, which leverages optical circuit switches to reconfigure the aggregation and core layers of the topology on demand, tailoring bandwidth resources to the unique communication patterns across various phases of training, generation, and weight synchronization. Evaluated on a 64-H800 GPU testbed, OrchestrRL demonstrates up to a 1.42x throughput improvement over static baselines. Using a high-fidelity simulator, we also show that RFabric achieves superior performance-cost efficiency at scale over static Fat-Tree networks.
2026-01-03T15:27:24Z
Xin Tan, Yicheng Feng, Yu Zhou, Yimin Jiang, Yibo Zhu, Hong Xu

http://arxiv.org/abs/2603.11645v1
Beyond BFS: A Comparative Study of Rooted Spanning Tree Algorithms on GPUs
2026-03-12T08:13:36Z
Rooted spanning trees (RSTs) are a core primitive in parallel graph analytics, underpinning algorithms such as biconnected components and planarity testing. On GPUs, RST construction has traditionally relied on breadth-first search (BFS) due to its simplicity and work efficiency. However, BFS incurs an O(D) step complexity, which severely limits parallelism on high-diameter and power-law graphs. We present a comparative study of alternative RST construction strategies on modern GPUs. We introduce a GPU adaptation of the Path Reversal RST (PR-RST) algorithm, optimizing its pointer-jumping and broadcast operations for modern GPU architecture. In addition, we evaluate an integrated approach that combines a state-of-the-art connectivity framework (GConn) with Eulerian tour-based rooting.
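The pointer-jumping primitive mentioned above can be sketched in a few lines of Python. This is the generic textbook technique simulated sequentially, not the paper's PR-RST adaptation or a GPU kernel; the function name and the parent-array encoding (a root points to itself) are assumptions for illustration.

```python
def pointer_jump_roots(parent):
    """Find the root of every node in a forest by pointer jumping.

    parent[i] is the parent of node i; a root points to itself.
    Each round replaces every pointer by its grandparent, so the distance
    to the root roughly halves and the loop ends after O(log depth) rounds.
    """
    parent = list(parent)
    rounds = 0
    while True:
        # Synchronous update: read the old array, write a new one,
        # mimicking one parallel step across all nodes.
        jumped = [parent[parent[i]] for i in range(len(parent))]
        if jumped == parent:
            return parent, rounds
        parent = jumped
        rounds += 1


# A path of 8 nodes rooted at node 0: 7 -> 6 -> ... -> 1 -> 0.
chain = [0, 0, 1, 2, 3, 4, 5, 6]
roots, rounds = pointer_jump_roots(chain)
```

On this depth-7 chain every node reaches root 0 after 3 rounds, whereas a level-by-level traversal from the root would need 7 steps; that gap is the logarithmic-versus-O(D) step complexity contrast the abstract draws.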
Across more than 10 real-world graphs, our results show that the GConn-based approach achieves up to 300x speedup over optimized BFS on high-diameter graphs. These findings indicate that the O(log n) step complexity of connectivity-based methods can outweigh their structural overhead on modern hardware, motivating a rethinking of RST construction in GPU graph analytics.
2026-03-12T08:13:36Z
Abhijeet Sahu, Srikar Vilas Donur

http://arxiv.org/abs/2503.23830v3
OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
2026-03-12T07:24:41Z
Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. During the exploration of MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a certain modality varies dramatically across different examples. It exacerbates the challenges of addressing mini-batch imbalances, which lead to uneven GPU utilization between Data Parallel (DP) instances and severely degrade the efficiency and scalability of MLLM training, ultimately affecting training speed and hindering further research on MLLMs.
To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results reveal that OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
2025-03-31T08:24:23Z
Yijie Zheng, Bangjun Xiao, Lei Shi, Xiaoyang Li, Faming Wu, Tianyu Li, Xuefeng Xiao, Yang Zhang, Yuxuan Wang, Shouda Liu