https://arxiv.org/api/Vbs99gOcBivRyvgVE6m3pIU58oI 2026-06-10T13:35:22Z 28838 240 15

http://arxiv.org/abs/2409.00876v2
Rapid GPU-Based Pangenome Graph Layout
2026-05-27T20:07:32Z
Computational Pangenomics is an emerging field that studies genetic variation using a graph structure encompassing multiple genomes. Visualizing pangenome graphs is vital for understanding genome diversity. Yet, handling large graphs can be challenging due to the high computational demands of the graph layout process. In this work, we conduct a thorough performance characterization of a state-of-the-art pangenome graph layout algorithm, revealing significant data-level parallelism, which makes GPUs a promising option for compute acceleration. However, irregular data access and the algorithm's memory-bound nature present significant hurdles. To overcome these challenges, we develop a solution implementing three key optimizations: a cache-friendly data layout, coalesced random states, and warp merging. Additionally, we propose a quantitative metric for scalable evaluation of pangenome layout quality. Evaluated on 24 human whole-chromosome pangenomes, our GPU-based solution achieves a 57.3x speedup over the state-of-the-art multithreaded CPU baseline without layout quality loss, reducing execution time from hours to minutes.
2024-09-02T00:05:20Z
Accepted and presented at SC 2024: https://dl.acm.org/doi/10.1109/SC41406.2024.00035
Jiajie Li Jan-Niklas Schmelzle Yixiao Du Simon Heumos Andrea Guarracino Giulia Guidi Pjotr Prins Erik Garrison Zhiru Zhang
10.1109/SC41406.2024.00035

http://arxiv.org/abs/2603.00357v3
SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs
2026-05-27T19:50:56Z
In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time.
However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe - Stacked Parallelism with Adaptive Reordering - a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while maintaining near-constant computation overhead of only 2~3x, even under high redundancy where traditional replication would require linearly inflating overhead. We derive closed-form expressions for endurable failure count and computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale with up to 600k GPUs, SPARe reduces time-to-train by 40~50% compared to traditional replication.
2026-02-27T22:44:27Z
Forty-Third International Conference on Machine Learning (ICML 2026)
Jin Lee Zhonghao Chen Xuhang He Robert Underwood Bogdan Nicolae Franck Cappello Xiaoyi Lu Sheng Di Zheng Zhang

http://arxiv.org/abs/2605.29006v1
IORM: Hierarchical I/O Governance for Thousands of Consolidated Databases on Oracle Exadata
2026-05-27T19:02:45Z
Oracle Exadata consolidates thousands of tenant databases onto shared storage infrastructure deployed at hundreds of customer sites worldwide. Oracle Multitenant architecture enables this extreme density, with thousands of tenant databases sharing a single Exadata storage system -- but this creates a multi-level resource hierarchy (container databases, tenant databases, and workloads within tenants) that commodity block-layer schedulers cannot govern, as they lack visibility into database semantics and tenant boundaries.
This paper presents the I/O Resource Manager (IORM), a storage-side scheduler built on three mechanisms: I/O Tagging, which propagates semantic context from the database kernel to the storage scheduler; Hierarchical Resource Profiles, which express compositional allocation policies across consolidation tiers using shares and limits; and Unified Storage Governance, which applies these policies consistently across all tiers of the storage hierarchy -- persistent memory, flash, and hard disk -- including cache placement decisions. IORM enables successful cloud deployments where thousands of tenants coexist on shared storage: production OLTP workloads run alongside concurrent analytical workloads from the same or different databases without noisy-neighbor interference. Evaluation on production Exadata systems demonstrates that IORM dramatically improves latency consistency, virtually eliminating tail latency outliers and delivering several-fold improvements in average read latency under mixed workloads. Hierarchical limits compose correctly across all three levels, and proportional share allocation tracks configured ratios closely even under highly skewed demand.
2026-05-27T19:02:45Z
13 pages, 4 figures, 6 tables. Accepted to appear in Proceedings of the VLDB Endowment (PVLDB), 2026
Rajarshi Chowdhury Akshay Shah Zakaria Alrmaih Chenhao Guo Anubhav Singh Sue Lee

http://arxiv.org/abs/2605.29002v1
FedQHD: Closed-Form Function-Space Federated Reinforcement Learning
2026-05-27T18:59:40Z
Federated reinforcement learning enables decentralized agents to collaboratively improve policies or value estimates without exchanging raw trajectories. However, FedAvg-style parameter averaging is not function-space consistent: when clients use heterogeneous encoders or even identical nonlinear networks, averaged parameters need not correspond to the weighted average of client value functions in any common function space.
We propose FedQHD, a federated Q-learning method using hyperdimensional (random-feature) state encoders with a linear readout, so that Q-functions are nonlinear in state yet linear in trainable parameters. This linear structure enables closed-form aggregation. With a shared encoder, the function-space consensus update coincides exactly with weighted averaging of local readout matrices. With heterogeneous encoders, the server constructs a global teacher by averaging client Q-values on a shared anchor-state set, and each client compiles this teacher into its local representation via a single ridge projection. We formalize the federation gap -- the error incurred when compiling a federated teacher into a heterogeneous client representation -- relative to a client-specific oracle projection. We show that this gap decomposes into subspace misalignment, anchor-set conditioning, and regularization bias. We further identify the anchor-to-dimension ratio $m \geq D_i$ as the well-conditioned regime in which the gap reduces to a multiple of the encoder heterogeneity floor. On four continuous-state, discrete-action control benchmarks, FedQHD matches or outperforms FedAvg-style baselines and distillation-based alternatives while requiring substantially less computation, and the empirical dependence of the federation gap on encoder dimension matches our theoretical analysis.
2026-05-27T18:59:40Z
Yuchen Hou Yongshan Chen Zhuowen Zou Calvin Yeung Mohsen Imani Tian Lan Mahdi Imani

http://arxiv.org/abs/2604.00736v2
Is RISC-V Ready for Machine Learning? Portable Gaussian Processes Using Asynchronous Tasks
2026-05-27T18:12:14Z
Gaussian processes are widely used in machine learning domains but remain computationally demanding, limiting their efficient scalability across emerging hardware platforms. The GPRat library addresses these challenges using the HPX asynchronous many-task runtime system.
In this work, we extend GPRat to enable portability across multiple hardware architectures and evaluate its performance on representative x86-64, ARM, and RISC-V chips. We conduct node-level strong scaling and problem size scaling benchmarks for Gaussian process prediction and hyperparameter optimization to assess single-core performance, parallel scalability, and architectural efficiency. Our results show that while the x86-64 Zen 2 chip achieves a 58% single-core performance advantage over the ARM-based Fujitsu A64FX, superior parallel scaling allows the 48-core ARM chip to outperform the 64-core Zen 2 by 9% at full node utilization. The evaluated SOPHON SG2042 RISC-V chip exhibits substantially lower performance and weaker scalability, with single-core performance lagging by up to a factor of 14 and large-scale parallel workloads showing slowdowns of up to a factor of 24. For problem size scaling, ARM and x86-64 systems demonstrate comparable performance within 23%. These findings highlight the growing competitiveness of purpose-built ARM chips. Furthermore, they underscore the importance of wide-register vectorization support and improvements to the memory subsystem for upcoming RISC-V platforms, especially when targeted by many-task runtimes.
2026-04-01T11:03:34Z
12 pages, 4 figures, 1 table, accepted at the International Workshop on RISC-V for HPC at ISC High Performance 2026
Alexander Strack Patrick Diehl Dirk Pflüger

http://arxiv.org/abs/2506.11483v4
Capsule: Efficient Player Isolation for Datacenters
2026-05-27T17:57:47Z
We introduce Capsule, a mechanism for seamlessly sharing datacenter resources across multiple players. It decouples player-local and global states to achieve isolation and to maximize cross-player sharing. Our evaluations show that Capsule increases datacenter resource utilization by accommodating up to 2.25x more players without degrading the user experience.
This improvement stems from Capsule consuming up to 1.43x less GPU, 3.11x less VRAM, 3.7x less CPU, and 3.87x less RAM compared to the baseline. We evaluated Capsule across four applications and various hardware configurations, including three distinct servers and a multi-server cluster. These results demonstrate that the Capsule design is portable to other game engines.
2025-06-13T06:12:31Z
4 main pages, 6 more appendix pages, 8 figures; an extended version of EUROGRAPHICS 2026 short paper
Zhouheng Du Nima Davari Li Li Wei Sen Loi Nodir Kodirov
10.2312/egs.20261014

http://arxiv.org/abs/2605.23955v2
From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems
2026-05-27T17:26:09Z
Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets -- quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.
2026-05-11T17:46:38Z
Ruizhe Zhou Xiaoyang Liu Gaoyuan Du Yi Zheng Shouxi Ren Deepayan Chakrabarti Dengdu Jiang

http://arxiv.org/abs/2605.28764v1
SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks
2026-05-27T17:23:00Z
Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud marketplaces), demand heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self-organise into a compute swarm without any central authority. SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute-credit rewards to contributing nodes via a Shapley-value approximation. Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self-regulating participation economy. As nodes specialise toward high-reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms. Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.
2026-05-27T17:23:00Z
Edwin Jose

http://arxiv.org/abs/2605.28426v1
Fault Tolerance of Accelerated Asynchronous Fixed-Point Iterations on Flexible Computing Infrastructure
2026-05-27T12:55:32Z
Asynchronous iterative methods tolerate straggling processors by allowing workers to proceed with stale data, but at a cost: the iterates become inconsistent, potentially degrading convergence. We investigate whether convergence accelerators such as Anderson acceleration compensate for this degradation. We experimentally study three fixed-point iterations: the Jacobi method for sparse linear systems, value iteration for the Bellman equation, and the Hartree--Fock self-consistent field (SCF) iteration. The experiments are conducted using Ray, a high-performance execution framework that abstracts the complexity of distributed systems and enables code parallelization and fault injection with minimal changes. We establish two main results. First, straggler tolerance is universal: asynchronous execution provides wall-clock speedups of $2.9\times$ (Jacobi), $7.7\times$ (VI), and $16.9\times$ (SCF) over synchronous execution with a 100\,ms-delayed worker, independent of whether acceleration is used. Second, Anderson acceleration's effectiveness under asynchrony depends on where staleness enters the computation. We identify two staleness mechanisms: iterate-level corruption, where stale worker returns directly overwrite portions of the accelerated iterate (as in block Jacobi), and evaluation-level perturbation, where staleness acts as a bounded perturbation to the fixed-point map evaluation (as in VI and SCF). Anderson acceleration fails categorically under the first mechanism but retains its benefits under the second, consistent with the perturbation analysis of Toth et al.\ (2017). This distinction, rather than the contraction norm or smoothness of the map, is the primary determinant of whether acceleration survives asynchronous execution.
2026-05-27T12:55:32Z
Evan Coleman Masha Sosonkina
10.1145/3806645.3816236

http://arxiv.org/abs/2605.28400v1
TrioSeq: A Novel Approach to Accelerate Triplet Sequence Alignment on GPUs
2026-05-27T12:37:09Z
State-of-the-art multiple sequence alignment (MSA) algorithms are based on progressive approaches that rely on pairwise sequence alignment (PSA) to generate guide trees to align all sequences. Given the explosion in genomic data availability, research efforts have focused on accelerating PSA on massively-parallel architectures (e.g., GPUs) and specialized hardware (e.g., FPGAs). However, there is increasing evidence that starting from exact 3-way alignments could provide more robust, accurate MSAs, and improve genomic analysis. While the current literature has shown that PSA algorithms can be extended to align sequence triplets, existing work on hardware acceleration of exact 3-way alignments is still scarce. In particular, current GPU methods are still inefficient because they lack support for novel hardware features (e.g., cross-thread intrinsics), while being closed-source and vendor-specific. In this paper, TrioSeq is proposed as a fine-grained strategy to efficiently implement 3-way alignments on GPUs, leveraging novel levels of GPU parallelism and synchronization to achieve high throughput in aligning sequence triplets. Evaluation on NVIDIA and AMD GPUs shows that TrioSeq outperforms state-of-the-art GPU progressive methods on 3-way alignment by at least 20% on simulated genomic datasets.
2026-05-27T12:37:09Z
Published at IPDPS '26 (2026 International Parallel & Distributed Processing Symposium)
Miguel Graça Aleksandar Ilic

http://arxiv.org/abs/2605.28333v1
High-Quality Multi-Constraint Hypergraph Partitioning via Greedy Rebalancing
2026-05-27T11:36:57Z
Multi-constraint hypergraph partitioning is a generalization of balanced partitioning, where the vertex set of a hypergraph is partitioned such that the inter-block connectivity of hyperedges is minimized while balancing the vertices with regard to $d$ distinct constraints. A prominent class of applications is data distribution tasks, where this makes it possible to achieve good load balance for $d$ different kinds of resources and simultaneously minimize the communication volume. Although the best approaches for single-constraint partitioning are usually complex (multilevel) algorithms with many components, we show that replacing only one component already leads to high-quality multi-constraint partitions: the rebalancing step, which restores balance for a partition that has (hopefully) small connectivity but violates the constraints. We design a multi-constraint rebalancing algorithm based on greedy local search, proving that balance is always restored for $d=2$ and bounded maximum weight. The key is to ensure monotonically decreasing global imbalance by choosing an imbalance metric where there is always a balance-improving move available. Integrating our algorithm into the state-of-the-art partitioner Mt-KaHyPar, we demonstrate an 11.5\,\% geometric mean connectivity reduction compared to the next best competitor (Metis) and better reliability regarding partition balance, even though the majority of inputs is outside of the theoretical guarantee.
2026-05-27T11:36:57Z
Submitted to ESA 2026
Nikolai Maas

http://arxiv.org/abs/2602.03515v2
Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation
2026-05-27T11:36:21Z
Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. We trace this pathology to a specific property of the optimization landscape: the misalignment between the Hessian eigenbasis and the standard coordinate basis, which triggers oscillations in the update trajectories of coordinate-wise adaptive optimizers. We identify that these oscillations cause delayed updates to diverge from their true counterparts, invalidating their use for current iterations. This insight is formalized through theoretical analysis, including a convergence bound showing that basis misalignment amplifies the delay penalty, and substantiated with empirical evaluation. To address this, we propose basis rotation, a framework that rotates the optimizer's coordinate system to align with the Hessian eigenbasis, keeping delayed updates useful. We theoretically demonstrate that basis rotation minimizes basis misalignment, thereby counteracting the conditions that amplify delay penalties. Empirically, in training up to a 3B-parameter LLM, basis rotation reduces the required iterations by 81.7\% compared to the best-performing asynchronous baseline.
2026-02-03T13:31:51Z
ICML 2026
Hyunji Jung Sungbin Shin Namhoon Lee

http://arxiv.org/abs/2605.28302v1
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
2026-05-27T10:55:57Z
Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.
2026-05-27T10:55:57Z
Hanjiang Wu Abhimanyu Rajeshkumar Bambhaniya Sarbartha Banerjee Tuhin Khare Sudarshan Srinivasan Suvinay Subramanian Souvik Kundu Madhu Kumar Midhilesh Elavazhagan William Won Amir Yazdanbakhsh Tushar Krishna

http://arxiv.org/abs/2605.28205v1
Resource Allocation in HyperX Networks
2026-05-27T09:27:13Z
As high-performance computing systems scale in size and complexity, efficient resource management is essential to minimize communication overhead. The HyperX is a richly connected, low-diameter network that offers a scalable and cost-effective alternative to traditional topologies. However, resource allocation in HyperX remains underexplored, and strategies designed for networks like Torus, Fat-tree, or Dragonfly do not directly transfer. In this work, we propose and formalize several resource allocation strategies for HyperX networks, categorized into linear, geometric, and stochastic functions. We characterize these strategies theoretically by analyzing their topological properties, including dilation, convexity, and partition bandwidth. Furthermore, we conduct an exhaustive experimental evaluation using synthetic traffic and application communication kernels to assess the impact of these strategies on performance under different routing algorithms. Our results indicate that partition bandwidth and switch locality are decisive factors in mitigating interference. Notably, the Diagonal allocation strategy, which is not convex, consistently outperforms traditional approaches in most scenarios. Finally, we provide a set of lessons learned to guide the implementation of resource allocation policies in HPC systems based on HyperX networks.
2026-05-27T09:27:13Z
Alejandro Cano Cristóbal Camarero Carmen Martínez Ramón Beivide

http://arxiv.org/abs/2512.09800v2
Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers
2026-05-27T09:16:32Z
Low-power microcontroller (MCU) hardware is currently evolving from single-core architectures to predominantly multi-core architectures. In parallel, new embedded software building blocks are increasingly written in Rust, while C/C++ dominance fades in this domain. On the other hand, small artificial neural networks (ANN) of various kinds are increasingly deployed in edge AI use cases, thus deployed and executed directly on low-power MCUs. In this context, both incremental improvements and novel innovative services will have to be continuously retrofitted using ANN execution in software embedded on sensing/actuating systems already deployed in the field. However, there has so far been no Rust embedded software platform automating parallelization for inference computation on multi-core MCUs executing arbitrary TinyML models. This paper thus fills this gap by introducing Ariel-ML, a novel toolkit we designed combining a generic TinyML pipeline and an embedded Rust software platform which can take full advantage of multi-core capabilities of various 32-bit microcontroller families (Arm Cortex-M, RISC-V, ESP-32). We published the full open source code of its implementation, which we used to benchmark its capabilities using a zoo of various TinyML models. We show that Ariel-ML outperforms prior art in terms of inference latency as expected, and we show that, compared to pre-existing toolkits using embedded C/C++, Ariel-ML achieves comparable memory footprints. Ariel-ML thus provides a useful basis for TinyML practitioners and resource-constrained embedded Rust developers.
2025-12-10T16:13:29Z
Zhaolan Huang Kaspar Schleiser Gyungmin Myung Emmanuel Baccelli