http://arxiv.org/abs/2405.14430v4 PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference 2025-12-03T10:59:23Z

This paper presents PipeFusion, a parallel methodology to tackle the high latency of generating high-resolution images with diffusion transformer (DiT) models. PipeFusion partitions images into patches and the model layers across multiple GPUs. It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs from successive diffusion steps, PipeFusion reuses one-step stale feature maps to provide context for the current pipeline step. This approach notably reduces communication costs compared to existing DiT inference parallelism, including tensor parallel, sequence parallel and DistriFusion. PipeFusion enhances memory efficiency through parameter distribution across devices, which makes it well suited to large DiTs like Flux.1. Experimental results demonstrate that PipeFusion achieves state-of-the-art performance on 8$\times$L40 PCIe GPUs for Pixart, Stable-Diffusion 3, and Flux.1 models. Our source code is available at https://github.com/xdit-project/xDiT.

2024-05-23T11:00:07Z Jiarui Fang Jinzhe Pan Aoyu Li Xibo Sun Jiannan Wang http://arxiv.org/abs/2512.03565v1 Tuning of Vectorization Parameters for Molecular Dynamics Simulations in AutoPas 2025-12-03T08:42:44Z

Molecular Dynamics simulations can help scientists gain valuable insights into physical processes on an atomic scale. This work explores various techniques for SIMD vectorization to improve the pairwise force calculation between molecules in the scope of the particle simulation library AutoPas. The focus lies on the order in which particle values are loaded into vector registers to achieve the best performance regarding execution time or energy consumption. As previous work indicates that the optimal MD algorithm can change during runtime, this paper investigates simulation-specific parameters like particle density and the impact of the neighbor identification algorithms, which distinguishes this work from related projects. Furthermore, AutoPas' dynamic tuning mechanism is extended to choose the optimal vectorization order during runtime. The benchmarks show that considering different particle interaction orders during runtime can lead to a considerable performance improvement for the force calculation compared to AutoPas' previous approach.

2025-12-03T08:42:44Z 20 pages, 8 figures. Submitted to the 5th International Conference on Computational Engineering (ICCE 2024). No changes were made after the peer review process Luis Gall Samuel James Newcome Fabio Alexander Gratl Markus Mühlhäußer Manish Kumar Mishra Hans-Joachim Bungartz http://arxiv.org/abs/2509.24091v3 PerfBench: Can Agents Resolve Real-World Performance Bugs? 2025-12-02T20:55:19Z

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown promise in automated bug fixing, existing benchmarks primarily focus on functional correctness and fail to evaluate agents' abilities to identify and resolve non-functional issues like performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing execution metrics collected for the developer fix and the agent fix. Each task in PerfBench is derived from actual developer fixes linked to performance-related issues, which are then verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with the baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark. We show that by ensuring the agent has proper instructions to benchmark its changes and tooling for benchmark output processing, we can improve the agent's performance significantly, but room for improvement remains. PerfBench provides a challenging test set for furthering the capabilities of agents in fixing performance issues.

2025-09-28T22:00:33Z Spandan Garg Roshanak Zilouchian Moghaddam Neel Sundaresan http://arxiv.org/abs/2511.09956v2 Optimizing CPU Cache Utilization in Cloud VMs with Accurate Cache Abstraction 2025-12-02T01:24:46Z

This paper shows that cache-based optimizations are often ineffective in cloud virtual machines (VMs) due to limited visibility into and control over provisioned caches. In public clouds, CPU caches can be partitioned or shared among VMs, but a VM is unaware of cache provisioning details. Moreover, a VM cannot influence cache usage via page placement policies, as memory-to-cache mappings are hidden. The paper proposes a novel solution, CacheX, which probes an accurate and fine-grained cache abstraction within VMs using eviction sets without requiring hardware or hypervisor support, and showcases the utility of the probed information with two new techniques: LLC contention-aware task scheduling and virtual color-aware page cache management. Our evaluation of CacheX's implementation in the x86 Linux kernel demonstrates that it can effectively improve cache utilization for various workloads in public cloud VMs.

2025-11-13T04:37:52Z Mani Tofigh Edward Guo Weiwei Jia Xiaoning Ding Zirui Neil Zhao Jianchen Shan http://arxiv.org/abs/2512.02090v1 Scalable, Cloud-Based Simulations of Blood Flow and Targeted Drug Delivery in Retinal Capillaries 2025-12-01T16:28:54Z

We investigate the capabilities of cloud computing for large-scale, tightly-coupled simulations of biological fluids in complex geometries, traditionally performed in supercomputing centers. We demonstrate scalable and efficient simulations in the public cloud. We perform meso-scale simulations of blood flow in image-reconstructed capillaries, and examine targeted drug delivery by artificial bacterial flagella (ABFs). The simulations deploy dissipative particle dynamics (DPD) with two software frameworks, Mirheo (developed by our team) and LAMMPS. Mirheo exhibits remarkable weak scalability for up to 512 GPUs. Similarly, LAMMPS demonstrated excellent weak scalability for pure solvent as well as for blood suspensions and ABFs in reconstructed retinal capillaries. In particular, LAMMPS maintained weak scaling above 90% on the cloud for up to 2,000 cores. Our findings demonstrate that cloud computing can support tightly coupled, large-scale scientific simulations with competitive performance.
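
The weak-scaling figures quoted above are conventionally computed as the ratio of the baseline runtime to the runtime on more processors, with the workload per processor held fixed. The sketch below assumes that convention; the function name and the timings are hypothetical and are not taken from the paper.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Return T(base) / T(scaled) for runs with a fixed workload per processor.

    A value of 1.0 means perfect weak scaling; lower values mean the larger
    run took longer than the baseline despite the same per-processor load.
    """
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("runtimes must be positive")
    return t_base / t_scaled


# Hypothetical wall-clock times in seconds (illustrative only, not measured data).
efficiency = weak_scaling_efficiency(t_base=100.0, t_scaled=108.0)
print(f"weak-scaling efficiency: {efficiency:.1%}")
```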

2025-12-01T16:28:54Z Lucas Amoudruz Sergey Litvinov Riccardo Murri Volker Eyrich Jens Zudrop Costas Bekas Petros Koumoutsakos http://arxiv.org/abs/2504.18583v4 PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation 2025-11-30T08:28:00Z

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a promising solution, adopting a draft-then-verify strategy to accelerate token generation. While the EAGLE series achieves strong acceleration, its requirement of training a separate draft head for each target model introduces substantial adaptation costs. In this work, we propose \textbf{PARD (PARallel Draft)}, a novel speculative decoding method featuring \textit{target-independence} and \textit{parallel token prediction}. Specifically, PARD enables a single draft model to be applied across an entire family of target models without requiring separate training for each variant, thereby minimizing adaptation costs. Meanwhile, PARD substantially accelerates inference by predicting multiple future tokens within a single forward pass of the draft phase. To further reduce the training adaptation cost of PARD, we propose a COnditional Drop-token (COD) mechanism based on the integrity of prefix key-value states, enabling autoregressive draft models to be adapted into parallel draft models at low-cost. Our experiments show that the proposed COD method improves draft model training efficiency by \textbf{3$\times$} compared with traditional masked prediction training. On the \texttt{vLLM} inference framework, PARD achieves up to \textbf{3.67$\times$} speedup on LLaMA3.1-8B, reaching \textbf{264.88} tokens per second, which is \textbf{1.15$\times$} faster than EAGLE-3. Our code is available at https://github.com/AMD-AIG-AIMA/PARD.
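
The draft-then-verify strategy mentioned above can be illustrated with a toy greedy variant. This is a minimal sketch under stated assumptions, not PARD's method: the "models" are hypothetical deterministic functions, the draft proposes tokens one at a time (PARD predicts several in a single forward pass), and verification is simulated with sequential calls where a real system would use one batched target pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(next_token: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], k: int, max_new: int) -> List[int]:
    """Greedy draft-then-verify decoding; output matches greedy target decoding."""
    tokens = list(prompt)
    limit = len(prompt) + max_new
    while len(tokens) < limit:
        # Draft phase: the cheap model proposes k tokens.
        ctx = list(tokens)
        draft = []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # Verify phase: keep target tokens while they agree with the draft.
        ctx = list(tokens)
        for proposed in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # on a mismatch this is the correction token
            if expected != proposed:
                break
        else:
            ctx.append(target_next(ctx))  # all drafts accepted: one bonus token
        tokens = ctx
    return tokens[:limit]


# Hypothetical toy models: the draft disagrees with the target at some positions.
def toy_target(ctx: List[int]) -> int:
    return (7 * ctx[-1] + len(ctx)) % 11


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11


print(speculative_decode(toy_target, toy_draft, prompt=[3], k=4, max_new=12))
```

Because every emitted token is the target's own choice for its context, the result is identical to plain greedy decoding with the target; the draft only affects how many target steps can be checked together.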

2025-04-23T12:27:43Z Submitted for possible publication Zihao An Huajun Bai Ziqiong Liu Dong Li Emad Barsoum http://arxiv.org/abs/2512.00639v1 Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation 2025-11-29T21:24:36Z

The increasing prevalence of thyroid cancer globally has led to the development of various computer-aided detection methods. Accurate segmentation of thyroid nodules is a critical first step in the development of AI-assisted clinical decision support systems. This study focuses on instance segmentation of thyroid nodules using YOLOv5 algorithms on ultrasound images. We evaluated multiple YOLOv5 variants (Nano, Small, Medium, Large, and XLarge) across two dataset versions, with and without Doppler images. The YOLOv5-Large algorithm achieved the highest performance with a Dice score of 91\% and mAP of 0.87 on the dataset including Doppler images. Notably, our results demonstrate that Doppler images, typically excluded by physicians, can significantly improve segmentation performance. The YOLOv5-Small model achieved a 79\% Dice score when Doppler images were excluded, while including them improved performance across all model variants. These findings suggest that instance segmentation with YOLOv5 provides an effective real-time approach for thyroid nodule detection, with potential clinical applications in automated diagnostic systems.
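
For reference, the Dice score reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a ground-truth mask B. A minimal sketch, assuming masks are represented as sets of foreground pixel coordinates; the function name and the example masks are hypothetical.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def dice_score(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) over sets of foreground pixels."""
    if not pred and not truth:
        return 1.0  # convention used here: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


# Hypothetical 3-pixel masks sharing two pixels (illustrative only).
pred_mask = {(0, 0), (0, 1), (1, 1)}
true_mask = {(0, 1), (1, 1), (2, 2)}
print(dice_score(pred_mask, true_mask))
```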

2025-11-29T21:24:36Z Mahmoud El Hussieni http://arxiv.org/abs/2512.00288v1 PORTAL: Controllable Landscape Generator for Continuous Optimization-Part I: Framework 2025-11-29T02:57:13Z

Benchmarking is central to optimization research, yet existing test suites for continuous optimization remain limited: classical collections are fixed and rigid, while previous generators cover only narrow families of landscapes with restricted variability and control over details. This paper introduces PORTAL (Platform for Optimization Research, Testing, Analysis, and Learning), a general benchmark generator that provides fine-grained, independent control over basin curvature, conditioning, variable interactions, and surface ruggedness. PORTAL's layered design spans from individual components to block-wise compositions of multi-component landscapes with controllable partial separability and imbalanced block contributions. It offers precise control over the shape of each component in every dimension and direction, and supports diverse transformation patterns through both element-wise and coupling operators with compositional sequencing. All transformations preserve component centers and local quadratic structure, ensuring stability and interpretability. A principled neutralization mechanism prevents unintended component domination caused by exponent or scale disparities, which addresses a key limitation of prior landscape generators. On this foundation, transformations introduce complex landscape characteristics, such as multimodality, asymmetry, and heterogeneous ruggedness, in a controlled and systematic way. PORTAL enables systematic algorithm analysis by supporting both isolation of specific challenges and progressive difficulty scaling. It also facilitates the creation of diverse datasets for meta-algorithmic research, tailored benchmark suite design, and interactive educational use. The complete Python and MATLAB source code for PORTAL is publicly available at [https://github.com/EvoMindLab/PORTAL].

2025-11-29T02:57:13Z 15 pages, 1 figure Danial Yazdani Mai Peng Delaram Yazdani Shima F. Yazdi Mohammad Nabi Omidvar Yuan Sun Trung Thanh Nguyen Changhe Li Xiaodong Li http://arxiv.org/abs/2511.22551v1 3RSeT: Read Disturbance Rate Reduction in STT-MRAM Caches by Selective Tag Comparison 2025-11-27T15:39:45Z

Recent developments in memory technologies have introduced Spin-Transfer Torque Magnetic RAM (STT-MRAM) as the most promising replacement for SRAMs in on-chip cache memories. Despite its lower leakage power, higher density, immunity to radiation-induced particles, and non-volatility, STT-MRAM suffers from unintentional bit flips during read operations, referred to as read disturbance errors, which are a severe reliability challenge in STT-MRAM caches. One major source of read disturbance errors in STT-MRAM caches is the simultaneous access to all tags of a cache set for parallel comparison, which has not been addressed in previous work. This paper first demonstrates that frequent read accesses to the tag array greatly increase the read disturbance rate and then proposes a low-cost scheme, Read Disturbance Rate Reduction in STT-MRAM Caches by Selective Tag Comparison (3RSeT), to reduce the error rate by eliminating a significant portion of tag reads. 3RSeT proactively disables the tags that cannot produce a hit, using the least significant bits of the tags on each access request. Our evaluations using the gem5 full-system cycle-accurate simulator show that 3RSeT reduces the read disturbance rate in the tag array by 71.8%, which results in a 3.6x improvement in Mean Time To Failure (MTTF). In addition, the energy consumption is reduced by 62.1% without compromising performance and with less than 0.4% area overhead.
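
The idea of filtering tag reads by their low-order bits can be sketched as a small software model. This is an illustration under assumptions, not the paper's hardware design: it assumes the low-order bits of each tag can be checked without reading the full tag (for example from a small side structure), and the function name, tag values, and bit width are hypothetical.

```python
from typing import List, Optional, Tuple


def selective_tag_lookup(tags: List[Optional[int]], request_tag: int,
                         low_bits: int) -> Tuple[Optional[int], int]:
    """Look up request_tag in one cache set.

    Returns (hit way or None, number of full tag reads). A way's full tag is
    read only if its low-order bits match those of the request; `None` in
    `tags` marks an invalid way.
    """
    mask = (1 << low_bits) - 1
    full_reads = 0
    hit_way = None
    for way, tag in enumerate(tags):
        if tag is None:
            continue
        if (tag & mask) != (request_tag & mask):
            continue  # cannot hit, so the full tag is never read
        full_reads += 1
        if tag == request_tag:
            hit_way = way
    return hit_way, full_reads


# Hypothetical 4-way set with 6-bit tags, filtered on the 2 low-order bits.
set_tags = [0b101100, 0b011101, 0b110100, 0b001111]
print(selective_tag_lookup(set_tags, 0b110100, low_bits=2))
```

In this example a conventional parallel lookup would read all four tags, while the filtered lookup reads only the ways whose low-order bits match the request.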

2025-11-27T15:39:45Z Elham Cheshmikhani Hamed Farbeh Hossein Asad http://arxiv.org/abs/2511.22467v1 Motion-to-Motion Latency Measurement Framework for Connected and Autonomous Vehicle Teleoperation 2025-11-27T13:58:27Z

Latency is a key performance factor for the teleoperation of Connected and Autonomous Vehicles (CAVs). It affects how quickly an operator can perceive changes in the driving environment and apply corrective actions. Most existing work focuses on Glass-to-Glass (G2G) latency, which captures delays only in the video pipeline. However, there is no standard method for measuring Motion-to-Motion (M2M) latency, defined as the delay between the physical steering movement of the remote operator and the corresponding steering motion in the vehicle. This paper presents an M2M latency measurement framework that uses Hall-effect sensors and two synchronized Raspberry Pi~5 devices. The system records interrupt-based timestamps on both sides to estimate M2M latency, independently of the underlying teleoperation architecture. Precision tests show an accuracy of 10--15~ms, while field results indicate that actuator delays dominate M2M latency, with median values above 750~ms.
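
The measurement principle described above, pairing a timestamp from the operator side with one from the vehicle side, reduces to a difference of synchronized clock readings. A minimal sketch, assuming both devices report nanosecond timestamps from synchronized clocks and that events are already paired in order; the function name and the sample values are hypothetical, not measurements from the paper.

```python
from statistics import median
from typing import List


def m2m_latencies_ms(operator_ns: List[int], vehicle_ns: List[int]) -> List[float]:
    """Per-event Motion-to-Motion latency in milliseconds.

    operator_ns[i] is when the operator's steering motion was detected and
    vehicle_ns[i] is when the matching vehicle steering motion was detected.
    """
    if len(operator_ns) != len(vehicle_ns):
        raise ValueError("each operator event needs a matching vehicle event")
    return [(v - o) / 1e6 for o, v in zip(operator_ns, vehicle_ns)]


# Hypothetical paired timestamps in nanoseconds (illustrative only).
operator_events = [1_000_000_000, 2_000_000_000, 3_000_000_000]
vehicle_events = [1_760_000_000, 2_810_000_000, 3_755_000_000]
latencies = m2m_latencies_ms(operator_events, vehicle_events)
print(latencies, median(latencies))
```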

2025-11-27T13:58:27Z François Provost Faisal Hawlader Mehdi Testouri Raphaël Frank http://arxiv.org/abs/2506.17084v2 JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows 2025-11-26T19:13:42Z

In modern science, the growing complexity of large-scale scientific projects has led to an increasing reliance on cross-facility scientific workflows, where resources and expertise from multiple institutions and geographic locations are leveraged to accelerate scientific discovery. These workflows often require transmitting huge amounts of scientific data through wide-area networks. Although high-speed networks like ESnet and transfer services such as Globus have improved data mobility, several challenges remain. The sheer volume of data can overwhelm network bandwidth, widely used transport protocols such as TCP suffer from inefficiencies due to retransmissions triggered by packet loss, and existing fault-tolerance mechanisms like erasure coding introduce substantial overhead. In this paper, we propose JANUS, a resilient and adaptable data transmission approach designed for cross-facility scientific workflows. Unlike traditional TCP-based methods, JANUS leverages UDP, integrates erasure coding for fault tolerance, and combines it with error-bounded lossy compression to reduce overhead. This novel design allows users to balance data transmission time and accuracy, optimizing transfer performance based on specific scientific requirements. Additionally, JANUS dynamically adjusts erasure coding parameters in response to real-time network conditions, ensuring efficient data transfers even in fluctuating environments. We develop optimization models for determining ideal configurations and implement adaptive data transfer protocols to enhance reliability. Through extensive simulations and real-network experiments, we demonstrate that JANUS significantly improves transfer efficiency while maintaining data fidelity.
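
To make the role of erasure coding over a lossy transport concrete, the sketch below uses the simplest possible erasure code: a single XOR parity block that lets the receiver rebuild one lost block without a retransmission. This is not JANUS's coding scheme, which the abstract does not specify; the function names and payloads are hypothetical, and all blocks are assumed to have equal length.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks: List[bytes]) -> List[bytes]:
    """Append one parity block so that any single lost block can be rebuilt."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(received: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the data blocks; `None` marks a block lost in transit."""
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair at most one lost block")
    repaired = list(received)
    if missing:
        present = [block for block in received if block is not None]
        repaired[missing[0]] = xor_blocks(present)
    return repaired[:-1]  # drop the parity block


# Hypothetical payload split into three blocks; the second block is dropped.
packets = encode([b"abcd", b"efgh", b"ijkl"])
packets[1] = None
print(recover(packets))
```

The trade-off the abstract points to is visible even here: the parity block adds one extra block of traffic per group, so the group size and redundancy level determine how much overhead is paid for a given loss tolerance.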

2025-06-20T15:40:14Z Vladislav Esaulov Jieyang Chen Norbert Podhorszki Fred Suter Scott Klasky Anu G Bourgeois Lipeng Wan http://arxiv.org/abs/2511.21535v1 Modeling the Effect of Data Redundancy on Speedup in MLFMA Near-Field Computation 2025-11-26T16:01:32Z

The near-field (P2P) operator in the Multilevel Fast Multipole Algorithm (MLFMA) is a performance bottleneck on GPUs due to poor memory locality. This work introduces data redundancy to improve spatial locality by reducing memory access dispersion. To validate the results, we propose an analytical model based on a Locality metric that combines data volume and access dispersion to predict speedup trends without hardware-specific profiling. The approach is validated on two MLFMA-based applications: an electromagnetic solver (DBIM-MLFMA) with regular structure, and a stellar dynamics code (PhotoNs-2.0) with irregular particle distribution. Results show up to 7X kernel speedup due to improved cache behavior. However, increased data volume raises overheads in data restructuring, limiting end-to-end application speedup to 1.04X. While the model cannot precisely predict absolute speedups, it reliably captures performance trends across different problem sizes and densities. The technique can be added to existing implementations with minimal code changes. This work demonstrates that data redundancy can enhance GPU performance for the P2P operator, provided locality gains outweigh data movement costs.
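
The gap between a large kernel speedup and a small end-to-end speedup can be reasoned about with an Amdahl-style estimate extended by an overhead term. This is a generic sketch and not the paper's Locality-based model; the function name, parameters, and input values are hypothetical.

```python
def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float,
                       overhead_fraction: float = 0.0) -> float:
    """Amdahl-style estimate of whole-application speedup.

    kernel_fraction:   share of the original runtime spent in the kernel.
    kernel_speedup:    factor by which the kernel alone gets faster.
    overhead_fraction: extra work added by the optimization (for example
                       data restructuring), as a share of the original runtime.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0 or overhead_fraction < 0:
        raise ValueError("invalid speedup or overhead")
    new_time = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup + overhead_fraction
    return 1.0 / new_time


# Hypothetical inputs (illustrative only): the overhead cancels the kernel gain.
print(end_to_end_speedup(0.5, 2.0))
print(end_to_end_speedup(0.5, 2.0, overhead_fraction=0.25))
```

Under this estimate, a faster kernel helps the whole application only when the time it saves exceeds the added overhead, which matches the condition stated in the abstract's last sentence.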

2025-11-26T16:01:32Z Morteza Sadeghi http://arxiv.org/abs/2505.05623v3 Characterizing GPU Energy Usage in Exascale-Ready Portable Science Applications 2025-11-26T15:04:27Z

We characterize the GPU energy usage of two widely adopted exascale-ready applications representing two classes of particle and mesh solvers: (i) QMCPACK, a quantum Monte Carlo package, and (ii) AMReX-Castro, an adaptive mesh astrophysical code. We analyze power, temperature, utilization, and energy traces from double-/single (mixed)-precision benchmarks on NVIDIA's A100 and H100 and AMD's MI250X GPUs using queries in NVML and rocm_smi_lib, respectively. We explore application-specific metrics to provide insights on energy vs. performance trade-offs. Our results suggest that mixed-precision energy savings range between 6-25% on QMCPACK and 45% on AMReX-Castro. Also, we found gaps in the AMD tooling used on Frontier GPUs that need to be understood, while query resolutions on NVML have little variability between 1 ms-1 s. Overall, application-level knowledge is crucial to define energy-cost/science-benefit opportunities for the codesign of future supercomputer architectures in the post-Moore era.
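
Energy figures of the kind discussed above are typically obtained by integrating a sampled power trace over time. A minimal sketch using the trapezoidal rule, assuming the power samples have already been collected (for example by polling a vendor management library); the function name and the trace values are hypothetical and no GPU library is called here.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate a power trace (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical trace sampled once per second (illustrative only).
times = [0.0, 1.0, 2.0, 3.0]
watts = [100.0, 200.0, 200.0, 100.0]
print(energy_joules(times, watts))
```

The sampling interval matters for this estimate: short power spikes between samples are missed, which is one reason the query resolution discussed in the abstract is relevant.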

2025-05-08T20:02:45Z 13 pages, 8 figures, 3 tables. Accepted at the Energy Efficiency with Sustainable Performance: Techniques, Tools, and Best Practices, EESP Workshop, in conjunction with ISC High Performance 2025 In: Neuwirth, S., Paul, A.K., Weinzierl, T., Carson, E.C. (eds) High Performance Computing. ISC High Performance 2025. Lecture Notes in Computer Science, vol 16091. Springer, Cham William F. Godoy Oscar Hernandez Paul R. C. Kent Maria Patrou Kazi Asifuzzaman Narasinga Rao Miniskar Pedro Valero-Lara Jeffrey S. Vetter Matthew D. Sinclair Jason Lowe-Power Bobby R. Bruce 10.1007/978-3-032-07612-0_14 http://arxiv.org/abs/2511.21413v1 Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM 2025-11-26T14:06:22Z

Due to rising demands for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose a solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer \textit{RAMSES}. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring only an overhead of approximately 500 ms in terms of end-to-end latency.

2025-11-26T14:06:22Z 6 pages, 3 figures Tim Trappen Robert Keßler Roland Pabel Viktor Achter Stefan Wesner 10.1145/3774902.3776632 http://arxiv.org/abs/2511.20834v1 Accelerating Sparse Convolutions in Voxel-Based Point Cloud Networks 2025-11-25T20:34:37Z

Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and AR/VR. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous: neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes: (i) a high-performance one-shot search algorithm that builds the kernel map with no preprocessing and high memory locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. Our evaluation shows that Spira significantly outperforms prior SpC engines by 1.71x on average and up to 2.31x for end-to-end inference, and by 2.13x on average and up to 3.32x for layer-wise execution across diverse layer configurations.
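
The kernel map described above can be illustrated with a naive hash-table construction. This sketch is a plain baseline for intuition, not Spira's one-shot search: it assumes a stride-1 submanifold convolution in which output coordinates equal input coordinates, and the function name and example voxels are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def build_kernel_map(coords: List[Coord],
                     kernel_size: int = 3) -> Dict[Coord, List[Tuple[int, int]]]:
    """Map each weight offset to a list of (input index, output index) pairs.

    For every output voxel and every offset inside the kernel window, the
    neighboring coordinate is looked up in a hash table of occupied voxels.
    """
    index_of = {coord: i for i, coord in enumerate(coords)}
    radius = kernel_size // 2
    kernel_map: Dict[Coord, List[Tuple[int, int]]] = {}
    for offset in product(range(-radius, radius + 1), repeat=3):
        pairs = []
        for out_idx, (x, y, z) in enumerate(coords):
            neighbor = (x + offset[0], y + offset[1], z + offset[2])
            in_idx = index_of.get(neighbor)
            if in_idx is not None:
                pairs.append((in_idx, out_idx))
        if pairs:
            kernel_map[offset] = pairs
    return kernel_map


# Hypothetical voxels: two adjacent along x and one isolated.
voxels = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
for offset, pairs in build_kernel_map(voxels).items():
    print(offset, pairs)
```

Each entry tells the convolution which input feature vector to multiply by the weight at that offset and which output feature vector to accumulate into; the per-offset hash lookups are the kind of construction cost the abstract says prior engines pay.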

2025-11-25T20:34:37Z Dionysios Adamopoulos Anastasia Poulopoulou Georgios Goumas Christina Giannoula