https://arxiv.org/api/gxMn41DvXE/FMrMDgaH9iI4CQfc 2026-03-30T08:54:58Z 5080 225 15 http://arxiv.org/abs/2512.14445v1 Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs 2025-12-16T14:31:49Z

In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to synchronize their start and possibly departure times. This is true of many parallelized machine learning workloads, and the popular Apache Spark processing engine has recently added support for Barrier Execution Mode, which allows users to add such barriers to their jobs. These barriers necessarily result in idle periods on some of the workers, which reduces their stability and performance, compared to equivalent workloads with no barriers. In this paper we will consider and analyze the stability and performance penalties resulting from barriers. We include an analysis of the stability of $(s,k,l)$ barrier systems that allow jobs to depart after $l$ out of $k$ of their tasks complete. We also derive and evaluate performance bounds for hybrid barrier systems servicing a mix of jobs, both with and without barriers, and with varying degrees of parallelism. For the purely 1-barrier case we compare the bounds and simulation results to benchmark data from a standalone Spark system. We study the overhead in the real system, and based on its distribution we attribute it to the dual event and polling-driven mechanism used to schedule barrier-mode jobs. We develop a model for this type of overhead and validate it against the real system through simulation.

2025-12-16T14:31:49Z Brenton Walker Markus Fidler http://arxiv.org/abs/2512.14297v1 A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks 2025-12-16T11:11:37Z

Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial networks. These events violate IEC~61850-derived quality-of-service requirements and user-defined service-level agreements, hindering the reliable and timely delivery of control, monitoring, and best-effort traffic in IEC~61400-25-compliant wind power plants. Failure to maintain these requirements often results in delayed or lost control signals, reduced operational efficiency, and increased risk of wind turbine generator downtime. To address these challenges, this study proposes a threshold-triggered Deep Q-Network self-healing agent that autonomically detects, analyzes, and mitigates network disruptions while adapting routing behavior and resource allocation in real time. The proposed agent was trained, validated, and tested on an emulated tri-clustered switch network deployed in a cloud-based proof-of-concept testbed. Simulation results show that the proposed agent improves disruption recovery performance by 53.84% compared to a baseline shortest-path and load-balanced routing approach and outperforms state-of-the-art methods, including the Adaptive Network-based Fuzzy Inference System by 13.1% and the Deep Q-Network and traffic prediction-based routing optimization method by 21.5%, in a super-spine leaf data-plane architecture. Additionally, the agent maintains switch thermal stability by proactively initiating external rack cooling when required. These findings highlight the potential of deep reinforcement learning in building resilience in software-defined industrial networks deployed in mission-critical, time-sensitive application scenarios.

2025-12-16T11:11:37Z Agrippina Mwangi Utrecht University, The Netherlands León Navarro-Hilfiker Ørsted, USA Lukasz Brewka Ørsted, Denmark Mikkel Gryning Ørsted, Denmark Elena Fumagalli Utrecht University, The Netherlands Madeleine Gibescu Utrecht University, The Netherlands 10.1109/TNSM.2025.3647853 http://arxiv.org/abs/2512.14151v1 Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement 2025-12-16T07:16:10Z

Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals. These workloads generate highly irregular and bursty access patterns, causing traditional prefetching and replacement policies to mispredict and trigger severe cache pollution, thereby degrading system performance. To address this challenge, this paper proposes an Adaptive Cache Pollution Control (ACPC) mechanism tailored for LLM inference workloads, integrating Temporal Convolutional Network (TCN)-based access prediction with a priority-aware replacement strategy. The TCN module learns temporal dependencies in token access sequences to identify potential high-reuse cache lines, while the replacement policy dynamically adjusts eviction priorities based on predicted reuse likelihood and cache occupancy. The proposed framework is implemented and evaluated on representative transformer-based inference traces, including GPT-style autoregressive decoding and embedding retrieval workloads. Experimental results demonstrate that ACPC reduces cache pollution by 41.7 percent, improves cache hit rate by 8.9 percent, and achieves a 60.0 percent reduction in L2 miss penalty, compared with state-of-the-art machine-learning-based replacement baselines. Additionally, the proposed Temporal CNN-based ACPC framework increases token generation throughput by 15.9 percent and achieves the lowest final loss of 0.21, confirming its superior efficiency and stability under complex LLM inference workloads. These results highlight ACPC's effectiveness in recognizing useful cache lines and mitigating redundant prefetches under dynamic LLM access behaviors. The proposed approach provides a scalable, learning-driven solution for optimizing memory efficiency and latency in large-scale LLM serving and inference systems.

2025-12-16T07:16:10Z Songze Liu Hongkun Du Shaowen Wang http://arxiv.org/abs/2512.07108v2 Scheduling in Quantum Satellite Networks: Fairness and Performance Optimization 2025-12-15T17:56:43Z

Quantum satellite networks offer a promising solution for achieving long-distance quantum communication by enabling entanglement distribution across global scales. This work formulates and solves the quantum satellite network scheduling problem by optimizing satellite-to-ground station pair assignments under realistic system and environmental constraints. Our framework accounts for limited satellite and ground station resources, fairness, entanglement fidelity thresholds, and real world non-idealities including atmospheric losses, weather and background noise. In addition, we incorporate the complexities of multi-satellite relays enabled via inter-satellite links. We propose an integer linear programming (ILP) based optimization framework that supports multiple scheduling objectives, allowing us to analyze tradeoffs between maximizing total entanglement distribution rate and ensuring fairness across ground station pairs. Our framework can also be used as a benchmark tool to measure the performance of other potential transmission scheduling policies.

2025-12-08T02:44:14Z Ashutosh Jayant Dikshit Naga Lakshmi Anipeddi Prajit Dhara Saikat Guha Deirdre Kilbane Leandros Tassiulas Don Towsley Nitish K. Panigrahy http://arxiv.org/abs/2505.08091v2 LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping 2025-12-15T16:10:37Z

We describe LEGO, a new approach to optimizing data movement whereby code is expressed as a layout-independent computation and composed with layouts for data and computation. This code generator organization derives complex indexing expressions associated with hierarchical parallel code and data movement for GPUs. LEGO maps from layout specification to indexing expressions, and can be integrated into existing compilers and code templates. It facilitates the exploration of data layouts in combination with other optimizations. We demonstrate LEGO's integration with the Triton and MLIR compilers, and with CUDA templates. We show that LEGO is capable of deriving performance competitive with Triton, and shows broad applicability for data and thread layout mapping optimizations in its integration with CUDA and MLIR.

2025-05-12T21:53:09Z Amir Mohammad Tavakkoli Cosmin Oancea Mary Hall http://arxiv.org/abs/2306.04531v2 Comparison of SeDuMi and SDPT3 Solvers for Stability of Continuous-time Linear System 2025-12-15T11:08:07Z

SeDuMi and SDPT3 are two solvers for solving Semi-definite Programming (SDP) or Linear Matrix Inequality (LMI) problems. A computational performance comparison of these two are undertaken in this paper regarding the Stability of Continuous-time Linear Systems. The comparison mainly focuses on computational times and memory requirements for different scales of problems. To implement and compare the two solvers on a set of well-posed problems, we employ YALMIP, a widely used toolbox for modeling and optimization in MATLAB. The primary goal of this study is to provide an empirical assessment of the relative computational efficiency of SeDuMi and SDPT3 under varying problem conditions. Our evaluation indicates that SDPT3 performs much better in large-scale, high-precision calculations.

2023-06-07T15:40:15Z This version will be incorporated into another work Guangda Xu http://arxiv.org/abs/2512.13176v1 EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC 2025-12-15T10:37:48Z

Resource disaggregation is a promising technique for improving the efficiency of large-scale computing systems. However, this comes at the cost of increased memory access latency due to the need to rely on the network fabric to transfer data between remote nodes. As such, it is crucial to ascertain an application's memory latency sensitivity to minimize the overall performance impact. Existing tools for measuring memory latency sensitivity often rely on custom ad-hoc hardware or cycle-accurate simulators, which can be inflexible and time-consuming. To address this, we present EDAN (Execution DAG Analyzer), a novel performance analysis tool that leverages an application's runtime instruction trace to generate its corresponding execution DAG. This approach allows us to estimate the latency sensitivity of sequential programs and investigate the impact of different hardware configurations. EDAN not only provides us with the capability of calculating the theoretical bounds for performance metrics, but it also helps us gain insight into the memory-level parallelism inherent to HPC applications. We apply EDAN to applications and benchmarks such as PolyBench, HPCG, and LULESH to unveil the characteristics of their intrinsic memory-level parallelism and latency sensitivity.

2025-12-15T10:37:48Z Proc. 39th ACM International Conference on Supercomputing, 2025 Siyuan Shen Mikhail Khalilov Lukas Gianinazzi Timo Schneider Marcin Chrapek Jai Dayal Manisha Gajbe Robert Wisniewski Torsten Hoefler 10.1145/3721145.3734530 http://arxiv.org/abs/2512.22147v1 GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs 2025-12-15T07:20:15Z

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large applications where full builds and runs are expensive. We present an end-to-end LLM framework with performance feedback that optimizes kernels without building the full application. From independently extracted hotspot kernels, it automatically completes code into a Minimal Executable Program (MEP), then performs multi-round iterative optimization and evaluation outside the full application. The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost. Optimized variants are reintegrated into the original application for validation. We evaluate on NVIDIA GPUs and the Haiguang Deep Computing Unit (DCU) platform (AMD-licensed architecture) using PolyBench, the AMD APP SDK, and hotspot kernels from large-scale supercomputing applications. The method achieves average speedups of 5.05x (PolyBench on NVIDIA), 7.77x (PolyBench on DCU), 1.77x (AMD APP SDK), and 1.25x on three hotspot kernels, surpassing direct LLM optimization. The approach requires no full-source dependencies, offers cross-platform portability, and enables practical, low-cost GPU kernel optimization.

2025-12-15T07:20:15Z Ruifan Chu Anbang Wang Xiuxiu Bai Shuai Liu Xiaoshe Dong http://arxiv.org/abs/2512.12801v1 Fine-Grained Energy Prediction For Parallellized LLM Inference With PIE-P 2025-12-14T18:50:51Z

With the widespread adoption of Large Language Models (LLMs), energy costs of running LLMs is quickly becoming a critical concern. However, precisely measuring the energy consumption of LLMs is often infeasible because hardware-based power monitors are not always accessible and software-based energy measurement tools are not accurate. While various prediction techniques have been developed to estimate LLM energy consumption, these approaches are limited to single-GPU environments and thus are not applicable to modern LLM inference which is typically parallelized across multiple GPUs. In this work, we remedy this gap and introduce PIE-P, a fine-grained energy prediction framework for multi-GPU inference, including tensor, pipeline, and data parallelism. Predicting the energy under parallelized inference is complicated by the non-determinism in inter-GPU communication, additional communication overheads, and difficulties in isolating energy during the communication/synchronization phase. We develop a scalable prediction framework that addresses these issues via precise sampling, fine-grained modeling of inter-GPU communication, and careful accounting of parallelization overhead. Our evaluation results show that PIE-P yields accurate and fine-grained energy predictions across parallelism strategies, significantly outperforming baselines.

2025-12-14T18:50:51Z Anurag Dutt Young Won Choi Avirup Sil Anshul Gandhi Aruna Balasubramanian Niranjan Balasubramanian http://arxiv.org/abs/2512.12314v1 Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo 2025-12-13T13:08:05Z

While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo ("Astronomy Shop") using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the overall availability degradation curve, while asynchronous semantics for Kafka edges change predicted availabilities by at most about 10^(-5) (0.001 percentage points). This null result suggests that for immediate HTTP availability in this case study, explicitly modeling asynchronous dependencies is not warranted, and a simpler connectivity-only model is sufficient.

2025-12-13T13:08:05Z Anatoly A. Krasnovsky http://arxiv.org/abs/2512.12008v1 Hold Onto That Thought: Assessing KV Cache Compression On Reasoning 2025-12-12T19:50:34Z

Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.

2025-12-12T19:50:34Z Minghui Liu Aadi Palnitkar Tahseen Rabbani Hyunwoo Jae Kyle Rui Sang Dixi Yao Shayan Shabihi Fuheng Zhao Tian Li Ce Zhang Furong Huang Kunpeng Zhang http://arxiv.org/abs/2512.15766v1 LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models 2025-12-12T11:09:48Z

Loop transformations are semantics-preserving optimization techniques, widely used to maximize objectives such as parallelism. Despite decades of research, applying the optimal composition of loop transformations remains challenging due to inherent complexities, including cost modeling for optimization objectives. Recent studies have explored the potential of Large Language Models (LLMs) for code optimization. However, our key observation is that LLMs often struggle with effective loop transformation optimization, frequently leading to errors or suboptimal optimization, thereby missing opportunities for performance improvements. To bridge this gap, we propose LOOPRAG, a novel retrieval-augmented generation framework designed to guide LLMs in performing effective loop optimization on Static Control Part. We introduce a parameter-driven method to harness loop properties, which trigger various loop transformations, and generate diverse yet legal example codes serving as a demonstration source. To effectively obtain the most informative demonstrations, we propose a loop-aware algorithm based on loop features, which balances similarity and diversity for code retrieval. To enhance correct and efficient code generation, we introduce a feedback-based iterative mechanism that incorporates compilation, testing and performance results as feedback to guide LLMs. Each optimized code undergoes mutation, coverage and differential testing for equivalence checking. We evaluate LOOPRAG on PolyBench, TSVC and LORE benchmark suites, and compare it against compilers (GCC-Graphite, Clang-Polly, Perspective and ICX) and representative LLMs (DeepSeek and GPT-4). The results demonstrate average speedups over base compilers of up to 11.20$\times$, 14.34$\times$, and 9.29$\times$ for PolyBench, TSVC, and LORE, respectively, and speedups over base LLMs of up to 11.97$\times$, 5.61$\times$, and 11.59$\times$.

2025-12-12T11:09:48Z Accepted to ASPLOS 2026 Yijie Zhi Yayu Cao Jianhua Dai Xiaoyang Han Jingwen Pu Qingran Wu Sheng Cheng Ming Cai http://arxiv.org/abs/2512.15764v1 AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs 2025-12-12T09:44:07Z

Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.

2025-12-12T09:44:07Z Anshul Kumar Gagan Raj Gupta Manisha Chawla http://arxiv.org/abs/2512.09800v1 Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers 2025-12-10T16:13:29Z

Low-power microcontroller (MCU) hardware is currently evolving from single-core architectures to predominantly multi-core architectures. In parallel, new embedded software building blocks are more and more written in Rust, while C/C++ dominance fades in this domain. On the other hand, small artificial neural networks (ANN) of various kinds are increasingly deployed in edge AI use cases, thus deployed and executed directly on low-power MCUs. In this context, both incremental improvements and novel innovative services will have to be continuously retrofitted using ANNs execution in software embedded on sensing/actuating systems already deployed in the field. However, there was so far no Rust embedded software platform automating parallelization for inference computation on multi-core MCUs executing arbitrary TinyML models. This paper thus fills this gap by introducing Ariel-ML, a novel toolkit we designed combining a generic TinyML pipeline and an embedded Rust software platform which can take full advantage of multi-core capabilities of various 32bit microcontroller families (Arm Cortex-M, RISC-V, ESP-32). We published the full open source code of its implementation, which we used to benchmark its capabilities using a zoo of various TinyML models. We show that Ariel-ML outperforms prior art in terms of inference latency as expected, and we show that, compared to pre-existing toolkits using embedded C/C++, Ariel-ML achieves comparable memory footprints. Ariel-ML thus provides a useful basis for TinyML practitioners and resource-constrained embedded Rust developers.

2025-12-10T16:13:29Z Zhaolan Huang Kaspar Schleiser Gyungmin Myung Emmanuel Baccelli http://arxiv.org/abs/2512.09786v1 TinyDéjàVu: Smaller Memory Footprint & Faster Inference on Sensor Data Streams with Always-On Microcontrollers 2025-12-10T16:07:17Z

Always-on sensors are increasingly expected to embark a variety of tiny neural networks and to continuously perform inference on time-series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware uses microcontrollers (MCUs) with tiny memory budget e.g., 128kB of RAM. In this context, optimizing data flows across neural network layers becomes crucial. In this paper, we introduce TinyDéjàVu, a new framework and novel algorithms we designed to drastically reduce the RAM footprint required by inference using various tiny ML models for sensor data time-series on typical microcontroller hardware. We publish the implementation of TinyDéjàVu as open source, and we perform reproducible benchmarks on hardware. We show that TinyDéjàVu can save more than 60% of RAM usage and eliminate up to 90% of redundant compute on overlapping sliding window inputs.

2025-12-10T16:07:17Z Zhaolan Huang Emmanuel Baccelli