https://arxiv.org/api/GUrtJ2/b9F7CBrStaBAkQreuhsg 2026-03-26T15:35:02Z 5077 195 15 http://arxiv.org/abs/2404.00346v2 Asymptotically Optimal Scheduling of Multiple Parallelizable Job Classes 2025-12-29T01:27:42Z Modern computing workloads are often composed of parallelizable jobs. A parallelizable job can be completed more quickly when run on additional servers. However, each job can only use a limited number of servers, known as its parallelizability level, which is determined by the type of computation the job performs and how it is implemented. Workloads generally consist of multiple job classes, where jobs from different classes have different parallelizability levels and follow different job size (service requirement) distributions. This paper considers scheduling parallelizable jobs belonging to an arbitrary number of job classes. Given a limited number of servers, we must allocate servers across a stream of arriving jobs to minimize mean response time -- the average time from when a job arrives to the system until it completes. We find that in lighter-load scaling regimes (i.e., Sub-Halfin-Whitt), the optimal allocation policy is Least-Parallelizable-First (LPF), which prioritizes jobs from the least parallelizable job classes regardless of their size distributions. By contrast, we find that in the heavier-load regimes (i.e., Super-NDS), the optimal allocation policy prioritizes jobs with the Shortest Expected Remaining Processing Time (SERPT). We also develop policies that are asymptotically optimal when the scaling regime is not known a priori. 2024-03-30T12:50:31Z Benjamin Berg Benjamin Moseley Weina Wang Mor Harchol-Balter http://arxiv.org/abs/2510.10209v2 LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization 2025-12-27T16:08:33Z The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism)to a ground truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization. 2025-10-11T13:27:02Z Massinissa Merouani Afif Boudaoud Riyadh Baghdadi http://arxiv.org/abs/2511.00592v2 Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization 2025-12-27T10:04:35Z Automatic code optimization remains a difficult challenge, particularly for complex loop nests on modern hardware. This paper investigates a novel approach to code optimization where Large Language Models (LLMs) guide the process through a closed-loop interaction with a compiler. We present ComPilot, an experimental framework that leverages off-the-shelf LLMs, without any task-specific fine-tuning, as interactive optimization agents. ComPilot establishes a feedback loop where an LLM proposes transformations for a given loop nest to a compiler. The compiler attempts the transformations, reporting back legality status and measured speedup or slowdown. The LLM utilizes this concrete feedback to iteratively refine its optimization strategy. Our extensive evaluation across the PolyBench benchmark suite demonstrates the effectiveness of this zero-shot approach. ComPilot achieves geometric mean speedups of 2.66x (single run) and 3.54x (best-of-5 runs) over the original code. Furthermore, ComPilot demonstrates competitive performance against the state-of-the-art Pluto polyhedral optimizer, outperforming it in many cases. This experimental study demonstrates that general-purpose LLMs can effectively guide the code optimization process when grounded by compiler feedback, opening promising research directions for agentic AI in code optimization. 2025-11-01T15:32:34Z Accepted at the 34th International Conference on Parallel Architectures and Compilation Techniques (PACT 2025). 12 pages, plus appendix 2025 34th International Conference on Parallel Architectures and Compilation Techniques (PACT) Massinissa Merouani Islem Kara Bernou Riyadh Baghdadi 10.1109/PACT65351.2025.00027 http://arxiv.org/abs/2512.22066v1 Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling 2025-12-26T15:42:29Z Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of LLM inference, focusing on the distinct behaviors of the compute-bound prefill and memory-bound decode phases. Our simulation methodology combines OpenRAM for energy modeling, LLMCompass for latency simulation, and ScaleSIM for systolic array operational intensity. Our findings show that total energy use is predominantly determined by SRAM size in both phases, with larger buffers significantly increasing static energy due to leakage, which is not offset by corresponding latency benefits. We quantitatively explore the memory-bandwidth bottleneck, demonstrating that while high operating frequencies reduce prefill latency, their positive impact on memory-bound decode latency is capped by the external memory bandwidth. Counter-intuitively, high compute frequency can reduce total energy by reducing execution time and consequently decreasing static energy consumption more than the resulting dynamic power increase. We identify an optimal hardware configuration for the simulated workload: high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB. This combination achieves the best energy-delay product, balancing low latency with high energy efficiency. Furthermore, we demonstrate how memory bandwidth acts as a performance ceiling, and that increasing compute frequency only yields performance gains up to the point where the workload becomes memory-bound. This analysis provides concrete architectural insights for designing energy-efficient LLM accelerators, especially for datacenters aiming to minimize their energy overhead. 2025-12-26T15:42:29Z Hannah Atmer Yuan Yao Thiemo Voigt Stefanos Kaxiras http://arxiv.org/abs/2506.04049v3 WANDER: An Explainable Decision-Support Framework for HPC 2025-12-25T06:09:30Z High-performance computing (HPC) systems expose many interdependent configuration knobs that impact runtime, resource usage, power, and variability. Existing predictive tools model these outcomes, but do not support structured exploration, explanation, or guided reconfiguration. We present WANDER, a decision-support framework that synthesizes alternate configurations using counterfactual analysis aligned with user goals and constraints. We introduce a composite trade-off score that ranks suggestions based on prediction uncertainty, consistency between feature-target relationships using causal models, and similarity between feature distributions against historical data. To our knowledge, WANDER is the first such system to unify prediction, exploration, and explanation for HPC tuning under a common query interface. Across multiple datasets WANDER generates interpretable and trustworthy, human-readable alternatives that guide users to achieve their performance objectives. 2025-06-04T15:15:23Z Ankur Lahiry Banooqa Banday Yugesh Bhattarai Tanzima Z. Islam http://arxiv.org/abs/2512.21433v1 DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction 2025-12-24T21:46:17Z Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations. In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that is generalizable to different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the light-weight metrics prediction, enabling efficient training and modular inference; 3) We optimize the model performance on time-evolving data using a mixture-of-experts design. Such a design enhances the robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10\% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis. 2025-12-24T21:46:17Z Khondoker Mirazul Mumenin Robert Underwood Dong Dai Jinzhen Wang Sheng Di Zarija Lukić Franck Cappello http://arxiv.org/abs/2512.21010v1 LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics 2025-12-24T07:14:31Z The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Primarily using static scoring, current evaluation methods are fundamentally limited. They struggle to determine the proper mix ratio across diverse benchmarks, and critically, they fail to capture a model's dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest where models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record. And Monte Carlo Simulation ($N=100,000$ iterations) is used to approximate the statistically robust Expected Win Score ($E[S_m]$), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity ($T_k$), which allows us to profile models based on their risk appetite--distinguishing between robust generalists and aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation. 2025-12-24T07:14:31Z 18 pages Jiashuo Liu Jiayun Wu Chunjie Wu Jingkai Liu Zaiyuan Wang Huan Zhou Wenhao Huang Hongseok Namkoong http://arxiv.org/abs/2512.20243v1 Post-Quantum Cryptography in the 5G Core 2025-12-23T10:53:32Z In this work, the conventional cryptographic algorithms used in the 5G Core are replaced with post-quantum alternatives and the practical impact of this transition is evaluated. Using a simulation environment, we model the registration and deregistration of varying numbers of user equipments (UEs) and measure the resulting effects on bandwidth consumption and latency. Our results show that the deployment of post-quantum cryptographic algorithms has a measurable effect on performance, but that this effect is small, and perhaps more crucially, that the extra overhead needed in terms of computation and bandwidth does not have any substantial impact on the usability of the network and the efficiency of its network functions. Overall the experimental results in this work corroborate earlier research: the 5G Core is technically able to support post-quantum cryptography without any inherent issues connected to the increased computational overhead or larger message size. 2025-12-23T10:53:32Z 11 pages, 7 figures, 2 tables Thomas Attema Bor de Kock Sandesh Manganahalli Jayaprakash Dimitrios Schoinianakis Thom Sijpesteijn Rintse van de Vlasakker http://arxiv.org/abs/2512.20178v1 SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication 2025-12-23T09:16:52Z Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in numerous high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM lies in the substantial communication overhead, which limits both performance and scalability. In this paper, we identify and analyze sources of inefficient communication in existing distributed SpMM implementations at two levels and address these inefficiencies by proposing: (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that integrates the sparsity-aware strategy with the common two-tier network architectures in GPU-accelerated systems, to reduce redundant communication across slow network links. We implement these optimizations in a comprehensive distributed SpMM framework, \method{}. Extensive evaluations on real-world datasets show that our framework demonstrates strong scalability up to 128 GPUs, achieving geometric mean speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively) at this scale. 2025-12-23T09:16:52Z Under Review Chen Zhuang Lingqi Zhang Benjamin Brock Du Wu Peng Chen Toshio Endo Satoshi Matsuoka Mohamed Wahib http://arxiv.org/abs/2509.23410v3 PATCH: Learnable Tile-level Hybrid Sparsity for LLMs 2025-12-22T19:09:57Z Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM. 2025-09-27T16:57:28Z Younes Hourri Mohammad Mozaffari Maryam Mehri Dehnavi http://arxiv.org/abs/2512.19606v1 RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference 2025-12-22T17:42:51Z RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/ L2/ HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism and ZeRO/FDSP sharding policies. Across A100-based validation cases, RAPID-LLM predicts Llama inference step latency and GPT-scale training time per batch within 10.4\% relative to published measurements, and matches ns-3 packet-level results within 8\% on representative communication workloads. Case studies demonstrate how RAPID-LLM enables fast, exhaustive sweeps over hybrid-parallel configurations, quantifies sensitivity to soft link faults under realistic routing and congestion, and evaluates hypothetical GPU design variants including HBM bandwidth throttling effects. 2025-12-22T17:42:51Z 11 pages, 12 figures George Karfakis Faraz Tahmasebi Binglu Chen Lime Yao Saptarshi Mitra Tianyue Pan Hyoukjun Kwon Puneet Gupta http://arxiv.org/abs/2509.06716v2 Efficiently Ranking Software Variants with Minimal Benchmarks 2025-12-22T10:17:59Z Benchmarking is a common practice in software engineering to assess the qualities and performance of software variants, coming from multiple competing systems or from configurations of the same system. Benchmarks are used notably to compare and understand variant performance, fine-tune software, detect regressions, or design new software systems. The execution of benchmarks to get a complete picture of software variants is highly costly in terms of computational resources and time. In this paper, we propose a novel approach for reducing benchmarks while maintaining stable rankings, using test suite optimization techniques. That is, we remove instances from the benchmarks while trying to keep the same rankings of the variants on all tests. Our method, BISection Sampling, BISS, strategically retains the most critical tests and applies a novel divide-and-conquer approach to efficiently sample among relevant remaining tests. We experiment with datasets and use cases from LLM leaderboards, SAT competitions, and configurable systems for performance modeling. Our results show that our method outperforms baselines even when operating on a subset of variants. Using BISS, we reduce the computational cost of the benchmarks on average to 44% and on more than half the benchmarks by up to 99% without loss in ranking stability. 2025-09-08T14:11:35Z Théo Matricon Mathieu Acher Helge Spieker Arnaud Gotlieb http://arxiv.org/abs/2506.22714v2 Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication 2025-12-22T05:47:00Z Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor Core Units (TCUs) and CUDA cores to accelerate sparse operators. The former excels at structured matrix computations, whereas the latter offers greater programming flexibility. However, how to combine these two resources to maximize sparse-operator performance remains unclear. In this work, we first identify the source of performance gains in hybrid computation and systematically analyze their complementary strengths. Motivated by this, we propose Libra, a holistic framework that efficiently leverages heterogeneous computing resources to accelerate both SpMM and SDDMM operators. Specifically, Libra introduces a 2D-aware (locality and utilization) workload distribution method to precisely identify the optimal task mapping, simultaneously leveraging the data reuse capabilities of TCUs and the flexibility of CUDA cores to minimize computational redundancy. Libra further incorporates hybrid load balancing, occupancy-aware task scheduling, and efficient kernel implementations to maximize execution efficiency. Extensive experiments on H100 and RTX 4090 GPUs demonstrate that Libra surpasses all the 12 up-to-date baselines significantly, e.g., on average 1.77x speedup over FlashSparse, 1.73x over RoDe, and 2.9x over DGL for end-to-end GNN applications. Libra opens up a new perspective for sparse operator acceleration by fully unleashing the power of heterogeneous GPU resources. 2025-06-28T01:50:13Z Jinliang Shi Shigang Li Youxuan Xu Xueying Wang Rongtian Fu Zhi Ma Tong Wu http://arxiv.org/abs/2512.18457v1 Age of Information with Age-Dependent Server Selection 2025-12-20T18:23:10Z In this paper, we consider a single-source multi-server generate-at-will discrete-time non-preemptive status update system where update packets are transmitted using {\em only one} of the available servers, according to a server selection policy. In particular, when a transmission is complete, the update system makes a threshold-based decision on whether to wait or transmit, and if latter, which server to use for transmissions, on the basis of the instantaneous value of the age of information (AoI) process. In our setting, servers have general heterogeneous discrete phase-type (DPH) distributed service times, and also heterogeneous transmission costs. The goal is to find an age-dependent multi-threshold policy that minimizes the AoI cost with a constraint on transmission costs, the former cost defined in terms of the time average of an arbitrary function of AoI. For this purpose, we propose a novel tool called \emph{multi-regime absorbing Markov chain} (MR-AMC) in discrete time. Using the MR-AMC framework, we exactly obtain the distribution of AoI, and subsequently the costs associated with AoI and transmissions. With the exact analysis in hand, optimum thresholds can be obtained in the case of a few servers, by exhaustive search. We validate the proposed analytical model, and also demonstrate the benefits of age-dependent server selection, with numerical examples. 2025-12-20T18:23:10Z 11 pages, 6 figures, preliminary version presented at Asilomar Conference, 2025 Nail Akar Ismail Cosandal Sennur Ulukus http://arxiv.org/abs/2508.13057v6 Hierarchical Evaluation Function: A Multi-Metric Approach for Optimizing Demand Forecasting Models 2025-12-20T00:56:55Z Demand forecasting in competitive, uncertain business environments requires models that can integrate multiple evaluation perspectives rather than being restricted to hyperparameter optimization based on a single metric. This traditional approach tends to prioritize one error indicator, which can bias results when metrics provide contradictory signals. In this context, the Hierarchical Evaluation Function (HEF) is proposed as a multi-metric framework for hyperparameter optimization that integrates explanatory power (R2), sensitivity to extreme errors (RMSE), and average accuracy (MAE). The performance of HEF was assessed using four widely recognized benchmark datasets in the forecasting domain: Walmart, M3, M4, and M5. Prediction models were optimized through Grid Search, Particle Swarm Optimization (PSO), and Optuna, and statistical analyses based on difference-of-proportions tests confirmed that HEF delivers superior results compared to a unimetric reference function, regardless of the optimizer employed, with particular relevance for heterogeneous monthly time series (M3) and highly granular daily demand scenarios (M5). The findings demonstrate that HEF improves stability, generalization, and robustness at low computational cost, consolidating its role as a reliable evaluation framework that enhances model selection, enables more accurate demand forecasts, and supports decision-making in dynamic, competitive business environments. 2025-08-18T16:25:49Z 31 pages, 15 figures, 25 tables. Submitted as a preprint. The manuscript introduces the Hierarchical Evaluation Function, a multi-metric framework for optimizing demand forecasting models under high uncertainty. Includes extensive experimental validation using real-world datasets and a comparative analysis against classical and modern methods Adolfo González Víctor Parada