https://arxiv.org/api/GUrtJ2/b9F7CBrStaBAkQreuhsg
Updated: 2026-03-26T15:35:02Z
Total results: 5077; start index: 195; items per page: 15

http://arxiv.org/abs/2404.00346v2
Asymptotically Optimal Scheduling of Multiple Parallelizable Job Classes
Updated: 2025-12-29T01:27:42Z
Modern computing workloads are often composed of parallelizable jobs. A parallelizable job can be completed more quickly when run on additional servers. However, each job can only use a limited number of servers, known as its parallelizability level, which is determined by the type of computation the job performs and how it is implemented. Workloads generally consist of multiple job classes, where jobs from different classes have different parallelizability levels and follow different job size (service requirement) distributions.
This paper considers scheduling parallelizable jobs belonging to an arbitrary number of job classes. Given a limited number of servers, we must allocate servers across a stream of arriving jobs to minimize mean response time -- the average time from when a job arrives to the system until it completes. We find that in lighter-load scaling regimes (i.e., Sub-Halfin-Whitt), the optimal allocation policy is Least-Parallelizable-First (LPF), which prioritizes jobs from the least parallelizable job classes regardless of their size distributions. By contrast, we find that in the heavier-load regimes (i.e., Super-NDS), the optimal allocation policy prioritizes jobs with the Shortest Expected Remaining Processing Time (SERPT). We also develop policies that are asymptotically optimal when the scaling regime is not known a priori.
Published: 2024-03-30T12:50:31Z
Authors: Benjamin Berg, Benjamin Moseley, Weina Wang, Mor Harchol-Balter

http://arxiv.org/abs/2510.10209v2
LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization
Updated: 2025-12-27T16:08:33Z
The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research in learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism) to a ground truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling.
The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.
Published: 2025-10-11T13:27:02Z
Authors: Massinissa Merouani, Afif Boudaoud, Riyadh Baghdadi

http://arxiv.org/abs/2511.00592v2
Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization
Updated: 2025-12-27T10:04:35Z
Automatic code optimization remains a difficult challenge, particularly for complex loop nests on modern hardware. This paper investigates a novel approach to code optimization where Large Language Models (LLMs) guide the process through a closed-loop interaction with a compiler. We present ComPilot, an experimental framework that leverages off-the-shelf LLMs, without any task-specific fine-tuning, as interactive optimization agents. ComPilot establishes a feedback loop where an LLM proposes transformations for a given loop nest to a compiler. The compiler attempts the transformations, reporting back legality status and measured speedup or slowdown. The LLM utilizes this concrete feedback to iteratively refine its optimization strategy. Our extensive evaluation across the PolyBench benchmark suite demonstrates the effectiveness of this zero-shot approach. ComPilot achieves geometric mean speedups of 2.66x (single run) and 3.54x (best-of-5 runs) over the original code. Furthermore, ComPilot demonstrates competitive performance against the state-of-the-art Pluto polyhedral optimizer, outperforming it in many cases. This experimental study demonstrates that general-purpose LLMs can effectively guide the code optimization process when grounded by compiler feedback, opening promising research directions for agentic AI in code optimization.
Published: 2025-11-01T15:32:34Z
Comment: Accepted at the 34th International Conference on Parallel Architectures and Compilation Techniques (PACT 2025).
12 pages, plus appendix
Journal reference: 2025 34th International Conference on Parallel Architectures and Compilation Techniques (PACT)
Authors: Massinissa Merouani, Islem Kara Bernou, Riyadh Baghdadi
DOI: 10.1109/PACT65351.2025.00027

http://arxiv.org/abs/2512.22066v1
Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling
Updated: 2025-12-26T15:42:29Z
Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of LLM inference, focusing on the distinct behaviors of the compute-bound prefill and memory-bound decode phases. Our simulation methodology combines OpenRAM for energy modeling, LLMCompass for latency simulation, and ScaleSIM for systolic array operational intensity. Our findings show that total energy use is predominantly determined by SRAM size in both phases, with larger buffers significantly increasing static energy due to leakage, which is not offset by corresponding latency benefits. We quantitatively explore the memory-bandwidth bottleneck, demonstrating that while high operating frequencies reduce prefill latency, their positive impact on memory-bound decode latency is capped by the external memory bandwidth. Counter-intuitively, high compute frequency can reduce total energy by reducing execution time and consequently decreasing static energy consumption more than the resulting dynamic power increase. We identify an optimal hardware configuration for the simulated workload: high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB. This combination achieves the best energy-delay product, balancing low latency with high energy efficiency. Furthermore, we demonstrate how memory bandwidth acts as a performance ceiling, and that increasing compute frequency only yields performance gains up to the point where the workload becomes memory-bound.
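The memory-bandwidth ceiling described above can be illustrated with a roofline-style estimate. This is only a minimal sketch, not the paper's simulation methodology (which combines OpenRAM, LLMCompass, and ScaleSIM); the function name `attainable_gflops` and every numeric value (FLOPs per cycle, bandwidth, operational intensities) are hypothetical placeholders chosen for illustration.

```python
def attainable_gflops(freq_mhz, flops_per_cycle, mem_bw_gbs, op_intensity):
    """Roofline estimate: the lower of the compute peak and the memory ceiling.

    op_intensity is in FLOPs per byte moved from external memory.
    """
    compute_peak = freq_mhz * 1e6 * flops_per_cycle / 1e9  # GFLOP/s
    memory_ceiling = mem_bw_gbs * op_intensity              # GFLOP/s
    return min(compute_peak, memory_ceiling)


# Hypothetical accelerator: 1024 FLOPs/cycle and 100 GB/s of external bandwidth.
# Assumed operational intensities: 200 FLOPs/byte (prefill), 2 FLOPs/byte (decode).
for freq in (400, 800, 1200, 1400):
    prefill = attainable_gflops(freq, 1024, 100.0, op_intensity=200.0)
    decode = attainable_gflops(freq, 1024, 100.0, op_intensity=2.0)
    print(f"{freq} MHz: prefill {prefill:.1f} GFLOP/s, decode {decode:.1f} GFLOP/s")
```

With these assumed numbers, the prefill estimate keeps growing with frequency, while the decode estimate stays pinned at the memory ceiling, which is the qualitative behavior the abstract reports.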
This analysis provides concrete architectural insights for designing energy-efficient LLM accelerators, especially for datacenters aiming to minimize their energy overhead.
Published: 2025-12-26T15:42:29Z
Authors: Hannah Atmer, Yuan Yao, Thiemo Voigt, Stefanos Kaxiras

http://arxiv.org/abs/2506.04049v3
WANDER: An Explainable Decision-Support Framework for HPC
Updated: 2025-12-25T06:09:30Z
High-performance computing (HPC) systems expose many interdependent configuration knobs that impact runtime, resource usage, power, and variability. Existing predictive tools model these outcomes, but do not support structured exploration, explanation, or guided reconfiguration. We present WANDER, a decision-support framework that synthesizes alternate configurations using counterfactual analysis aligned with user goals and constraints. We introduce a composite trade-off score that ranks suggestions based on prediction uncertainty, consistency between feature-target relationships using causal models, and similarity between feature distributions against historical data. To our knowledge, WANDER is the first such system to unify prediction, exploration, and explanation for HPC tuning under a common query interface. Across multiple datasets WANDER generates interpretable and trustworthy, human-readable alternatives that guide users to achieve their performance objectives.
Published: 2025-06-04T15:15:23Z
Authors: Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Tanzima Z. Islam

http://arxiv.org/abs/2512.21433v1
DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction
Updated: 2025-12-24T21:46:17Z
Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations.
In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that is generalizable to different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the light-weight metrics prediction, enabling efficient training and modular inference; 3) We optimize the model performance on time-evolving data using a mixture-of-experts design. Such a design enhances the robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis.
Published: 2025-12-24T21:46:17Z
Authors: Khondoker Mirazul Mumenin, Robert Underwood, Dong Dai, Jinzhen Wang, Sheng Di, Zarija Lukić, Franck Cappello

http://arxiv.org/abs/2512.21010v1
LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
Updated: 2025-12-24T07:14:31Z
The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Primarily using static scoring, current evaluation methods are fundamentally limited.
They struggle to determine the proper mix ratio across diverse benchmarks, and critically, they fail to capture a model's dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest where models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record. Monte Carlo simulation ($N=100,000$ iterations) is used to approximate the statistically robust Expected Win Score ($E[S_m]$), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity ($T_k$), which allows us to profile models based on their risk appetite--distinguishing between robust generalists and aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
Published: 2025-12-24T07:14:31Z
Comment: 18 pages
Authors: Jiashuo Liu, Jiayun Wu, Chunjie Wu, Jingkai Liu, Zaiyuan Wang, Huan Zhou, Wenhao Huang, Hongseok Namkoong

http://arxiv.org/abs/2512.20243v1
Post-Quantum Cryptography in the 5G Core
Updated: 2025-12-23T10:53:32Z
In this work, the conventional cryptographic algorithms used in the 5G Core are replaced with post-quantum alternatives and the practical impact of this transition is evaluated. Using a simulation environment, we model the registration and deregistration of varying numbers of user equipments (UEs) and measure the resulting effects on bandwidth consumption and latency.
Our results show that the deployment of post-quantum cryptographic algorithms has a measurable effect on performance, but that this effect is small, and perhaps more crucially, that the extra overhead needed in terms of computation and bandwidth does not have any substantial impact on the usability of the network and the efficiency of its network functions.
Overall, the experimental results in this work corroborate earlier research: the 5G Core is technically able to support post-quantum cryptography without any inherent issues connected to the increased computational overhead or larger message size.
Published: 2025-12-23T10:53:32Z
Comment: 11 pages, 7 figures, 2 tables
Authors: Thomas Attema, Bor de Kock, Sandesh Manganahalli Jayaprakash, Dimitrios Schoinianakis, Thom Sijpesteijn, Rintse van de Vlasakker

http://arxiv.org/abs/2512.20178v1
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication
Updated: 2025-12-23T09:16:52Z
Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in numerous high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM lies in the substantial communication overhead, which limits both performance and scalability. In this paper, we identify and analyze sources of inefficient communication in existing distributed SpMM implementations at two levels and address these inefficiencies by proposing: (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that integrates the sparsity-aware strategy with the common two-tier network architectures in GPU-accelerated systems, to reduce redundant communication across slow network links. We implement these optimizations in a comprehensive distributed SpMM framework, SHIRO.
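The general idea behind sparsity-aware communication in distributed SpMM can be sketched as follows. This is an illustrative assumption about the setting, not SHIRO's actual algorithm: the dense matrix B is assumed to be split into contiguous row blocks across ranks, and a rank fetches only the rows of B whose indices appear as column indices of nonzeros in its local block of the sparse matrix A. The function name `needed_rows_by_owner` and the parameter `rows_per_rank` are hypothetical.

```python
def needed_rows_by_owner(local_col_indices, rows_per_rank):
    """Group the distinct B-row indices this rank needs by the rank that owns them.

    Assumes B is partitioned into contiguous blocks of rows_per_rank rows.
    """
    needed = {}
    for col in set(local_col_indices):
        owner = col // rows_per_rank
        needed.setdefault(owner, []).append(col)
    return {owner: sorted(rows) for owner, rows in needed.items()}


# Column indices of the nonzeros in this rank's block of A (as stored in CSR),
# with B's 12 rows split evenly across 4 ranks (3 rows each).
cols = [0, 7, 7, 2, 10, 0]
plan = needed_rows_by_owner(cols, rows_per_rank=3)
rows_fetched = sum(len(rows) for rows in plan.values())
print(plan, rows_fetched)
```

In this toy case only 4 of the 12 rows of B are requested and one rank is not contacted at all, whereas a sparsity-oblivious scheme would move every row.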
Extensive evaluations on real-world datasets show that our framework demonstrates strong scalability up to 128 GPUs, achieving geometric mean speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively) at this scale.
Published: 2025-12-23T09:16:52Z
Comment: Under Review
Authors: Chen Zhuang, Lingqi Zhang, Benjamin Brock, Du Wu, Peng Chen, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib

http://arxiv.org/abs/2509.23410v3
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
Updated: 2025-12-22T19:09:57Z
Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups.
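A small sketch may help make the tile-level dense-or-2:4 choice concrete. It is not PATCH's implementation: in PATCH the per-tile choice is learned, whereas here the tile mask `tile_is_sparse` is simply given, and 2:4 pruning is done by magnitude (keep the two largest-magnitude values in every group of four). The helper names `prune_2_4` and `apply_tile_mask` are hypothetical, and the tile width is assumed to be a multiple of 4.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in each group of 4; zero the rest."""
    out = list(row)
    for start in range(0, len(row), 4):
        group = range(start, min(start + 4, len(row)))
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out


def apply_tile_mask(weights, tile_is_sparse, tile_rows, tile_cols):
    """Apply 2:4 pruning only inside tiles flagged sparse; other tiles stay dense."""
    result = [list(r) for r in weights]
    for r, row in enumerate(weights):
        for c0 in range(0, len(row), tile_cols):
            if tile_is_sparse[r // tile_rows][c0 // tile_cols]:
                result[r][c0:c0 + tile_cols] = prune_2_4(row[c0:c0 + tile_cols])
    return result


weights = [
    [0.1, -0.9, 0.5, 0.2, 0.3, 0.4, -0.1, 0.8],
    [1.0, 0.0, -0.2, 0.7, 0.6, 0.5, 0.4, 0.3],
]
# One row of tiles, two tiles wide: the left tile is 2:4 sparse, the right is dense.
pruned = apply_tile_mask(weights, [[True, False]], tile_rows=2, tile_cols=4)
zero_fraction = sum(v == 0.0 for row in pruned for v in row) / 16
print(pruned, zero_fraction)
```

Marking half of the tiles as 2:4 sparse gives an overall zero fraction of 25% here, which illustrates how mixing tile types yields sparsity ratios between 0% and 50%.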
For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.
Published: 2025-09-27T16:57:28Z
Authors: Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi

http://arxiv.org/abs/2512.19606v1
RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference
Updated: 2025-12-22T17:42:51Z
RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an abstract LLM specification (model shape, batch/sequence settings, training vs. inference, and hybrid parallelism choices) with an extended Astra-Sim backend that executes those traces on explicit multi-dimensional network topologies with congestion-aware routing and support for degraded and faulty links. The frontend assigns per-operator latency using a tile-based model that accounts for SM under-utilization and multi-level memory traffic (SRAM/L2/HBM), and prunes memory-infeasible configurations using an activation-liveness traversal under recomputation, parallelism, and ZeRO/FSDP sharding policies.
Across A100-based validation cases, RAPID-LLM predicts Llama inference step latency and GPT-scale training time per batch within 10.4% relative to published measurements, and matches ns-3 packet-level results within 8% on representative communication workloads. Case studies demonstrate how RAPID-LLM enables fast, exhaustive sweeps over hybrid-parallel configurations, quantifies sensitivity to soft link faults under realistic routing and congestion, and evaluates hypothetical GPU design variants including HBM bandwidth throttling effects.
Published: 2025-12-22T17:42:51Z
Comment: 11 pages, 12 figures
Authors: George Karfakis, Faraz Tahmasebi, Binglu Chen, Lime Yao, Saptarshi Mitra, Tianyue Pan, Hyoukjun Kwon, Puneet Gupta

http://arxiv.org/abs/2509.06716v2
Efficiently Ranking Software Variants with Minimal Benchmarks
Updated: 2025-12-22T10:17:59Z
Benchmarking is a common practice in software engineering to assess the qualities and performance of software variants, coming from multiple competing systems or from configurations of the same system. Benchmarks are used notably to compare and understand variant performance, fine-tune software, detect regressions, or design new software systems. The execution of benchmarks to get a complete picture of software variants is highly costly in terms of computational resources and time. In this paper, we propose a novel approach for reducing benchmarks while maintaining stable rankings, using test suite optimization techniques. That is, we remove instances from the benchmarks while trying to keep the same rankings of the variants on all tests. Our method, BISection Sampling (BISS), strategically retains the most critical tests and applies a novel divide-and-conquer approach to efficiently sample among relevant remaining tests. We experiment with datasets and use cases from LLM leaderboards, SAT competitions, and configurable systems for performance modeling. Our results show that our method outperforms baselines even when operating on a subset of variants.
Using BISS, we reduce the computational cost of the benchmarks on average to 44% and on more than half the benchmarks by up to 99% without loss in ranking stability.
Published: 2025-09-08T14:11:35Z
Authors: Théo Matricon, Mathieu Acher, Helge Spieker, Arnaud Gotlieb

http://arxiv.org/abs/2506.22714v2
Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication
Updated: 2025-12-22T05:47:00Z
Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor Core Units (TCUs) and CUDA cores to accelerate sparse operators. The former excels at structured matrix computations, whereas the latter offers greater programming flexibility. However, how to combine these two resources to maximize sparse-operator performance remains unclear. In this work, we first identify the source of performance gains in hybrid computation and systematically analyze their complementary strengths. Motivated by this, we propose Libra, a holistic framework that efficiently leverages heterogeneous computing resources to accelerate both SpMM and SDDMM operators. Specifically, Libra introduces a 2D-aware (locality and utilization) workload distribution method to precisely identify the optimal task mapping, simultaneously leveraging the data reuse capabilities of TCUs and the flexibility of CUDA cores to minimize computational redundancy. Libra further incorporates hybrid load balancing, occupancy-aware task scheduling, and efficient kernel implementations to maximize execution efficiency. Extensive experiments on H100 and RTX 4090 GPUs demonstrate that Libra significantly outperforms all 12 up-to-date baselines, e.g., on average 1.77x speedup over FlashSparse, 1.73x over RoDe, and 2.9x over DGL for end-to-end GNN applications.
Libra opens up a new perspective for sparse operator acceleration by fully unleashing the power of heterogeneous GPU resources.
Published: 2025-06-28T01:50:13Z
Authors: Jinliang Shi, Shigang Li, Youxuan Xu, Xueying Wang, Rongtian Fu, Zhi Ma, Tong Wu

http://arxiv.org/abs/2512.18457v1
Age of Information with Age-Dependent Server Selection
Updated: 2025-12-20T18:23:10Z
In this paper, we consider a single-source multi-server generate-at-will discrete-time non-preemptive status update system where update packets are transmitted using only one of the available servers, according to a server selection policy. In particular, when a transmission is complete, the update system makes a threshold-based decision on whether to wait or transmit, and if the latter, which server to use for transmission, on the basis of the instantaneous value of the age of information (AoI) process. In our setting, servers have general heterogeneous discrete phase-type (DPH) distributed service times, and also heterogeneous transmission costs. The goal is to find an age-dependent multi-threshold policy that minimizes the AoI cost with a constraint on transmission costs, the former cost defined in terms of the time average of an arbitrary function of AoI. For this purpose, we propose a novel tool called multi-regime absorbing Markov chain (MR-AMC) in discrete time. Using the MR-AMC framework, we exactly obtain the distribution of AoI, and subsequently the costs associated with AoI and transmissions. With the exact analysis in hand, optimum thresholds can be obtained in the case of a few servers, by exhaustive search.
We validate the proposed analytical model, and also demonstrate the benefits of age-dependent server selection, with numerical examples.
Published: 2025-12-20T18:23:10Z
Comment: 11 pages, 6 figures, preliminary version presented at Asilomar Conference, 2025
Authors: Nail Akar, Ismail Cosandal, Sennur Ulukus

http://arxiv.org/abs/2508.13057v6
Hierarchical Evaluation Function: A Multi-Metric Approach for Optimizing Demand Forecasting Models
Updated: 2025-12-20T00:56:55Z
Demand forecasting in competitive, uncertain business environments requires models that can integrate multiple evaluation perspectives rather than being restricted to hyperparameter optimization based on a single metric. This traditional approach tends to prioritize one error indicator, which can bias results when metrics provide contradictory signals. In this context, the Hierarchical Evaluation Function (HEF) is proposed as a multi-metric framework for hyperparameter optimization that integrates explanatory power (R2), sensitivity to extreme errors (RMSE), and average accuracy (MAE). The performance of HEF was assessed using four widely recognized benchmark datasets in the forecasting domain: Walmart, M3, M4, and M5. Prediction models were optimized through Grid Search, Particle Swarm Optimization (PSO), and Optuna, and statistical analyses based on difference-of-proportions tests confirmed that HEF delivers superior results compared to a unimetric reference function, regardless of the optimizer employed, with particular relevance for heterogeneous monthly time series (M3) and highly granular daily demand scenarios (M5). The findings demonstrate that HEF improves stability, generalization, and robustness at low computational cost, consolidating its role as a reliable evaluation framework that enhances model selection, enables more accurate demand forecasts, and supports decision-making in dynamic, competitive business environments.
Published: 2025-08-18T16:25:49Z
Comment: 31 pages, 15 figures, 25 tables. Submitted as a preprint.
The manuscript introduces the Hierarchical Evaluation Function, a multi-metric framework for optimizing demand forecasting models under high uncertainty. Includes extensive experimental validation using real-world datasets and a comparative analysis against classical and modern methods
Authors: Adolfo González, Víctor Parada
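
The three metrics named in the Hierarchical Evaluation Function entry above (R2, RMSE, MAE) can be computed and combined as in the sketch below. The abstract does not give HEF's actual formula, so the combination rule here (a weighted sum in which lower is better) and its weights are purely hypothetical assumptions for illustration; only the metric definitions are standard.

```python
import math


def forecast_metrics(y_true, y_pred):
    """Return (R2, RMSE, MAE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    return r2, rmse, mae


def composite_score(y_true, y_pred, w_r2=0.4, w_rmse=0.3, w_mae=0.3):
    """Hypothetical multi-metric objective (NOT the paper's HEF); lower is better."""
    r2, rmse, mae = forecast_metrics(y_true, y_pred)
    return w_r2 * (1.0 - r2) + w_rmse * rmse + w_mae * mae


actual = [10, 12, 14, 16]
forecast = [11, 12, 13, 20]
r2, rmse, mae = forecast_metrics(actual, forecast)
print(r2, rmse, mae, composite_score(actual, forecast))
```

An optimizer such as grid search would then pick the hyperparameters that minimize `composite_score` instead of a single error metric; in practice the metrics would also need to be put on comparable scales before weighting.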