http://arxiv.org/abs/2511.17922v1 GROOT: General-Purpose Automatic Parameter Tuning Across Layers, Domains, and Use Cases 2025-11-22T05:37:57Z

Modern software systems are executed on a runtime stack with layers (virtualization, storage, trusted execution, etc.) each incurring an execution and/or monetary cost, which may be mitigated by finding suitable parameter configurations. While specialized parameter tuners exist, they are tied to a particular domain or use case, fixed in type and number of optimization goals, or focused on a specific layer or technology. These limitations pose significant adoption hurdles for specialized and innovative ventures (SIVs) that address a variety of domains and use cases, operate under strict cost-performance constraints requiring tradeoffs, and rely on self-hosted servers with custom technology stacks while having little data or expertise to set up and operate specialized tuners. In this paper, we present Groot, a general-purpose configuration tuner designed to a) be explicitly agnostic of a particular domain or use case, b) balance multiple potentially competing optimization goals, c) support different custom technology setups, and d) make minimal assumptions about parameter types, ranges, or suitable values. Our evaluation on both real-world use cases and benchmarks shows that Groot reliably improves performance and reduces resource consumption in scenarios representative of SIVs.

2025-11-22T05:37:57Z International Conferences on Applied Computing 2025 and WWW/Internet 2025 Robert Krahn Josia Mädler Christoph Seidl Christof Fetzer http://arxiv.org/abs/2511.19464v1 Temperature in SLMs: Impact on Incident Categorization in On-Premises Environments 2025-11-21T19:37:09Z

SOCs and CSIRTs face increasing pressure to automate incident categorization, yet the use of cloud-based LLMs introduces costs, latency, and confidentiality risks. We investigate whether locally executed SLMs can meet this challenge. We evaluated 21 models ranging from 1B to 20B parameters, varying the temperature hyperparameter and measuring execution time and precision across two distinct architectures. The results indicate that temperature has little influence on performance, whereas the number of parameters and GPU capacity are decisive factors.
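For readers unfamiliar with the hyperparameter, temperature is commonly applied as a divisor on a model's output logits before they are turned into sampling probabilities. The following is a minimal sketch of that general mechanism, not code from the paper; the function name and the example logits are illustrative assumptions.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(example_logits, 0.5)  # lower temperature
flat = softmax_with_temperature(example_logits, 2.0)   # higher temperature
print(sharp, flat)
```

A lower temperature concentrates probability on the highest-scoring token, while a higher one spreads it more evenly across candidates.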

2025-11-21T19:37:09Z 5 pages, 3 figures, 2 tables, submitted to ERRC/WRSeg 2025 Marcio Pohlmann Alex Severo Gefté Almeida Diego Kreutz Tiago Heinrich Lourenço Pereira http://arxiv.org/abs/2511.17265v1 DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format 2025-11-21T14:13:16Z

Nowadays, we are witnessing an Artificial Intelligence revolution that dominates the technology landscape in various application domains, such as healthcare, robotics, automotive, security, and defense. Massive-scale AI models, which mimic the human brain's functionality, typically feature millions and even billions of parameters through data-intensive matrix multiplication tasks. While conventional Von-Neumann architectures struggle with the memory wall and the end of Moore's Law, these AI applications are migrating rapidly towards the edge, such as in robotics and unmanned aerial vehicles for surveillance, thereby adding more constraints to the hardware budget of AI architectures at the edge. Although in-memory computing has been proposed as a promising solution for the memory wall, both analog and digital in-memory computing architectures suffer from substantial degradation of the proposed benefits due to various design limitations. We propose a new digital in-memory stochastic computing architecture, DISCA, utilizing a compressed version of the quasi-stochastic Bent-Pyramid data format. DISCA inherits the same computational simplicity of analog computing, while preserving the same scalability, productivity, and reliability of digital systems. Post-layout modeling results of DISCA show an energy efficiency of 3.59 TOPS/W per bit at 500 MHz using a commercial 180nm CMOS technology. Therefore, DISCA significantly improves the energy efficiency for matrix multiplication workloads by orders of magnitude if scaled and compared to its counterpart architectures.

2025-11-21T14:13:16Z 6 pages, 5 figures Shady Agwa Yikang Shen Shiwei Wang Themis Prodromakis http://arxiv.org/abs/2511.16890v1 An Introductory Study on the Power Consumption Overhead of ERC-4337 Bundlers 2025-11-21T02:11:03Z

Ethereum is currently the main blockchain ecosystem providing decentralised trust guarantees for applications ranging from finance to e-government. A common criticism of blockchain networks has been their energy consumption and operational costs. The switch from Proof-of-Work (PoW) protocol to Proof-of-Stake (PoS) protocol has significantly reduced this issue, though concerns remain, especially with network expansions via additional layers. The ERC-4337 standard is a recent proposal that facilitates end-user access to Ethereum-backed applications. It introduces a middleware called a bundler, operated as a third-party service, where part of its operational cost is represented by its power consumption. While bundlers have served over 500 million requests in the past two years, fewer than 15 official bundler providers exist, compared to over 100 regular Ethereum access providers. In this paper, we provide a first look at the active power consumption overhead that a bundler would add to an Ethereum access service. Using SmartWatts, a monitoring system leveraging Running Average Power Limit (RAPL) hardware interfaces, we empirically determine correlations between the bundler workload and its active power consumption.

2025-11-21T02:11:03Z Accepted for the 9th IEEE International Symposium on Electrical and Electronics Engineering, ISEEE 2025; Copyright is with IEEE Andrei Arusoaie Claudiu-Nicu Bărbieru Oana-Otilia Captarencu Paul-Flavian Diac Emanuel Onica Cosmin-Nicolae Vârlan http://arxiv.org/abs/2503.09114v2 Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge 2025-11-20T21:03:47Z

The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models, typically under 10 billion parameters, enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators, including memory usage, inference speed, and energy consumption, across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.

2025-03-12T07:01:34Z Paper currently under review in an ACM journal. This version reflects reviewer-driven revisions: calibrated power measurements validated with external hardware, updated figures and conclusions, added downstream benchmarks (HellaSwag, Winogrande, TruthfulQA, ARC), clarified hardware scope and cold-start behavior, corrected Orin GPU Q4_0 results, improved visuals, and discussed emerging GenAI NPUs Maximilian Abstreiter Sasu Tarkoma Roberto Morabito http://arxiv.org/abs/2511.16412v1 Algorithms and optimizations for global non-linear hybrid fluid-kinetic finite element stellarator simulations 2025-11-20T14:35:04Z

Predictive modeling of stellarator plasmas is crucial for advancing nuclear fusion energy, yet it faces unique computational difficulties. One of the main challenges is accurately simulating the dynamics of specific particle species that are not well captured by fluid models, which necessitates the use of hybrid fluid-kinetic models. The non-axisymmetric geometry of stellarators fundamentally couples the toroidal Fourier modes, in contrast to what happens in tokamaks, requiring different numerical and computational treatment. This work presents a novel, globally coupled projection scheme inside the JOREK finite element framework. The approach ensures a self-consistent and physically accurate transfer of kinetic markers to the fluid grid, effectively handling the complex 3D mesh by constructing and solving a unified linear system that encompasses all toroidal harmonics simultaneously. To manage the computational complexity of this coupling, the construction of the system's matrix is significantly accelerated using the Fast Fourier Transform (FFT). The efficient localization of millions of particles is made possible by implementing a 3D R-Tree spatial index, which supports this projection and ensures computational tractability at scale. On realistic Wendelstein 7-X stellarator geometries, the fidelity of the framework is rigorously shown. In sharp contrast to the uncoupled approaches' poor performance, quantitative convergence tests verify that the coupled scheme attains the theoretically anticipated spectral convergence. This study offers a crucial capability for the predictive analysis and optimization of next-generation stellarator designs by developing a validated, high-fidelity computational tool.

2025-11-20T14:35:04Z Luca Venerando Greco http://arxiv.org/abs/2511.19456v1 Optimizations on Graph-Level for Domain Specific Computations in Julia and Application to QED 2025-11-20T11:42:13Z

Complex computational problems in science often consist of smaller parts that can have largely distinct compute requirements from one another. For optimal efficiency, analyzing each subtask and scheduling it on the best-suited hardware would be necessary. Other considerations must be taken into account, too, such as parallelism, dependencies between different subtasks, and data transfer speeds between devices. To achieve this, directed acyclic graphs are often employed to represent these problems and enable utilizing as much hardware as possible on a given machine. In this paper, we present a software framework written in Julia capable of automatically and dynamically producing statically scheduled and compiled code. We lay theoretical foundations and add domain-specific information about the computation to the existing concepts of DAG scheduling, enabling optimizations that would otherwise be impossible. To illustrate the theory we implement an example application: the computation of matrix elements for scattering processes with many external particles in quantum electrodynamics.

2025-11-20T11:42:13Z Anton Reinhard Simeon Ehrig René Widera Michael Bussmann Uwe Hernandez Acosta http://arxiv.org/abs/2511.16041v1 Can Asymmetric Tile Buffering Be Beneficial? 2025-11-20T04:59:08Z

General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input $A$ along the dimension $M$ matches the output tile size of $C$. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2 AI Engine (AIE), achieving up to a 4.54x speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16--BF16 GEMM, establishing a new performance record for XDNA2 AIE.

2025-11-20T04:59:08Z Chengyue Wang Wesley Pang Xinrui Wu Gregory Jun Luis Romero Endri Taka Diana Marculescu Tony Nowatzki Pranathi Vasireddy Joseph Melber Deming Chen Jason Cong http://arxiv.org/abs/2511.15977v1 Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows 2025-11-20T02:14:56Z

Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
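The idea of treating task packing as a Knapsack problem can be illustrated with a small sketch. This is not the authors' scheduler: it assumes integer per-chromosome RAM predictions in GB, a single RAM budget, and an objective of filling that budget as fully as possible (a 0/1 knapsack with value equal to weight). The function name `pack_batch` and the example figures are hypothetical.

```python
from typing import Dict, List, Tuple


def pack_batch(predicted_ram_gb: Dict[str, int], budget_gb: int) -> List[str]:
    """Choose tasks whose predicted RAM sums as close to the budget as possible."""
    # best[cap] = (RAM used, chosen tasks) for the best subset fitting in `cap` GB.
    best: List[Tuple[int, List[str]]] = [(0, []) for _ in range(budget_gb + 1)]
    for task, ram in predicted_ram_gb.items():
        # Walk capacities downward so each task is selected at most once.
        for cap in range(budget_gb, ram - 1, -1):
            used, chosen = best[cap - ram]
            if used + ram > best[cap][0]:
                best[cap] = (used + ram, chosen + [task])
    return best[budget_gb][1]


# Hypothetical memory predictions (GB) for five chromosome-level tasks.
predictions = {"chr1": 12, "chr2": 11, "chr3": 9, "chr21": 3, "chr22": 2}
print(pack_batch(predictions, budget_gb=16))
```

In a dynamic setting, a scheduler of this kind would call such a routine again whenever running tasks finish and memory is freed, using updated predictions for the remaining tasks.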

2025-11-20T02:14:56Z Accepted at AAAI 2026 Daniel Mas Montserrat Ray Verma Míriam Barrabés Francisco M. de la Vega Carlos D. Bustamante Alexander G. Ioannidis http://arxiv.org/abs/2407.16026v3 KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer 2025-11-19T19:23:14Z

This paper explores the adaptation of Transformer-based models for edge devices through the quantisation and hardware acceleration of the ARM Keyword Transformer (KWT) model on a RISC-V platform. The model was targeted to run on 64kB RAM in bare-metal C using a custom-developed edge AI library. KWT-1 was retrained to be 369 times smaller, with only a 10% loss in accuracy through reducing output classes from 35 to 2. The retraining and quantisation reduced model size from 2.42 MB to 1.65 kB. The integration of custom RISC-V instructions that accelerated GELU and SoftMax operations enabled a 5x speedup and thus ~5x power reduction in inference, with inference clock cycle counts decreasing from 26 million to 5.5 million clock cycles while incurring a small area overhead of approximately 29%. The results demonstrate a viable method for porting and accelerating Transformer-based models in low-power IoT devices.

2024-07-22T20:07:21Z 6 pages, 7 figures, published in the IEEE SOCC 2024 conference Aness Al-Qawlaq Ajay Kumar M Deepu John 10.1109/SOCC62300.2024.10737828 http://arxiv.org/abs/2511.15503v1 A Tensor Compiler for Processing-In-Memory Architectures 2025-11-19T14:58:16Z

Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. To address this, we design DCC, the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction that enables various data distribution and processing strategies on different PIM backends. DCC enables effective co-optimization by mapping data partitioning strategies to compute loop partitions, applying PIM-specific code optimizations and leveraging a fast and accurate performance prediction model to select optimal configurations. Our evaluations in various individual ML kernels demonstrate that DCC achieves up to 7.68x speedup (2.7x average) on HBM-PIM and up to 13.17x speedup (5.75x average) on AttAcc PIM backend over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by up to 7.71x (4.88x average) over GPU.

2025-11-19T14:58:16Z Peiming Yang Sankeerth Durvasula Ivan Fernandez Mohammad Sadrosadati Onur Mutlu Gennady Pekhimenko Christina Giannoula http://arxiv.org/abs/2502.18137v8 SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference 2025-11-19T14:34:24Z

An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.

2025-02-25T12:02:17Z @inproceedings{zhang2025spargeattn, title={Spargeattn: Accurate sparse attention accelerating any model inference}, author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei}, booktitle={International Conference on Machine Learning (ICML)}, year={2025} } Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 2025 (ICML 2025) Jintao Zhang Chendong Xiang Haofeng Huang Jia Wei Haocheng Xi Jun Zhu Jianfei Chen http://arxiv.org/abs/2505.21185v3 Constructive community race: full-density spiking neural network model drives neuromorphic computing 2025-11-19T14:17:46Z

The local circuitry of the mammalian brain is a focus of the search for generic computational principles because it is largely conserved across species and modalities. In 2014 a model was proposed representing all neurons and synapses of the stereotypical cortical microcircuit below $1\,\text{mm}^2$ of brain surface. The model reproduces fundamental features of brain activity but its impact remained limited because of its computational demands. For theory and simulation, however, the model was a breakthrough because it removes uncertainties of downscaling, and larger models are less densely connected. This sparked a race in the neuromorphic computing community and the model became a de facto standard benchmark. Within a few years real-time performance was reached and surpassed at significantly reduced energy consumption. We review how the computational challenge was tackled by different simulation technologies and derive guidelines for the next generation of benchmarks and other domains of science.

2025-05-27T13:34:51Z 23 pages, 3 figures, 2 tables Neuromorph. Comput. Eng. 6 (2026) 012001 Johanna Senk Anno C. Kurth Steve Furber Tobias Gemmeke Bruno Golosio Arne Heittmann James C. Knight Eric Müller Tobias Noll Thomas Nowotny Gorka Peraza Coppola Luca Peres Oliver Rhodes Andrew Rowley Johannes Schemmel Tim Stadtmann Tom Tetzlaff Gianmarco Tiddia Sacha J. van Albada José Villamar Markus Diesmann 10.1088/2634-4386/ae379a http://arxiv.org/abs/2508.08343v3 A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving 2025-11-19T13:36:14Z

With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. The code is publicly available at https://github.com/FerranAgulloLopez/GPULLMAdapterOptimization.

2025-08-11T10:47:35Z Accepted in a computer science workshop Ferran Agullo Joan Oliveras Chen Wang Alberto Gutierrez-Torre Olivier Tardieu Alaa Youssef Jordi Torres Josep Ll. Berral http://arxiv.org/abs/2511.14400v2 PIM or CXL-PIM? Understanding Architectural Trade-offs Through Large-Scale Benchmarking 2025-11-19T04:13:40Z

Processing-in-memory (PIM) reduces data movement by executing near memory, but our large-scale characterization on real PIM hardware shows that end-to-end performance is often limited by disjoint host and device address spaces that force explicit staging transfers. In contrast, CXL-PIM provides a unified address space and cache-coherent access at the cost of higher access latency. These opposing interface models create workload-dependent tradeoffs that are not captured by small-scale studies. This work presents a side-by-side, large-scale comparison of PIM and CXL-PIM using measurements from real PIM hardware and trace-driven CXL modeling. We identify when unified-address access amortizes link latency enough to overcome transfer bottlenecks, and when tightly coupled PIM remains preferable. Our results reveal phase- and dataset-size regimes in which the relative ranking between the two architectures reverses, offering practical guidance for future near-memory system design.

2025-11-18T12:05:31Z I-Ting Lee Bao-Kai Wang Liang-Chi Chen Wen Sheng Lim Da-Wei Chang Yu-Ming Chang Chieng-Chung Ho