http://arxiv.org/abs/2603.16786v1 Elastic Sketch under Random Stationary Streams: Limiting Behavior and Near-Optimal Configuration 2026-03-17T17:00:01Z

\texttt{Elastic-Sketch} is a hash-based data structure for counting item appearances in a data stream, and it has been empirically shown to achieve a better memory-accuracy trade-off than classical methods. This algorithm combines a \textit{heavy block}, which aims to maintain exact counts for a small set of dynamically \textit{elected} items, with a \textit{light block} that implements \texttt{Count-Min} \texttt{Sketch} (\texttt{CM}) for summarizing the remaining traffic. The heavy block dynamics are governed by a hash function~$β$ that hashes items into~$m_1$ buckets, and an \textit{eviction threshold}~$λ$, which controls how easily an elected item can be replaced. We show that the performance of \texttt{Elastic-Sketch} strongly depends on the stream characteristics and the choice of~$λ$. Since optimal parameter choices depend on unknown stream properties, we analyze \texttt{Elastic-Sketch} under a \textit{stationary random stream} model -- a common assumption that captures the statistical regularities observed in real workloads. Formally, as the stream length goes to infinity, we derive closed-form expressions for the limiting distribution of the counters and the resulting expected counting error. These expressions are efficiently computable, enabling practical grid-based tuning of the memory split between the heavy and \texttt{CM} blocks (via~$m_1$) and of the eviction threshold~$λ$. We further characterize the structure of the optimal eviction threshold, substantially reducing the search space and showing how this threshold depends on the arrival distribution. Extensive numerical simulations validate our asymptotic results on finite streams from the Zipf distribution.
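
The heavy/light split described in this abstract can be illustrated with a toy version. This is a minimal sketch under stated assumptions, not the paper's implementation: the class names (`CountMin`, `ElasticSketchToy`), the salted use of Python's built-in `hash`, and the choice to reset the negative vote to 1 after an eviction are all illustrative. Only the roles of $m_1$ (number of heavy buckets) and $λ$ (eviction threshold) come from the text.

```python
import random


class CountMin:
    """Plain Count-Min sketch: `depth` rows of `width` counters, min over rows."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, salt):
        # Salted built-in hash stands in for independent hash functions (assumption).
        return hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, salt in zip(self.rows, self.salts):
            row[self._index(item, salt)] += count

    def estimate(self, item):
        return min(row[self._index(item, salt)]
                   for row, salt in zip(self.rows, self.salts))


class ElasticSketchToy:
    """Toy heavy block (m1 buckets, threshold lam) backed by a Count-Min light block."""

    HEAVY_SALT = 0x5EED  # hypothetical salt for the heavy-block hash

    def __init__(self, m1, lam, cm_width, cm_depth):
        self.m1 = m1
        self.lam = lam
        # Each bucket is None or [key, positive_votes, negative_votes, flag];
        # flag=True means part of the key's count may sit in the light block.
        self.heavy = [None] * m1
        self.light = CountMin(cm_width, cm_depth)

    def _bucket(self, item):
        return hash((self.HEAVY_SALT, item)) % self.m1

    def add(self, item):
        b = self._bucket(item)
        slot = self.heavy[b]
        if slot is None:                      # empty bucket: elect the item
            self.heavy[b] = [item, 1, 0, False]
        elif slot[0] == item:                 # elected item: exact increment
            slot[1] += 1
        else:
            slot[2] += 1                      # vote against the elected item
            if slot[2] >= self.lam * slot[1]:
                # Evict: move the old count to the light block, elect the newcomer.
                self.light.add(slot[0], slot[1])
                self.heavy[b] = [item, 1, 1, True]
            else:
                self.light.add(item)          # newcomer is counted in the light block

    def estimate(self, item):
        slot = self.heavy[self._bucket(item)]
        if slot is not None and slot[0] == item:
            return slot[1] + (self.light.estimate(item) if slot[3] else 0)
        return self.light.estimate(item)


sketch = ElasticSketchToy(m1=1, lam=2, cm_width=64, cm_depth=2)
for item in [1, 1, 2, 2, 2, 2]:
    sketch.add(item)
print(sketch.heavy[0][0], sketch.estimate(1), sketch.estimate(2))
```

With a single bucket and $λ = 2$, item 1 is elected first and is evicted once the votes against it reach twice its count. A larger $λ$ makes the elected item harder to replace, which is the sensitivity the abstract analyzes.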

2026-03-17T17:00:01Z Younes Ben Mazziane Vinay Kumar B. R. Othmane Marfoq http://arxiv.org/abs/2603.16164v1 AI Application Benchmarking: Power-Aware Performance Analysis for Vision and Language Models 2026-03-17T06:32:41Z

Artificial Intelligence (AI) workloads drive a rapid expansion of high-performance computing (HPC) infrastructures and increase their power and energy demands towards a critical level. AI benchmarks that represent state-of-the-art workloads, together with an understanding of their performance-energy trade-offs, are critical for deploying efficient infrastructures and can guide energy efficiency measures, such as power capping. We introduce a benchmarking framework with popular deep learning applications from computer vision (image classification and generation) and large language models (continued pre-training and inference) implementing modern methods. Our performance analysis focuses on throughput rather than on time to "completion", the standard metric in HPC. We analyse performance and energy efficiency under various power capping scenarios on NVIDIA H100, NVIDIA H200, and AMD MI300X GPUs. Our results reveal that no universal optimal power cap exists, as the efficiency peak varies across application types and GPU architectures. Interestingly, the two NVIDIA GPUs, which differ mainly in their HBM configuration, show qualitatively different performance-energy trade-offs. The developed benchmarking framework will be released as a public tool.

2026-03-17T06:32:41Z Martin Mayr Sebastian Wind Lukas Schröder Georg Hager Harald Köstler Gerhard Wellein http://arxiv.org/abs/2603.15699v1 This Is Taking Too Long -- Investigating Time as a Proxy for Energy Consumption of LLMs 2026-03-16T08:26:57Z

The energy consumption of Large Language Models (LLMs) is raising growing concerns due to their adverse effects on environmental stability and resource use. Yet, these energy costs remain largely opaque to users, especially when models are accessed through an API -- a black box in which all information depends on what providers choose to disclose. In this work, we investigate inference time measurements as a proxy to approximate the associated energy costs of API-based LLMs. We ground our approach by comparing our estimations with actual energy measurements from locally hosted equivalents. Our results show that time measurements allow us to infer GPU models for API-based LLMs, grounding our energy cost estimations. Our work aims to create means for understanding the associated energy costs of API-based LLMs, especially for end users.

2026-03-16T08:26:57Z This work was accepted at PerCom 2026 Lars Krupp Daniel Geißler Francisco M. Calatrava-Nicolas Vishal Banwari Paul Lukowicz Jakob Karolus http://arxiv.org/abs/2603.14633v1 When Scanners Lie: Evaluator Instability in LLM Red-Teaming 2026-03-15T22:08:16Z

Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring attack success rates (ASR) across attack types. Yet the validity of these measurements hinges on an often-overlooked component: the evaluator that determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluators. Our framework offers a practical approach to quantify unreliable evaluations and enhance the reliability of measurements in automated LLM security assessments.
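
The evaluator effect described in this abstract can be shown with simple arithmetic. This is a minimal sketch, not the paper's framework: the function names are hypothetical, the verdict lists are made-up toy values rather than results from the paper, and pairwise mismatch is only one plausible way to measure disagreement. Two evaluators judge the same four model outputs, so any gap in ASR comes from the evaluators alone.

```python
def attack_success_rate(verdicts):
    """Fraction of attack attempts that an evaluator marked as successful."""
    return sum(verdicts) / len(verdicts)


def disagreement_rate(verdicts_a, verdicts_b):
    """Fraction of outputs on which two evaluators reach different verdicts."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("evaluators must judge the same outputs")
    mismatches = sum(a != b for a, b in zip(verdicts_a, verdicts_b))
    return mismatches / len(verdicts_a)


# Toy verdicts on the same four outputs (True = attack judged successful).
evaluator_a = [True, True, True, False]
evaluator_b = [True, False, False, False]

print(attack_success_rate(evaluator_a))              # 0.75
print(attack_success_rate(evaluator_b))              # 0.25
print(disagreement_rate(evaluator_a, evaluator_b))   # 0.5
```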

2026-03-15T22:08:16Z Submitted to the EvalEval Workshop at ACL 2026 Lidor Erez Omer Hofman Tamir Nizri Roman Vainshtein http://arxiv.org/abs/2603.14019v1 MapReplay: Trace-Driven Benchmark Generation for Java HashMap 2026-03-14T16:46:09Z

Hash-based maps, particularly java.util.HashMap, are pervasive in Java applications and the JVM, making their performance critical. Evaluating optimizations is challenging because performance depends on factors such as operation patterns, key distributions, and resizing behavior. Microbenchmarks are fast and repeatable but often oversimplify workloads, failing to capture realistic usage patterns. Application benchmarks (e.g., DaCapo, Renaissance) provide realistic usage but are more expensive to run, prone to variability, and dominated by non-HashMap computations, making map-related performance changes difficult to observe. To address this challenge, we propose MapReplay, a benchmarking methodology that combines the realism of application benchmarks with the efficiency of microbenchmarks. MapReplay traces HashMap API usage, generating a replay workload that reproduces the same operation sequence while faithfully reconstructing internal map states. This enables efficient evaluation of alternative implementations under realistic usage patterns. Applied to DaCapo-Chopin and Renaissance, MapReplay yields a suite, MapReplayBench, that reproduces application-level performance trends while reducing experimentation time and revealing insights difficult to obtain from full benchmarks.

2026-03-14T16:46:09Z Filippo Schiavio Andrea Rosà Júnior Löff Lubomír Bulej Petr Tůma Walter Binder 10.1145/3777884.3797010 http://arxiv.org/abs/2603.13945v1 A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization 2026-03-14T13:36:15Z

Standard transport protocols like TCP operate as a blind, FIFO conveyor belt for data, a model that is increasingly suboptimal for latency-sensitive and interactive applications. This paper challenges this model by introducing CATS (Conductor-driven Asymmetric Transport Scheme), a framework that provides TCP with the semantic awareness necessary to prioritize critical content. By centralizing scheduling intelligence in a transport-native "Conductor", CATS significantly improves user-perceived performance by delivering essential data first. This architecture directly confronts a cascade of historical performance workarounds and their limitations, including the high overhead of parallel connections in HTTP/1.1, the transport-layer Head-of-Line blocking in HTTP/2, and the observed implementation heterogeneity of prioritization in HTTP/3 over QUIC. Built upon TCP BBR, our ns-3 implementation demonstrates this principle by reducing the First Contentful Paint by over 78% in a representative webpage download configured as a deliberate worst-case scenario, with no penalty to total page load time compared to the baseline.

2026-03-14T13:36:15Z 2025 6th International Conference on Innovative Computing (ICIC) Syed Muhammad Aqdas Rizvi 10.1109/ICIC68258.2025.11413235 http://arxiv.org/abs/2507.17618v2 SimLens for Early Exit in Large Language Models: Eliciting Accurate Latent Predictions with One More Token 2026-03-14T08:27:06Z

Intermediate-layer predictions in large language models (LLMs) are informative but hard to decode accurately, especially at early layers. Existing lens-style methods typically rely on direct linear readout, which is simple but often drifts away from the model's eventual prediction. We propose SimLens, a simple training-free decoder for single-token decision tasks that keeps only the start token and a candidate answer token ([s] and [a]) and performs one lightweight continuation through the remaining upper layers. This surprisingly small modification recovers much more accurate latent predictions than direct linear decoding. We further introduce Linear SimLens, a lightweight linear approximation for entropy-based confidence estimation, and combine the two in SimExit, a hybrid early-exit mechanism. On ARC, BoolQ, and HeadQA with LLaMA-7B and Vicuna-7B, SimLens improves Iso-Compute accuracy in all six settings, with an average gain of +0.43 even when fair compute includes the extra two-token post-forward overhead. SimExit yields an average 1.15$\times$ speedup at the best-accuracy operating points and 1.40$\times$ when allowing up to a 1 percentage-point accuracy drop. Ablations show that [s] and [a] play distinct roles as global condition and semantic anchor, respectively.

2025-07-23T15:49:03Z Ming Ma Bowen Zheng Zhongqiao Lin Tianming Yang http://arxiv.org/abs/2602.11506v3 RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis 2026-03-13T14:38:56Z

The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. However, objectively measuring the theoretical performance ceilings of diverse architectures across heterogeneous platforms remains a formidable challenge. In this work, we propose a systematic framework based on the Roofline model that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate. Extensive empirical analysis across diverse compute tiers reveals that variations in performance and OI are significantly influenced by sequence length. We further identify a critical regression in OI as model depth increases. Additionally, our findings highlight an efficiency trap induced by hardware heterogeneity and demonstrate how structural refinements, such as Multi-head Latent Attention (MLA), can effectively unlock latent inference potential across various hardware substrates. These insights provide actionable directions for hardware-software co-design to align neural structures with physical constraints in on-device intelligence. The released code is available in Appendix C.
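
For readers unfamiliar with the Roofline model that this abstract builds on, the classical bound is sketched below. It covers only the textbook Roofline relation, not the paper's Relative Inference Potential metric, whose definition is not given here. The function names and the hardware numbers in the usage lines are illustrative assumptions.

```python
def operational_intensity(flops, bytes_moved):
    """Operational intensity (OI): floating-point operations per byte of memory traffic."""
    return flops / bytes_moved


def attainable_performance(oi, peak_flops, mem_bandwidth):
    """Classical Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, oi * mem_bandwidth)


def ridge_point(peak_flops, mem_bandwidth):
    """OI at which a kernel moves from memory-bound to compute-bound."""
    return peak_flops / mem_bandwidth


# Made-up device: 100 units of peak compute, 10 units of memory bandwidth.
PEAK, BANDWIDTH = 100.0, 10.0

print(ridge_point(PEAK, BANDWIDTH))                   # 10.0
print(attainable_performance(2.0, PEAK, BANDWIDTH))   # 20.0  (memory-bound)
print(attainable_performance(50.0, PEAK, BANDWIDTH))  # 100.0 (compute-bound)
```

A workload whose OI sits below the ridge point is limited by memory bandwidth, so changes that raise OI move it up the sloped part of the roof.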

2026-02-12T03:02:22Z Zhen Bi Xueshu Chen Luoyang Sun Yuhang Yao Qing Shen Jungang Lou Cheng Deng http://arxiv.org/abs/1704.05867v5 A note on integrating products of linear forms over the unit simplex 2026-03-13T06:58:21Z

Integrating a product of linear forms over the unit simplex can be done in polynomial time if the number of variables n is fixed (V. Baldoni et al., 2011). In this note, we highlight that this problem is equivalent to obtaining the normalizing constant of state probabilities for a popular class of Markov processes used in queueing network theory. In light of this equivalence, we survey existing computational algorithms developed in queueing theory that can be used for exact integration. For example, under some regularity conditions, queueing theory algorithms can exactly integrate a product of linear forms of total degree N by solving N systems of linear equations.

2017-04-19T18:05:04Z Giuliano Casale http://arxiv.org/abs/2603.12465v1 TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition 2026-03-12T21:30:07Z

Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.

2026-03-12T21:30:07Z Accepted at IEEE ISPASS 2026. Copyright assigned to IEEE Prabhu Vellaisamy Shreesh Tripathi Vignesh Natarajan Surya Santhan Thenarasu Shawn Blanton John P. Shen http://arxiv.org/abs/2509.21619v2 PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters 2026-03-12T18:51:19Z

Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (higher change in weights) takes place in the earlier stage of the training loop. As training progresses, these changes stabilize, suggesting that the resulting updates may be amenable to approximation using low intrinsic-rank matrices. Therefore, we propose an approach to identify such states of partial convergence and dynamically switch from full parameter training to Low Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assign a rank specific to each module layer based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of its original size, resulting in a 3x improvement in throughput, and a 1.5x reduction in average training time per epoch while also reducing GPU memory consumption by 20%.

2025-09-25T21:34:17Z 13 pages, 8 figures, 2 algorithms, workshop paper Krishu K Thapa Reet Barik Krishna Teja Chitty-Venkata Murali Emani Venkatram Vishwanath http://arxiv.org/abs/2504.18047v2 Spatiotemporal Analysis of Parallelized Computing at the Extreme Edge 2026-03-11T22:26:19Z

Extreme Edge Computing (EEC) pushes computing even closer to end users than traditional Multi-access Edge Computing (MEC), harnessing the idle resources of Extreme Edge Devices (EEDs) to enable low-latency, distributed processing. However, EEC faces key challenges, including spatial randomness in device distribution, limited EED computational power necessitating parallel task execution, vulnerability to failure, and temporal randomness due to variability in wireless communication and execution times. These challenges highlight the need for a rigorous analytical framework to evaluate EEC performance. We present the first spatiotemporal mathematical model for EEC over large-scale millimeter-wave networks. Utilizing stochastic geometry and an Absorbing Continuous-Time Markov Chain (ACTMC), the framework captures the complex interaction between communication and computation performance, including their temporal overlap during parallel execution. We evaluate two key metrics: average task response delay and task completion probability. Together, they provide a holistic view of latency and reliability. The analysis considers fundamental offloading strategies, including randomized and location-aware schemes, while accounting for EED failures. Results show that there exists an optimal task segmentation that minimizes delay. Under limited EED availability, we investigate a bias-based EEC and MEC collaboration that offloads excess demand to MEC resources, effectively reducing congestion and improving system responsiveness.

2025-04-25T03:30:30Z This work has been accepted for publication in IEEE Transactions on Mobile Computing Yasser Nabil Mahmoud Abdelhadi Sameh Sorour Hesham ElSawy Sara A. Elsayed Hossam S. Hassanein 10.1109/TMC.2026.3673215 http://arxiv.org/abs/2603.11340v1 Improving LLM Performance Through Black-Box Online Tuning: A Case for Adding System Specs to Factsheets for Trusted AI 2026-03-11T22:13:43Z

In this paper, we present a novel black-box online controller that maximizes goodput, defined as the throughput of requests that satisfy the service-level objective. The controller uses hill climbing driven only by end-to-end measurements over short segments, with no internal instrumentation. We provide empirical evidence that this design is well-founded. Using this advance in LLM serving as a concrete example, we then discuss the importance of integrating system performance and sustainability metrics into Factsheets for organizations adopting AI systems.
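
The control loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' controller: the tuned knob is assumed to be a single integer setting (for example a concurrency limit), `measure_goodput` is a hypothetical callback that runs one short segment at that setting and returns the observed goodput, and the stopping rule (halt once both neighbours fail to improve) is a simplification.

```python
def goodput(latencies, slo, window):
    """Requests per second that met the latency SLO within a measurement window."""
    return sum(1 for latency in latencies if latency <= slo) / window


def hill_climb(measure_goodput, start, step=1, lo=1, hi=64, max_rounds=100):
    """Move the knob one step at a time, keeping only moves that raise goodput."""
    x = start
    best = measure_goodput(x)
    direction = 1      # +1 tries a larger setting, -1 a smaller one
    failures = 0       # consecutive non-improving probes
    for _ in range(max_rounds):
        if failures == 2:          # neither neighbour is better: stop
            break
        candidate = min(hi, max(lo, x + direction * step))
        if candidate != x:
            observed = measure_goodput(candidate)
            if observed > best:
                x, best, failures = candidate, observed, 0
                continue
        # Probe failed (or hit a bound): try the other direction next.
        direction, failures = -direction, failures + 1
    return x, best


# Synthetic, noise-free goodput curve that peaks at a setting of 12.
print(goodput([0.1, 0.5, 2.0], slo=1.0, window=2.0))        # 1.0
print(hill_climb(lambda c: 200 - (c - 12) ** 2, start=4))   # (12, 200)
```

Real measurements are noisy, so a practical controller would likely average several segments or require a minimum improvement before accepting a move. Those refinements are left out here.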

2026-03-11T22:13:43Z Yonas Atinafu Henry Lin Robin Cohen http://arxiv.org/abs/2603.10765v1 RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems 2026-03-11T13:41:26Z

We present the design and implementation of a RAG-based AI system benchmarking (RAGPerf) framework for characterizing the system behaviors of RAG pipelines. To facilitate detailed profiling and fine-grained performance analysis, RAGPerf decouples the RAG workflow into several modular components: embedding, indexing, retrieval, reranking, and generation. RAGPerf offers the flexibility for users to configure the core parameters of each component and examine their impact on the end-to-end query performance and quality. RAGPerf has a workload generator to model real-world scenarios by supporting diverse datasets (e.g., text, PDF, code, and audio), different retrieval and update ratios, and query distributions. RAGPerf also supports different embedding models, major vector databases such as LanceDB, Milvus, Qdrant, Chroma, and Elasticsearch, as well as different LLMs for content generation. It automates the collection of performance metrics (i.e., end-to-end query throughput, host/GPU memory footprint, and CPU/GPU utilization) and accuracy metrics (i.e., context recall, query accuracy, and factual consistency). We demonstrate the capabilities of RAGPerf through a comprehensive set of experiments and open-source its codebase on GitHub. Our evaluation shows that RAGPerf incurs negligible performance overhead.

2026-03-11T13:41:26Z The codebase of RAGPerf is available at https://github.com/platformxlab/RAGPerf Shaobo Li Yirui Zhou Yuan Xu Kevin Chen Daniel Waddington Swaminathan Sundararaman Hubertus Franke Jian Huang http://arxiv.org/abs/2603.09642v1 Multi-DNN Inference of Sparse Models on Edge SoCs 2026-03-10T13:16:59Z

Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance from both concurrent execution and from matching each model to the most suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which impedes the efficiency of this matching and results in high Service Level Objective violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.

2026-03-10T13:16:59Z Jiawei Luo Di Wu Simon Dobson Blesson Varghese