http://arxiv.org/abs/2511.16682v2 Bench360: Benchmarking Local LLM Inference from 360 Degrees 2026-01-14T08:53:59Z

Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best option, underscoring the need for comprehensive, deployment-oriented benchmarks.

2025-11-12T09:57:21Z Linus Stuhlmann Mauricio Fadel Argerich Jonathan Fürst http://arxiv.org/abs/2601.08960v1 LookAhead: The Optimal Non-decreasing Index Policy for a Time-Varying Holding Cost problem 2026-01-13T20:00:03Z

In practice, the cost of delaying a job can grow as the job waits. Such behavior is modeled by the Time-Varying Holding Cost (TVHC) problem, where each job's instantaneous holding cost increases with its current age (a job's age is the time since it arrived). The goal of the TVHC problem is to find a scheduling policy that minimizes the time-average total holding cost across all jobs. However, no optimality results are known for the TVHC problem outside of the asymptotic regime. In this paper, we study a simple yet still challenging special case: a two-class M/M/1 queue in which class 1 jobs incur a non-decreasing, time-varying holding cost and class 2 jobs incur a constant holding cost. Our main contribution is deriving the first optimal (non-decreasing) index policy for this special case of the TVHC problem. Our optimal policy, called LookAhead, stems from the following idea: rather than considering each job's current holding cost when making scheduling decisions, we should look at their cost some $X$ time into the future, where this $X$ is intuitively called the "lookahead amount." This paper derives that optimal lookahead amount.
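The lookahead idea can be illustrated with a minimal Python sketch. It is not the paper's policy: the function names, the example cost function, and the value of the lookahead amount are hypothetical, and the optimal $X$ derived in the paper is not reproduced here. The sketch only shows the mechanism described in the abstract, namely comparing a class 1 job's holding cost at age plus $X$ against the constant class 2 cost.

```python
from typing import Callable


def lookahead_index(cost: Callable[[float], float], age: float, lookahead: float) -> float:
    """Index of a job: its holding cost evaluated `lookahead` time units ahead."""
    return cost(age + lookahead)


def pick_class(age1: float,
               cost1: Callable[[float], float],
               cost2_const: float,
               lookahead: float) -> int:
    """Return the class (1 or 2) to serve next under the illustrative rule.

    Assumption: ties go to class 1, and service rates are ignored.
    """
    index1 = lookahead_index(cost1, age1, lookahead)
    return 1 if index1 >= cost2_const else 2


# Hypothetical non-decreasing cost for class 1: grows linearly with age.
def linear_cost(age: float) -> float:
    return 0.5 * age


# With no lookahead the class 1 job (age 4) looks cheaper than class 2 (cost 3);
# looking 3 time units ahead, its future cost exceeds the class 2 cost.
choice_now = pick_class(4.0, linear_cost, 3.0, lookahead=0.0)
choice_ahead = pick_class(4.0, linear_cost, 3.0, lookahead=3.0)
```

The only design choice here is that the decision uses the future cost rather than the current one; everything else about the policy, including how $X$ should depend on the system parameters, is left to the paper.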

2026-01-13T20:00:03Z To be published in Queueing Systems Keerthana Gurushankar Zhouzi Li Mor Harchol-Balter Alan Scheller-Wolf http://arxiv.org/abs/2601.08539v1 Reducing Compute Waste in LLMs through Kernel-Level DVFS 2026-01-13T13:26:57Z

The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique used to reduce energy consumption, and thus improve energy-efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of training and inference of Large Language Models (LLMs) through DVFS or power capping is feasible: related work has shown energy savings can be significant, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations: i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level, DVFS approach that explores new frequency configurations, and prove these save more energy than previous, pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach could reduce energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effect of data and tensor parallelism, and show our discovered clock frequencies translate well for both. We conclude that kernel-level DVFS is a suitable technique to reduce waste in LLM operations, providing significant energy savings with negligible slow-down.

2026-01-13T13:26:57Z Jeffrey Spaan Kuan-Hsun Chen Ana-Lucia Varbanescu http://arxiv.org/abs/2508.07640v2 Taming Cold Starts: Proactive Serverless Scheduling with Model Predictive Control 2026-01-13T11:43:04Z

Serverless computing has transformed cloud application deployment by introducing a fine-grained, event-driven execution model that abstracts away infrastructure management. Its on-demand nature makes it especially appealing for latency-sensitive and bursty workloads. However, the cold start problem, in which the platform incurs a significant delay when provisioning new containers, remains the Achilles' heel of such platforms. This paper presents a predictive serverless scheduling framework based on Model Predictive Control to proactively mitigate cold starts, thereby improving end-to-end response time. By forecasting future invocations, the controller jointly optimizes container prewarming and request dispatching, improving latency while minimizing resource overhead. We implement our approach on Apache OpenWhisk, deployed on a Kubernetes-based testbed. Experimental results using real-world function traces and synthetic workloads demonstrate that our method significantly outperforms state-of-the-art baselines, achieving up to 85% lower tail latency and a 34% reduction in resource usage.

2025-08-11T05:45:28Z 8 pages, 8 figures, preprint accepted at MASCOTS 2025 33rd International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication System (MASCOTS 2025) Chanh Nguyen Monowar Bhuyan Erik Elmroth 10.1109/MASCOTS67699.2025.11283271 http://arxiv.org/abs/2601.08374v1 Shifting the Sweet Spot: High-Performance Matrix-Free Method for High-Order Elasticity 2026-01-13T09:36:02Z

In high-order finite element analysis for elasticity, matrix-free (PA) methods are a key technology for overcoming the memory bottleneck of traditional Full Assembly (FA). However, existing implementations fail to fully exploit the special structure of modern CPU architectures and tensor-product elements, causing their performance "sweet spot" to anomalously remain at the low order of $p \approx 2$, which severely limits the potential of high-order methods. To address this challenge, we design and implement a highly optimized PA operator within the MFEM framework, deeply integrated with a Geometric Multigrid (GMG) preconditioner. Our multi-level optimization strategy includes replacing the original $O(p^6)$ generic algorithm with an efficient $O(p^4)$ one based on tensor factorization, exploiting Voigt symmetry to reduce redundant computations for the elasticity problem, and employing macro-kernel fusion to enhance data locality and break the memory bandwidth bottleneck. Extensive experiments on mainstream x86 and ARM architectures demonstrate that our method successfully shifts the performance "sweet spot" to the higher-order region of $p \ge 6$. Compared to the MFEM baseline, the optimized core operator (kernel) achieves speedups of 7x to 83x, which translates to a 3.6x to 16.8x end-to-end performance improvement in the complete solution process. This paper provides a validated and efficient practical path for conducting large-scale, high-order elasticity simulations on mainstream CPU hardware.

2026-01-13T09:36:02Z Dali Chang Chong Zhang Kaiqi Zhang Mingguan Yang Huiyuan Li Weiqiang Kong http://arxiv.org/abs/2503.23988v2 Deep Learning Model Deployment in Multiple Cloud Providers: an Exploratory Study Using Low Computing Power Environments 2026-01-11T21:19:16Z

The deployment of Machine Learning models in the cloud has grown among tech companies. Hardware requirements are higher when these models involve Deep Learning techniques, and the cloud providers' costs may be a barrier. We explore deploying Deep Learning models across three of the major cloud providers (Amazon Web Services, Google Cloud Platform, and Microsoft Azure), using the GECToR model, a Deep Learning solution for Grammatical Error Correction, for our experiments. We evaluate real-time latency, hardware usage, and cost at each cloud provider in 7 execution environments with 10 experiments reproduced. We found that while Graphics Processing Units (GPUs) excel in performance, they had an average cost 300% higher than solutions without a GPU. Our analysis also suggests that processor cache memory size is a key variable for CPU-only deployments, and setups with sufficient cache achieved a 50% cost reduction compared to GPU-based deployments. This study indicates the feasibility and affordability of cloud-based Deep Learning inference solutions without a GPU, benefiting resource-constrained users such as startups and small research groups.

2025-03-31T11:58:37Z 15 pages, 7 figures Elayne Lemos Rodrigo Oliveira Jairson Rodrigues Rosalvo F. Oliveira Neto http://arxiv.org/abs/2507.14959v3 Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices 2026-01-11T20:41:32Z

Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.

2025-07-20T13:39:50Z Accepted at the IEEE/CVF winter conference on applications of computer vision (WACV 2026) Saeid Ghafouri Mohsen Fayyaz Xiangchen Li Deepu John Bo Ji Dimitrios Nikolopoulos Hans Vandierendonck http://arxiv.org/abs/2601.06886v1 Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM 2026-01-11T12:20:48Z

Accurate performance prediction is essential for optimizing scientific applications on modern high-performance computing (HPC) architectures. Widely used performance models primarily focus on cache and memory bandwidth, which is suitable for many memory-bound workloads. However, this focus is unsuitable for highly arithmetic-intensive cases such as sum-factorization with tensor $n$-mode product kernels, an optimization technique for high-order finite element methods (FEM). On processors with relatively high single instruction multiple data (SIMD) instruction latency, such as the Fujitsu A64FX, the performance of these kernels is strongly influenced by loop-body splitting strategies. Memory-bandwidth-oriented models are therefore not appropriate for evaluating these splitting configurations, and a model that directly reflects instruction-level efficiency is required. To address this need, we develop a dependency-chain-based analytical formulation that links loop-splitting configurations to instruction dependencies in the tensor $n$-mode product kernel. We further use XGBoost to estimate key parameters in the analytical model that are difficult to model explicitly. Evaluations show that the learning-augmented model outperforms the widely used standard Roofline and Execution-Cache-Memory (ECM) models. On the Fujitsu A64FX processor, the learning-augmented model achieves mean absolute percentage errors (MAPE) between 1% and 24% for polynomial orders ($P$) from 1 to 15. In comparison, the standard Roofline and ECM models yield errors of 42%-256% and 5%-117%, respectively. On the Intel Xeon Gold 6230 processor, the learning-augmented model achieves MAPE values from 1% to 13% for $P$=1 to $P$=14, and 24% at $P$=15. In contrast, the standard Roofline and ECM models produce errors of 1%-73% and 8%-112% for $P$=1 to $P$=15, respectively.

2026-01-11T12:20:48Z This work has been submitted to the IEEE for possible publication Xuanzhengbo Ren Yuta Kawai Tetsuya Hoshino Hirofumi Tomita Takahiro Katagiri Daichi Mukunoki Seiya Nishizawa 10.1109/ACCESS.2026.3675604 http://arxiv.org/abs/2601.06591v1 Modeling Tradeoffs between mobility, cost, and performance in Edge Computing 2026-01-10T15:02:52Z

Edge computing provides a cloud-like architecture where small-scale resources are distributed near the network edge, enabling applications on resource-constrained devices to offload latency-critical computations to these resources. While some recent work showed that the resource constraints of the edge could result in higher end-to-end latency under medium to high utilization due to higher queuing delays, to the best of our knowledge, there has not been any work on modeling the trade-offs of deploying on edge versus cloud infrastructures in the presence of mobility. Understanding the costs and trade-offs of this architecture is important for network designers, as the architecture is now being adopted as part of 5G and beyond networks in the form of Multi-access Edge Computing (MEC). In this paper we focus on quantifying and estimating the cost of edge computing. Using closed-form queuing models, we explore the cost-performance trade-offs in the presence of different system dynamics. We model how workload mobility and workload variations influence these trade-offs, and validate our results with realistic experiments and simulations. Finally, we discuss the practical implications for designing edge systems and developing algorithms for efficient resource and workload management.
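The utilization effect mentioned in this abstract can be illustrated with a small Python sketch. It assumes the simplest closed-form model, an M/M/1 queue with mean response time 1/(mu - lambda), plus a fixed network round-trip time; the abstract does not say which queuing models the paper uses, and all names and numbers below are hypothetical.

```python
def end_to_end_latency(network_rtt: float, service_rate: float, arrival_rate: float) -> float:
    """Mean end-to-end latency in seconds: network RTT plus M/M/1 response time.

    Assumption: one M/M/1 queue per site, with mean response time
    1 / (service_rate - arrival_rate), valid only for a stable queue.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return network_rtt + 1.0 / (service_rate - arrival_rate)


# Hypothetical sites: the edge is close but small, the cloud is far but large.
EDGE = {"network_rtt": 0.005, "service_rate": 50.0}
CLOUD = {"network_rtt": 0.050, "service_rate": 200.0}


def better_site(arrival_rate: float) -> str:
    """Return which site has the lower mean latency at this arrival rate."""
    edge = end_to_end_latency(arrival_rate=arrival_rate, **EDGE)
    cloud = end_to_end_latency(arrival_rate=arrival_rate, **CLOUD)
    return "edge" if edge < cloud else "cloud"


low_load = better_site(10.0)   # edge utilization 20%
high_load = better_site(45.0)  # edge utilization 90%
```

With these made-up parameters the edge wins at low load because of its short network path, while at high utilization its queuing delay outweighs that advantage. Mobility and cost, which the paper also models, are outside this sketch.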

2026-01-10T15:02:52Z Muhammad Danish Waseem Ahmed Ali-Eldin http://arxiv.org/abs/2601.01353v2 Benchmarking Quantum Data Center Architectures: A Performance and Scalability Perspective 2026-01-10T00:22:02Z

Scalable distributed quantum computing (DQC) has motivated the design of multiple quantum data-center (QDC) architectures that overcome the limitations of single quantum processors through modular interconnection. While these architectures adopt fundamentally different design philosophies, their relative performance under realistic quantum hardware constraints remains poorly understood. In this paper, we present a systematic benchmarking study of four representative QDC architectures (QFly, BCube, Clos, and Fat-Tree), quantifying their impact on distributed quantum circuit execution latency, resource contention, and scalability. Focusing on quantum-specific effects absent from classical data-center evaluations, we analyze how optical-loss-induced Einstein-Podolsky-Rosen (EPR) pair generation delays, coherence-limited entanglement retry windows, and contention from teleportation-based non-local gates shape end-to-end execution performance. Across diverse circuit workloads, we evaluate how architectural properties such as path diversity, path length, and shared BSM (Bell State Measurement) resources interact with optical-switch insertion loss and reconfiguration delay. Our results show that distributed quantum performance is jointly shaped by topology, scheduling policies, and physical-layer parameters, and that these factors interact in nontrivial ways. Together, these insights provide quantitative guidance for the design of scalable and high-performance quantum data-center architectures for DQC.

2026-01-04T03:48:02Z Shahrooz Pouryousef Eneet Kaur Hassan Shapourian Don Towsley Ramana Kompella Reza Nejabati http://arxiv.org/abs/2601.06349v1 Fixing ill-formed UTF-16 strings with SIMD instructions 2026-01-09T23:09:42Z

UTF-16 is a widely used Unicode encoding representing characters with one or two 16-bit code units. The format relies on surrogate pairs to encode characters beyond the Basic Multilingual Plane, requiring a high surrogate followed by a low surrogate. Ill-formed UTF-16 strings -- where surrogates are mismatched -- can arise from data corruption or improper encoding, posing security and reliability risks. Consequently, programming languages such as JavaScript include functions to fix ill-formed UTF-16 strings by replacing mismatched surrogates with the Unicode replacement character (U+FFFD). We propose using Single Instruction, Multiple Data (SIMD) instructions to handle multiple code units in parallel, enabling faster and more efficient execution. Our software is part of the Google JavaScript engine (V8) and thus part of several major Web browsers.
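For reference, the repair rule described in this abstract can be written as a scalar Python sketch. This is not the paper's SIMD algorithm: it processes one 16-bit code unit at a time, and the function names are hypothetical. It shows only the semantics that a vectorized implementation would have to preserve, replacing every unpaired surrogate with U+FFFD.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def to_well_formed(units: list[int]) -> list[int]:
    """Return a copy of `units` with every unpaired surrogate replaced by U+FFFD."""
    out = list(units)
    i = 0
    n = len(out)
    while i < n:
        unit = out[i]
        if is_high_surrogate(unit):
            if i + 1 < n and is_low_surrogate(out[i + 1]):
                i += 2  # valid surrogate pair: keep both units
                continue
            out[i] = REPLACEMENT  # high surrogate without a following low one
        elif is_low_surrogate(unit):
            out[i] = REPLACEMENT  # low surrogate not preceded by a high one
        i += 1
    return out


# 'A', a valid pair (U+1F600), a lone high surrogate, 'B', a lone low surrogate
sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
fixed = to_well_formed(sample)
```

The output always has the same length as the input, because each unpaired surrogate is replaced by exactly one code unit.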

2026-01-09T23:09:42Z Robert Clausecker Daniel Lemire http://arxiv.org/abs/2601.05205v1 EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI 2026-01-08T18:31:11Z

Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.

2026-01-08T18:31:11Z 6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026] Zain Iqbal Lorenzo Valerio http://arxiv.org/abs/2601.04904v1 Parallel Quadratic Selected Inversion in Quantum Transport Simulation 2026-01-08T13:03:56Z

Driven by Moore's Law, the dimensions of transistors have been pushed down to the nanometer scale. Advanced quantum transport (QT) solvers are required to accurately simulate such nano-devices. The non-equilibrium Green's function (NEGF) formalism lends itself optimally to these tasks, but it is computationally very intensive, involving the selected inversion (SI) of matrices and the selected solution of quadratic matrix (SQ) equations. Existing algorithms to tackle these numerical problems, e.g., the so-called recursive Green's function (RGF) technique, are ideally suited to GPU acceleration, but they are typically sequential, require block-tridiagonal (BT) matrices as inputs, and their implementation has so far been restricted to shared-memory parallelism, thus limiting the achievable device sizes. To address these shortcomings, we introduce distributed methods that build on RGF and enable parallel selected inversion and selected solution of the quadratic matrix equation. We further extend them to handle BT matrices with an arrowhead, which allows for the investigation of multi-terminal transistor structures. We evaluate the performance of our approach on a real dataset from the QT simulation of a nano-ribbon transistor and compare it with the sparse direct package PARDISO. When scaling to 16 GPUs, our fused SI and SQ solver is 5.2x faster than the SI module of PARDISO applied to a device 16x shorter. These results highlight the potential of our method to accelerate NEGF-based nano-device simulations.

2026-01-08T13:03:56Z 12 pages, 9 figures Vincent Maillou Matthias Bollhofer Olaf Schenk Alexandros Nikolaos Ziogas Mathieu Luisier http://arxiv.org/abs/2602.17670v1 The Dark Side of Dark Mode -- User behaviour rebound effects and consequences for digital energy consumption 2026-01-08T10:51:30Z

User devices are the largest contributor to media-related global emissions. For web content, dark mode has been widely recommended as an energy-saving measure for certain display types. However, the energy savings achieved by dark mode may be undermined by user behaviour. This pilot study investigates the unintended consequences of dark mode adoption, revealing a rebound effect wherein users may increase display brightness when interacting with dark-themed web pages. This behaviour may negate the potential energy savings that dark mode offers. Our findings suggest that the energy efficiency benefits of dark mode are not as straightforward as commonly believed for display energy, and the interplay between content colour scheme and user behaviour must be carefully considered in sustainability guidelines and interventions.

2026-01-08T10:51:30Z 3 pages (2 + references), 3 figures, 1 table. To be included in the proceedings of the 1st International Workshop on Low Carbon Computing (LOCO) 2024, December 3, 2024, Glasgow/Online Zak Datson http://arxiv.org/abs/2507.10367v4 FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline 2026-01-08T09:19:47Z

Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources in deep learning pipelines. We thus propose FalconFS, a DFS optimized for deep learning pipelines with a stateless-client architecture. Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with a VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72$\times$ throughput for small file read/write and up to 12.81$\times$ throughput for deep learning model training. FalconFS has been running in Huawei autonomous driving system's production environment with 10,000 NPUs for one year and has been open-sourced.

2025-07-14T15:09:01Z Accepted by NSDI'26 Jingwei Xu Junbin Kang Mingkai Dong Mingyu Liu Lu Zhang Shaohong Guo Ziyan Qiu Mingzhen You Ziyi Tian Anqi Yu Tianhong Ding Xinwei Hu Haibo Chen