http://arxiv.org/abs/2511.16682v2
Bench360: Benchmarking Local LLM Inference from 360 Degrees
Updated: 2026-01-14T08:53:59Z
Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best option, underscoring the need for comprehensive, deployment-oriented benchmarks.
Published: 2025-11-12T09:57:21Z
Authors: Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst

http://arxiv.org/abs/2601.08960v1
LookAhead: The Optimal Non-decreasing Index Policy for a Time-Varying Holding Cost problem
Updated: 2026-01-13T20:00:03Z
In practice, the cost of delaying a job can grow as the job waits. Such behavior is modeled by the Time-Varying Holding Cost (TVHC) problem, where each job's instantaneous holding cost increases with its current age (a job's age is the time since it arrived). The goal of the TVHC problem is to find a scheduling policy that minimizes the time-average total holding cost across all jobs.
However, no optimality results are known for the TVHC problem outside of the asymptotic regime. In this paper, we study a simple yet still challenging special case: A two-class M/M/1 queue in which class 1 jobs incur a non-decreasing, time-varying holding cost and class 2 jobs incur a constant holding cost.
Our main contribution is deriving the first optimal (non-decreasing) index policy for this special case of the TVHC problem. Our optimal policy, called LookAhead, stems from the following idea: Rather than considering each job's current holding cost when making scheduling decisions, we should look at their cost some $X$ time into the future, where this $X$ is intuitively called the "lookahead amount." This paper derives that optimal lookahead amount.
Published: 2026-01-13T20:00:03Z
Comment: To be published in Queueing Systems
Authors: Keerthana Gurushankar, Zhouzi Li, Mor Harchol-Balter, Alan Scheller-Wolf

http://arxiv.org/abs/2601.08539v1
Reducing Compute Waste in LLMs through Kernel-Level DVFS
Updated: 2026-01-13T13:26:57Z
The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique used to reduce energy consumption, and thus improve energy-efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of training and inference of Large Language Models (LLMs) through DVFS or power capping is feasible: related work has shown energy savings can be significant, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations: i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level, DVFS approach that explores new frequency configurations, and prove these save more energy than previous, pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach could reduce energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effect of data and tensor parallelism, and show our discovered clock frequencies translate well for both.
We conclude that kernel-level DVFS is a suitable technique to reduce waste in LLM operations, providing significant energy savings with negligible slow-down.
Published: 2026-01-13T13:26:57Z
Authors: Jeffrey Spaan, Kuan-Hsun Chen, Ana-Lucia Varbanescu

http://arxiv.org/abs/2508.07640v2
Taming Cold Starts: Proactive Serverless Scheduling with Model Predictive Control
Updated: 2026-01-13T11:43:04Z
Serverless computing has transformed cloud application deployment by introducing a fine-grained, event-driven execution model that abstracts away infrastructure management. Its on-demand nature makes it especially appealing for latency-sensitive and bursty workloads. However, the cold start problem, i.e., where the platform incurs significant delay when provisioning new containers, remains the Achilles' heel of such platforms.
This paper presents a predictive serverless scheduling framework based on Model Predictive Control to proactively mitigate cold starts, thereby improving end-to-end response time. By forecasting future invocations, the controller jointly optimizes container prewarming and request dispatching, improving latency while minimizing resource overhead.
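To make the receding-horizon idea in the paragraph above concrete, here is a minimal sketch of a prewarming planner. It is not the paper's formulation: the cost weights, the per-interval step limit, and the function name `plan_prewarm` are all hypothetical, the forecast is taken as given, and request dispatching is not modeled. The planner simply enumerates every sequence of container-count changes over the forecast horizon, scores each one, and returns only the first move.

```python
from itertools import product


def plan_prewarm(forecast, warm_now, max_step=2, cold_penalty=10.0, idle_cost=1.0):
    """Return the warm-container target for the next interval.

    forecast     -- predicted invocations per future interval (hypothetical input)
    warm_now     -- number of containers that are warm right now
    max_step     -- assumed limit on containers added or removed per interval
    cold_penalty -- assumed cost of one request that hits a cold start
    idle_cost    -- assumed cost of one warm container that sits unused
    """
    horizon = len(forecast)
    best_cost = float("inf")
    best_first = warm_now
    # Brute-force search over all change sequences; fine for short horizons only.
    for deltas in product(range(-max_step, max_step + 1), repeat=horizon):
        warm = warm_now
        cost = 0.0
        for demand, delta in zip(forecast, deltas):
            warm = max(0, warm + delta)
            cold = max(0, demand - warm)   # requests that would start cold
            idle = max(0, warm - demand)   # warm containers left unused
            cost += cold_penalty * cold + idle_cost * idle
        if cost < best_cost:
            best_cost = cost
            best_first = max(0, warm_now + deltas[0])
    # Model predictive control applies only the first move, then replans.
    return best_first
```

In a control loop, this function would be called once per interval with a fresh forecast, so later moves in the plan are always recomputed from newer predictions. A practical controller would replace the exhaustive search with a proper optimizer.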
We implement our approach on Apache OpenWhisk, deployed on a Kubernetes-based testbed. Experimental results using real-world function traces and synthetic workloads demonstrate that our method significantly outperforms state-of-the-art baselines, achieving up to 85% lower tail latency and a 34% reduction in resource usage.
Published: 2025-08-11T05:45:28Z
Comment: 8 pages, 8 figures, preprint accepted at MASCOTS 2025
Journal reference: 33rd International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication System (MASCOTS 2025)
Authors: Chanh Nguyen, Monowar Bhuyan, Erik Elmroth
DOI: 10.1109/MASCOTS67699.2025.11283271

http://arxiv.org/abs/2601.08374v1
Shifting the Sweet Spot: High-Performance Matrix-Free Method for High-Order Elasticity
Updated: 2026-01-13T09:36:02Z
In high-order finite element analysis for elasticity, matrix-free (PA) methods are a key technology for overcoming the memory bottleneck of traditional Full Assembly (FA). However, existing implementations fail to fully exploit the special structure of modern CPU architectures and tensor-product elements, causing their performance "sweet spot" to anomalously remain at the low order of $p \approx 2$, which severely limits the potential of high-order methods. To address this challenge, we design and implement a highly optimized PA operator within the MFEM framework, deeply integrated with a Geometric Multigrid (GMG) preconditioner. Our multi-level optimization strategy includes replacing the original $O(p^6)$ generic algorithm with an efficient $O(p^4)$ one based on tensor factorization, exploiting Voigt symmetry to reduce redundant computations for the elasticity problem, and employing macro-kernel fusion to enhance data locality and break the memory bandwidth bottleneck. Extensive experiments on mainstream x86 and ARM architectures demonstrate that our method successfully shifts the performance "sweet spot" to the higher-order region of $p \ge 6$.
Compared to the MFEM baseline, the optimized core operator (kernel) achieves speedups of 7x to 83x, which translates to a 3.6x to 16.8x end-to-end performance improvement in the complete solution process. This paper provides a validated and efficient practical path for conducting large-scale, high-order elasticity simulations on mainstream CPU hardware.
Published: 2026-01-13T09:36:02Z
Authors: Dali Chang, Chong Zhang, Kaiqi Zhang, Mingguan Yang, Huiyuan Li, Weiqiang Kong

http://arxiv.org/abs/2503.23988v2
Deep Learning Model Deployment in Multiple Cloud Providers: an Exploratory Study Using Low Computing Power Environments
Updated: 2026-01-11T21:19:16Z
The deployment of Machine Learning models in the cloud has grown among tech companies. Hardware requirements are higher when these models involve Deep Learning techniques, and the cloud providers' costs may be a barrier. We explore deploying Deep Learning models, using for experiments the GECToR model, a Deep Learning solution for Grammatical Error Correction, across three of the major cloud providers (Amazon Web Services, Google Cloud Platform, and Microsoft Azure). We evaluate real-time latency, hardware usage, and cost at each cloud provider in 7 execution environments with 10 experiments reproduced. We found that while Graphics Processing Units (GPUs) excel in performance, they had an average cost 300% higher than solutions without a GPU. Our analysis also suggests that processor cache memory size is a key variable for CPU-only deployments, and setups with sufficient cache achieved a 50% cost reduction compared to GPU-based deployments. This study indicates the feasibility and affordability of cloud-based Deep Learning inference solutions without a GPU, benefiting resource-constrained users such as startups and small research groups.
Published: 2025-03-31T11:58:37Z
Comment: 15 pages, 7 figures
Authors: Elayne Lemos, Rodrigo Oliveira, Jairson Rodrigues, Rosalvo F. Oliveira Neto

http://arxiv.org/abs/2507.14959v3
Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices
Updated: 2026-01-11T20:41:32Z
Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
Published: 2025-07-20T13:39:50Z
Comment: Accepted at the IEEE/CVF winter conference on applications of computer vision (WACV 2026)
Authors: Saeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans Vandierendonck

http://arxiv.org/abs/2601.06886v1
Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM
Updated: 2026-01-11T12:20:48Z
Accurate performance prediction is essential for optimizing scientific applications on modern high-performance computing (HPC) architectures. Widely used performance models primarily focus on cache and memory bandwidth, which is suitable for many memory-bound workloads. However, it is unsuitable for highly arithmetic intensive cases such as the sum-factorization with tensor $n$-mode product kernels, which are an optimization technique for high-order finite element methods (FEM).
On processors with relatively high single instruction multiple data (SIMD) instruction latency, such as the Fujitsu A64FX, the performance of these kernels is strongly influenced by loop-body splitting strategies. Memory-bandwidth-oriented models are therefore not appropriate for evaluating these splitting configurations, and a model that directly reflects instruction-level efficiency is required. To address this need, we develop a dependency-chain-based analytical formulation that links loop-splitting configurations to instruction dependencies in the tensor $n$-mode product kernel. We further use XGBoost to estimate key parameters in the analytical model that are difficult to model explicitly. Evaluations show that the learning-augmented model outperforms the widely used standard Roofline and Execution-Cache-Memory (ECM) models. On the Fujitsu A64FX processor, the learning-augmented model achieves mean absolute percentage errors (MAPE) between 1% and 24% for polynomial orders ($P$) from 1 to 15. In comparison, the standard Roofline and ECM models yield errors of 42%-256% and 5%-117%, respectively. On the Intel Xeon Gold 6230 processor, the learning-augmented model achieves MAPE values from 1% to 13% for $P$=1 to $P$=14, and 24% at $P$=15. In contrast, the standard Roofline and ECM models produce errors of 1%-73% and 8%-112% for $P$=1 to $P$=15, respectively.
Published: 2026-01-11T12:20:48Z
Comment: This work has been submitted to the IEEE for possible publication
Authors: Xuanzhengbo Ren, Yuta Kawai, Tetsuya Hoshino, Hirofumi Tomita, Takahiro Katagiri, Daichi Mukunoki, Seiya Nishizawa
DOI: 10.1109/ACCESS.2026.3675604

http://arxiv.org/abs/2601.06591v1
Modeling Tradeoffs between mobility, cost, and performance in Edge Computing
Updated: 2026-01-10T15:02:52Z
Edge computing provides a cloud-like architecture where small-scale resources are distributed near the network edge, enabling applications on resource-constrained devices to offload latency-critical computations to these resources.
While some recent work showed that the resource constraints of the edge could result in higher end-to-end latency under medium to high utilization due to higher queuing delays, to the best of our knowledge, there has not been any work on modeling the trade-offs of deploying on edge versus cloud infrastructures in the presence of mobility. Understanding the costs and trade-offs of this architecture is important for network designers, as the architecture has now been adopted as part of 5G and beyond networks in the form of Multi-access Edge Computing (MEC). In this paper we focus on quantifying and estimating the cost of edge computing. Using closed-form queuing models, we explore the cost-performance trade-offs in the presence of different system dynamics. We model how workload mobility and workload variations influence these tradeoffs, and validate our results with realistic experiments and simulations. Finally, we discuss the practical implications for designing edge systems and developing algorithms for efficient resource and workload management.
Published: 2026-01-10T15:02:52Z
Authors: Muhammad Danish Waseem, Ahmed Ali-Eldin

http://arxiv.org/abs/2601.01353v2
Benchmarking Quantum Data Center Architectures: A Performance and Scalability Perspective
Updated: 2026-01-10T00:22:02Z
Scalable distributed quantum computing (DQC) has motivated the design of multiple quantum data-center (QDC) architectures that overcome the limitations of single quantum processors through modular interconnection. While these architectures adopt fundamentally different design philosophies, their relative performance under realistic quantum hardware constraints remains poorly understood.
In this paper, we present a systematic benchmarking study of four representative QDC architectures (QFly, BCube, Clos, and Fat-Tree), quantifying their impact on distributed quantum circuit execution latency, resource contention, and scalability. Focusing on quantum-specific effects absent from classical data-center evaluations, we analyze how optical-loss-induced Einstein-Podolsky-Rosen (EPR) pair generation delays, coherence-limited entanglement retry windows, and contention from teleportation-based non-local gates shape end-to-end execution performance. Across diverse circuit workloads, we evaluate how architectural properties such as path diversity, path length, and shared Bell State Measurement (BSM) resources interact with optical-switch insertion loss and reconfiguration delay. Our results show that distributed quantum performance is jointly shaped by topology, scheduling policies, and physical-layer parameters, and that these factors interact in nontrivial ways. Together, these insights provide quantitative guidance for the design of scalable and high-performance quantum data-center architectures for DQC.
Published: 2026-01-04T03:48:02Z
Authors: Shahrooz Pouryousef, Eneet Kaur, Hassan Shapourian, Don Towsley, Ramana Kompella, Reza Nejabati

http://arxiv.org/abs/2601.06349v1
Fixing ill-formed UTF-16 strings with SIMD instructions
Updated: 2026-01-09T23:09:42Z
UTF-16 is a widely used Unicode encoding representing characters with one or two 16-bit code units. The format relies on surrogate pairs to encode characters beyond the Basic Multilingual Plane, requiring a high surrogate followed by a low surrogate. Ill-formed UTF-16 strings -- where surrogates are mismatched -- can arise from data corruption or improper encoding, posing security and reliability risks. Consequently, programming languages such as JavaScript include functions to fix ill-formed UTF-16 strings by replacing mismatched surrogates with the Unicode replacement character (U+FFFD).
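The repair rule described above can be written as a short scalar reference. This sketch is not the SIMD method the abstract proposes and is not taken from V8; it only illustrates the rule itself, processing one code unit at a time. The function name `to_well_formed` and the representation of a string as a list of 16-bit integers are assumptions made for illustration.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character


def is_high_surrogate(unit):
    """True for code units in the high-surrogate range D800..DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True for code units in the low-surrogate range DC00..DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def to_well_formed(units):
    """Return a copy of `units` with every unpaired surrogate replaced by U+FFFD."""
    out = list(units)
    i = 0
    n = len(out)
    while i < n:
        unit = out[i]
        if is_high_surrogate(unit):
            if i + 1 < n and is_low_surrogate(out[i + 1]):
                i += 2  # valid pair: keep both units and skip past them
                continue
            out[i] = REPLACEMENT  # high surrogate without a following low one
        elif is_low_surrogate(unit):
            out[i] = REPLACEMENT  # low surrogate not preceded by a high one
        i += 1
    return out
```

The output always has the same length as the input, because each bad unit is replaced in place rather than removed. A reference like this is mainly useful as a correctness oracle when testing a faster vectorized version.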
We propose using Single Instruction, Multiple Data (SIMD) instructions to handle multiple code units in parallel, enabling faster and more efficient execution. Our software is part of the Google JavaScript engine (V8) and thus part of several major Web browsers.
Published: 2026-01-09T23:09:42Z
Authors: Robert Clausecker, Daniel Lemire

http://arxiv.org/abs/2601.05205v1
EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI
Updated: 2026-01-08T18:31:11Z
Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks.
These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.
Published: 2026-01-08T18:31:11Z
Comment: 6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026]
Authors: Zain Iqbal, Lorenzo Valerio

http://arxiv.org/abs/2601.04904v1
Parallel Quadratic Selected Inversion in Quantum Transport Simulation
Updated: 2026-01-08T13:03:56Z
Driven by Moore's Law, the dimensions of transistors have been pushed down to the nanometer scale. Advanced quantum transport (QT) solvers are required to accurately simulate such nano-devices. The non-equilibrium Green's function (NEGF) formalism lends itself optimally to these tasks, but it is computationally very intensive, involving the selected inversion (SI) of matrices and the selected solution of quadratic matrix (SQ) equations. Existing algorithms to tackle these numerical problems are ideally suited to GPU acceleration, e.g., the so-called recursive Green's function (RGF) technique, but they are typically sequential, require block-tridiagonal (BT) matrices as inputs, and their implementation has been so far restricted to shared memory parallelism, thus limiting the achievable device sizes. To address these shortcomings, we introduce distributed methods that build on RGF and enable parallel selected inversion and selected solution of the quadratic matrix equation. We further extend them to handle BT matrices with arrowhead, which allows for the investigation of multi-terminal transistor structures. We evaluate the performance of our approach on a real dataset from the QT simulation of a nano-ribbon transistor and compare it with the sparse direct package PARDISO. When scaling to 16 GPUs, our fused SI and SQ solver is 5.2x faster than the SI module of PARDISO applied to a device 16x shorter.
These results highlight the potential of our method to accelerate NEGF-based nano-device simulations.
Published: 2026-01-08T13:03:56Z
Comment: 12 pages, 9 figures
Authors: Vincent Maillou, Matthias Bollhofer, Olaf Schenk, Alexandros Nikolaos Ziogas, Mathieu Luisier

http://arxiv.org/abs/2602.17670v1
The Dark Side of Dark Mode -- User behaviour rebound effects and consequences for digital energy consumption
Updated: 2026-01-08T10:51:30Z
User devices are the largest contributor to media-related global emissions. For web content, dark mode has been widely recommended as an energy-saving measure for certain display types. However, the energy savings achieved by dark mode may be undermined by user behaviour. This pilot study investigates the unintended consequences of dark mode adoption, revealing a rebound effect wherein users may increase display brightness when interacting with dark-themed web pages. This behaviour may negate the potential energy savings that dark mode offers. Our findings suggest that the energy efficiency benefits of dark mode are not as straightforward as commonly believed for display energy, and the interplay between content colour scheme and user behaviour must be carefully considered in sustainability guidelines and interventions.
Published: 2026-01-08T10:51:30Z
Comment: 3 pages (2 + references), 3 figures, 1 table. To be included in the proceedings of the 1st International Workshop on Low Carbon Computing (LOCO) 2024, December 3, 2024, Glasgow/Online
Authors: Zak Datson

http://arxiv.org/abs/2507.10367v4
FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline
Updated: 2026-01-08T09:19:47Z
Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources in the deep learning pipelines. We thus propose FalconFS, a DFS optimized for deep learning pipelines with the stateless-client architecture.
Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72$\times$ throughput for small file read/write and up to 12.81$\times$ throughput for deep learning model training. FalconFS has been running in Huawei autonomous driving system's production environment with 10,000 NPUs for one year and has been open-sourced.
Published: 2025-07-14T15:09:01Z
Comment: Accepted by NSDI'26
Authors: Jingwei Xu, Junbin Kang, Mingkai Dong, Mingyu Liu, Lu Zhang, Shaohong Guo, Ziyan Qiu, Mingzhen You, Ziyi Tian, Anqi Yu, Tianhong Ding, Xinwei Hu, Haibo Chen