Visual Insights into Agentic Optimization of Pervasive Stream Processing Services

2026-02-19T11:37:52Z

Processing sensory data close to the data source, often involving Edge devices, promises low latency for pervasive applications, like smart cities. This commonly involves a multitude of processing services, executed with limited resources; this setup faces three problems: first, the application demand and the resource availability fluctuate, so the service execution must scale dynamically to sustain processing requirements (e.g., latency); second, each service permits different actions to adjust its operation, so they require individual scaling policies; third, without a higher-level mediator, services would cannibalize any resources of services co-located on the same device. This demo first presents a platform for context-aware autoscaling of stream processing services that allows developers to monitor and adjust the service execution across multiple service-specific parameters. We then connect a scaling agent to these interfaces that gradually builds an understanding of the processing environment by exploring each service's action space; the agent then optimizes the service execution according to this knowledge. Participants can revisit the demo contents as video summary and introductory poster, or build a custom agent by extending the artifact repository.

GDEV-AI: A Generalized Evaluation of Deep Learning Inference Scaling and Architectural Saturation

2026-02-18T20:34:45Z

The deployment of deep learning inference in production environments continues to grow, where throughput, latency, and hardware efficiency are critical. Although specialized accelerators are increasingly adopted, many inference workloads still run on CPU-only systems, particularly in legacy data centers and cost-sensitive environments. This study investigates the scalability limits of CPU-based inference for convolutional neural networks by benchmarking ResNet models across varying batch sizes on two hardware tiers: a legacy Intel Xeon E5-2403 v2 processor and a modern Intel Xeon 6 "Granite Rapids" platform. Results show that legacy CPUs quickly reach throughput saturation, with limited scaling beyond small batch sizes due to instruction-level and memory constraints. In contrast, the Granite Rapids system leverages Intel Advanced Matrix Extensions (AMX) to achieve substantially higher throughput. However, oversubscription beyond physical core limits introduces execution contention and tail-latency amplification, revealing a performance degradation regime in modern architectures. We introduce GDEV-AI, a reproducible benchmarking framework for analyzing scalability behavior and architectural saturation in CPU-based inference. By establishing a vendor-neutral baseline, this work provides empirical insight into performance bottlenecks and informs capacity planning in heterogeneous data center environments.

A Study on Inference Latency for Vision Transformers on Mobile Devices

2026-02-18T19:19:56Z

Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

2026-02-18T19:12:42Z

Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pixel~5 smartphone).

Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering

2026-02-17T19:29:16Z

The phase ordering problem has been a long-standing challenge since the late 1970s, yet it remains an open problem due to having a vast optimization space and an unbounded nature, making it an open-ended problem without a finite solution, one can limit the scope by reducing the number and the length of optimizations. Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes. In the past 20 years, Machine Learning has been employed to construct performance models to improve the selection and ordering of compiler optimizations, however, the approaches are not baked into the compiler seamlessly and never materialized to be leveraged at a fine-grained scope of code segments. This paper presents Protean Compiler: An agile framework to enable LLVM with built-in phase-ordering capabilities at a fine-grained scope. The framework also comprises a complete library of more than 140 handcrafted static feature collection methods at varying scopes, and the experimental results showcase speedup gains of up to 4.1% on average and up to 15.7% on select Cbench applications wrt LLVM's O3 by just incurring a few extra seconds of build time on Cbench. Additionally, Protean compiler allows for an easy integration with third-party ML frameworks and other Large Language Models, and this two-step optimization shows a gain of 10.1% and 8.5% speedup wrt O3 on Cbench's Susan and Jpeg applications. Protean compiler is seamlessly integrated into LLVM and can be used as a new, enhanced, full-fledged compiler. We plan to release the project to the open-source community in the near future.

Energy Concerns with HPC Systems and Applications

2026-02-17T15:30:49Z

For various reasons including those related to climate changes, {\em energy} has become a critical concern in all relevant activities and technical designs. For the specific case of computer activities, the problem is exacerbated with the emergence and pervasiveness of the so called {\em intelligent devices}. From the application side, we point out the special topic of {\em Artificial Intelligence}, who clearly needs an efficient computing support in order to succeed in its purpose of being a {\em ubiquitous assistant}. There are mainly two contexts where {\em energy} is one of the top priority concerns: {\em embedded computing} and {\em supercomputing}. For the former, power consumption is critical because the amount of energy that is available for the devices is limited. For the latter, the heat dissipated is a serious source of failure and the financial cost related to energy is likely to be a significant part of the maintenance budget. On a single computer, the problem is commonly considered through the electrical power consumption. This paper, written in the form of a survey, we depict the landscape of energy concerns in computer activities, both from the hardware and the software standpoints.

Embedding Retrofitting: Data Engineering for better RAG

2026-02-17T13:39:59Z

Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, leading to creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation ($-3.5\%$ to $-5.2\%$, $p<0.05$). After preprocessing, \acrshort{ewma} retrofitting achieves $+6.2\%$ improvement ($p=0.0348$) with benefits concentrated in quantitative synthesis questions ($+33.8\%$ average). The gap between clean and noisy preprocessing (10\%+ swing) exceeds the gap between algorithms (3\%), establishing preprocessing quality as the primary determinant of retrofitting success.

TQml Simulator: optimized simulation of quantum machine learning

2026-02-16T23:21:57Z

Hardware-efficient circuits employed in Quantum Machine Learning are typically composed of alternating layers of uniformly applied gates. High-speed numerical simulators for such circuits are crucial for advancing research in this field. In this work, we numerically benchmark universal and gate-specific techniques for simulating the action of layers of gates on quantum state vectors, aiming to accelerate the overall simulation of Quantum Machine Learning algorithms. Our analysis shows that the optimal simulation method for a given layer of gates depends on the number of qubits involved, and that a tailored combination of techniques can yield substantial performance gains in the forward and backward passes for a given circuit. Building on these insights, we developed a numerical simulator, named TQml Simulator, that employs the most efficient simulation method for each layer in a given circuit. We evaluated TQml Simulator on circuits constructed from standard gate sets, such as rotations and CNOTs, as well as on native gates from IonQ and IBM quantum processing units. In most cases, our simulator outperforms equivalent Pennylane's default.qubit simulator by up to a factor of 10, depending on the circuit, the number of qubits, the batch size of the input data, and the hardware used.

Decomposing Docker Container Startup Performance: A Three-Tier Measurement Study on Heterogeneous Infrastructure

2026-02-16T21:45:13Z

Container startup latency is a critical performance metric for CI/CD pipelines, serverless computing, and auto-scaling systems, yet practitioners lack empirical guidance on how infrastructure choices affect this latency. We present a systematic measurement study that decomposes Docker container startup into constituent operations across three heterogeneous infrastructure tiers: Azure Premium SSD (cloud SSD), Azure Standard HDD (cloud HDD), and macOS Docker Desktop (developer workstation with hypervisor-based virtualization). Using a reproducible benchmark suite that executes 50 iterations per test across 10 performance dimensions, we quantify previously under-characterized relationships between infrastructure configuration and container runtime behavior. Our key findings include: (1) container startup is dominated by runtime overhead rather than image size, with only 2.5% startup variation across images ranging from 5 MB to 155 MB on SSD; (2) storage tier selection imposes a 2.04x startup penalty (HDD 1157 ms vs. SSD 568 ms); (3) Docker Desktop's hypervisor layer introduces a 2.69x startup penalty and 9.5x higher CPU throttling variance compared to native Linux; (4) OverlayFS write performance collapses by up to two orders of magnitude compared to volume mounts on SSD-backed storage; and (5) Linux namespace creation contributes only 8-10 ms (<1.5%) of total startup time. All measurement scripts, raw data, and analysis tools are publicly available.

Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models

2026-02-15T17:06:02Z

Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.

Efficient Data-Driven Production Scheduling in Pharmaceutical Manufacturing

2026-02-14T08:31:22Z

This paper develops a data-driven, constraint-based optimization framework for a complex industrial job shop scheduling problem variant in pharmaceutical manufacturing. The formulation captures fixed routings and designated machines, explicit resource calendars with weekends and planned maintenance, and campaign sequencing through sequence-dependent cleaning times derived from site tables. The model is implemented with an open source constraint solver and evaluated on deterministic snapshots from a solid oral dosage facility under three objective formulations: makespan, makespan plus total tardiness, and makespan plus average tardiness. On three industrial instances of increasing size (10, 30, and 84 jobs) the proposed schedules dominate reference plans that solve a simplified variant without the added site rules. Makespan reductions reach $88.1\%$, $77.6\%$, and $54.9\%$ and total tardiness reductions reach $72.1\%$, $58.7\%$, and $18.2\%$, respectively. The composite objectives further decrease late job counts with negligible makespan change on the smaller instances and a modest increase on the largest instance. Optimality is proven on the small case, with relative gaps of $0.77\%$ and $14.92\%$ on the medium and large cases under a fixed time limit. The results show that a compact constraint programming formulation can deliver feasible, transparent schedules that respect site rules while improving adherence to due dates on real industrial data.

Benchmarking quantum computers

2026-02-13T22:41:10Z

The rapid pace of development in quantum computing technology has sparked a proliferation of benchmarks for assessing the performance of quantum computing hardware and software. Good benchmarks empower scientists, engineers, programmers, and users to understand a computing system's power, but bad benchmarks can misdirect research and inhibit progress. In this Perspective, we survey the science of quantum computer benchmarking. We discuss the role of benchmarks and benchmarking, and how good benchmarks can drive and measure progress towards the long-term goal of useful quantum computations, i.e., "quantum utility". We explain how different kinds of benchmark quantify the performance of different parts of a quantum computer, survey existing benchmarks, examine recent trends in benchmarking, and highlight important open research questions in this field.

On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

2026-02-13T13:26:51Z

The increase in computation and storage has led to a significant growth in the scale of systems powering applications and services, raising concerns about sustainability and operational costs. In this paper, we explore power-saving techniques in high-performance computing (HPC) and datacenter networks, and their relation with performance degradation. From this premise, we propose leveraging Energy Efficient Ethernet (EEE) protocol, with the flexibility to extend to conventional Ethernet or upcoming Ethernet-derived interconnect versions of BXI and Omnipath. We analyze the PerfBound power-saving mechanism, identifying possible improvements and modeling it into a simulation framework. Through different experiments, we examine its impact on performance and determine the most appropriate interconnect. We also study traffic patterns generated by selected HPC and machine learning applications to evaluate the behavior of power-saving techniques. From these experiments, we provide an analysis of how applications affect system and network energy consumption. Based on this, we disclose the weakness of dynamic power-down mechanisms and propose an approach that improves energy reduction with minimal or no performance penalty. To the best of our knowledge, this work presents the first thorough analysis of PerfBound and an enhancement to the technique, while also targeting emerging post-exascale networks.

Characterize LSM-tree Compaction Performance via On-Device LLM Inference

2026-02-13T07:02:28Z

Modern key-value storage engines built on Log-Structured Merge-trees (LSM-trees), such as RocksDB and LevelDB, rely heavily on the performance of their compaction operations, which are impacted by a complex set of interdependent configuration parameters. Manually tuning these parameters for optimal performance demands considerable expertise, while traditional auto-tuning approaches struggle with the enormous search space and low sample efficiency inherent to this domain. In recent years, Large Language Models (LLMs) have demonstrated strong capabilities in code generation and logical reasoning, offering new possibilities for system optimization. However, applying LLMs to real-time compaction tuning in such latency-sensitive environments is a double-edged sword. While large-scale LLMs can offer superior reasoning for strategy generation, their high inference latency and computational cost make them impractical for interactive, low-latency tuning. In contrast, small-scale LLMs achieve low latency but often at the expense of reasoning accuracy and tuning effectiveness. In this paper, we first evaluate this trade-off by analyzing the compaction-tuning performance and inference latency of LLMs at different scales in an LSM-tree-based tuning case. We then characterize the performance of LSM-tree on RocksDB v8.8.1, with a focus on adjusting the key compaction-related parameters under db_bench workloads. Our experimental results show a clear positive correlation between model capability and tuning effectiveness.

Impact of Data-Oriented and Object-Oriented Design on Performance and Cache Utilization with Artificial Intelligence Algorithms in Multi-Threaded CPUs

2026-02-13T00:12:14Z

The growing performance gap between multi-core CPUs and main memory necessitates hardware-aware software design paradigms. This study provides a comprehensive performance analysis of Data Oriented Design (DOD) versus the traditional Object-Oriented Design (OOD), focusing on cache utilization and efficiency in multi-threaded environments. We developed and compared four distinct versions of the A* search algorithm: single-threaded OOD (ST-OOD), single-threaded DOD (ST-DOD), multi-threaded OOD (MT-OOD), and multi-threaded DOD (MT-DOD). The evaluation was based on metrics including execution time, memory usage, and CPU cache misses. In multi-threaded tests, the DOD implementation demonstrated considerable performance gains, with faster execution times and a lower number of raw system calls and cache misses. While OOD occasionally showed marginal advantages in memory usage or percentage-based cache miss rates, DOD's efficiency in data-intensive operations was more evident. Furthermore, our findings reveal that for a fine-grained task like the A* algorithm, the overhead associated with thread management led to single-threaded versions significantly outperforming their multi-threaded counterparts in both paradigms. We conclude that even when performance differences appear subtle in simple algorithms, the consistent advantages of DOD in critical metrics highlight its foundational architectural superiority, suggesting it is a more effective approach for maximizing hardware efficiency in complex, large-scale AI and parallel computing tasks.