http://arxiv.org/abs/2512.18155v1 Performance Guarantees for Data Freshness in Resource-Constrained Adversarial IoT Systems 2025-12-20T00:31:24Z

Timely updates are critical for real-time monitoring and control applications powered by the Internet of Things (IoT). As these systems scale, they become increasingly vulnerable to adversarial attacks, where malicious agents interfere with legitimate transmissions to reduce data rates, thereby inflating the age of information (AoI). Existing adversarial AoI models often assume stationary channels and overlook queueing dynamics arising from compromised sensing sources operating under resource constraints. Motivated by the G-queue framework, this paper investigates a two-source M/G/1/1 system in which one source is adversarial and disrupts the update process by injecting negative arrivals according to a Poisson process and inducing i.i.d. service slowdowns, bounded in attack rate and duration. Using moment generating functions, we then derive closed-form expressions for average and peak AoI for an arbitrary number of sources. Moreover, we introduce a worst-case constrained attack model and employ stochastic dominance arguments to establish analytical AoI bounds. Numerical results validate the analysis and highlight the impact of resource-limited adversarial interference under general service time distributions.
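The paper's two-source adversarial M/G/1/1 model is not reproduced here. As a minimal, non-authoritative reference point, the sketch below simulates the simplest special case: a single, non-adversarial source in an M/M/1/1 queue (exponential service, and updates that arrive while the server is busy are discarded). It checks the simulated time-average age against the known closed form 1/λ + 2/μ − 1/(λ+μ) for that case. The function names and the discard policy are assumptions of this illustration.

```python
import random


def simulate_aoi_mm11(lam, mu, horizon, seed=0):
    """Time-average AoI of a single-source M/M/1/1 queue with blocking.

    Assumed model (illustration only): Poisson arrivals at rate lam,
    Exp(mu) service, arrivals during service are discarded, no adversary.
    """
    rng = random.Random(seed)
    t = 0.0         # time at which the server last became idle
    last_gen = 0.0  # generation time of the freshest delivered update
    area = 0.0      # integral of the age process age(s) = s - last_gen
    while True:
        # Memorylessness: the next accepted arrival is Exp(lam) after the server frees up.
        arrival = t + rng.expovariate(lam)
        departure = arrival + rng.expovariate(mu)
        if departure > horizon:
            break
        # Age grows linearly from t until the delivery at `departure`.
        area += ((departure - last_gen) ** 2 - (t - last_gen) ** 2) / 2
        last_gen = arrival  # after delivery the age drops to departure - arrival
        t = departure
    return area / t


def aoi_mm11_closed_form(lam, mu):
    # Known average AoI for M/M/1/1 with blocking.
    return 1 / lam + 2 / mu - 1 / (lam + mu)


print(simulate_aoi_mm11(1.0, 1.0, 200000.0), aoi_mm11_closed_form(1.0, 1.0))
```

Negative arrivals and attack-induced service slowdowns would enter by changing the service and loss dynamics inside the loop; that extension is deliberately left out.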

2025-12-20T00:31:24Z 6 pages, 4 figures, conference paper Aresh Dadlani Muthukrishnan Senthil Kumar Omid Ardakanian Ioanis Nikolaidis http://arxiv.org/abs/2512.17855v1 On General Linearly Implicit Quantized State System Methods 2025-12-19T17:57:18Z

This work proposes a methodology to develop new numerical integration algorithms for ordinary differential equations based on state quantization, generalizing the notions of Linearly Implicit Quantized State Systems (LIQSS) methods. Using this idea, two novel sub-families of algorithms are designed that improve the performance of current LIQSS methods while preserving their properties regarding stability, global error bound and efficient event handling capabilities. The features of the new algorithms are studied in two application examples, where their advantages over classic numerical integration algorithms are also analyzed.
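To make "state quantization" concrete, the sketch below implements the basic explicit first-order QSS method (QSS1) for a scalar ODE. It is not LIQSS and not one of the new sub-families: LIQSS methods instead choose the quantized value using a linearly implicit estimate of the derivative so that stiff systems do not produce spurious oscillations. The function `qss1` and its parameters are illustrative names. For dx/dt = −x the QSS error stays below the quantum `dq`, which is the kind of global error bound the abstract refers to.

```python
import math


def qss1(f, x0, dq, t_end):
    """Integrate dx/dt = f(x) from t = 0 to t_end with first-order QSS (QSS1).

    The derivative is evaluated at the quantized state q, which is held
    constant until the true state x drifts a full quantum dq away from it.
    Returns (x(t_end), number_of_events).
    """
    t, x, q = 0.0, x0, x0
    events = 0
    while True:
        dx = f(q)                                    # piecewise-constant slope
        dt = math.inf if dx == 0 else dq / abs(dx)   # time until |x - q| reaches dq
        if t + dt >= t_end:
            return x + dx * (t_end - t), events
        t += dt
        x += dx * dt                                 # x moves exactly one quantum
        q = x                                        # requantization is a discrete event
        events += 1


x1, n_events = qss1(lambda x: -x, 1.0, 0.01, 1.0)
print(x1, math.exp(-1.0), n_events)
```

Unlike a fixed-step solver, the step size here is set by the state change (one quantum per event), which is why QSS methods handle discontinuities as ordinary events.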

2025-12-19T17:57:18Z Mariana Bergonzi Joaquín Fernández Ernesto Kofman http://arxiv.org/abs/2512.06699v2 Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization 2025-12-19T06:10:50Z

Modern machine learning training is increasingly bottlenecked by data I/O rather than compute. GPUs often run below 50% utilization while waiting for data. This paper presents a machine learning approach to predict I/O performance and recommend optimal storage configurations for ML training pipelines. We collected 141 observations through systematic benchmarking across different storage backends (NVMe SSD, network-attached storage, in-memory filesystems), data formats, and access patterns, covering both low-level I/O operations and full training pipelines. After evaluating seven regression models and three classification approaches, XGBoost achieved the best performance with an R-squared of 0.991, predicting I/O throughput within 11.8% error on average. Feature importance analysis revealed that throughput metrics and batch size are the primary performance drivers. This data-driven approach can reduce configuration time from days of trial-and-error to minutes of predictive recommendation. The methodology is reproducible and extensible to other resource management problems in ML systems. Code and data are available at https://github.com/knkarthik01/gpu_storage_ml_project
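The abstract reports a coefficient of determination (R-squared) and an average prediction error. The standard definitions below show what those two numbers measure. Treating "within 11.8% error on average" as a mean relative (absolute percentage) error is an assumption; the paper may use a different error metric, and the helper names are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mean_relative_error(y_true, y_pred):
    """Mean of |y - y_hat| / |y| (assumed reading of 'error on average')."""
    return sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Toy throughput values (hypothetical, not from the paper's dataset).
print(r_squared([1, 2, 3, 4], [1, 2, 3, 5]), mean_relative_error([1, 2, 3, 4], [1, 2, 3, 5]))
```

A high R-squared and a non-trivial relative error can coexist: R-squared is dominated by large-throughput configurations, while relative error weighs every observation equally.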

2025-12-07T07:25:08Z 20 pages, 10 figures Karthik Prabhakar Durgamadhab Mishra http://arxiv.org/abs/2512.16854v1 An Upper Bound on the M/M/k Queue With Deterministic Setup Times 2025-12-18T18:27:00Z

In many systems, servers do not turn on instantly; instead, a setup time must pass before a server can begin work. These "setup times" can wreak havoc on a system's queueing; this is especially true in modern systems, where servers are regularly turned on and off as a way to reduce operating costs (energy, labor, CO2, etc.). To design modern systems which are both efficient and performant, we need to understand how setup times affect queues. Unfortunately, despite successes in understanding setup in a single-server system, setup in a multiserver system remains poorly understood. To circumvent the main difficulty in analyzing multiserver setup, all existing results assume that setup times are memoryless, i.e. distributed Exponentially. However, in most practical settings, setup times are close to Deterministic, and the widely used Exponential-setup assumption leads to unrealistic model behavior and a dramatic underestimation of the true harm caused by setup times. This paper provides a comprehensive characterization of the average waiting time in a multiserver system with Deterministic setup times, the M/M/k/Setup-Deterministic. In particular, we derive upper and lower bounds on the average waiting time in this system, and show these bounds are within a multiplicative constant of each other. These bounds are the first closed-form characterization of waiting time in any finite-server system with setup times. Further, we demonstrate how to combine our upper and lower bounds to derive a simple and accurate approximation for the average waiting time. These results are all made possible via a new technique for analyzing random time integrals that we named the Method of Intervening Stopping Times, or MIST.
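The paper's upper and lower bounds for the M/M/k/Setup-Deterministic system are not reproduced here. As a hedged point of comparison, the sketch computes the mean waiting time of the plain M/M/k queue with instant-on servers via the standard Erlang C formula, the natural setup-free reference against which the cost of setup times can be judged. The function name is illustrative.

```python
from math import factorial


def mmk_mean_wait(lam, mu, k):
    """Mean queueing delay E[W] of an M/M/k queue with NO setup times (Erlang C)."""
    a = lam / mu   # offered load, in servers
    rho = a / k    # per-server utilization
    if rho >= 1:
        raise ValueError("unstable: need lam < k * mu")
    tail = a ** k / factorial(k) / (1 - rho)
    p_wait = tail / (sum(a ** n / factorial(n) for n in range(k)) + tail)  # P(arrival waits)
    return p_wait / (k * mu - lam)


print(mmk_mean_wait(0.5, 1.0, 1), mmk_mean_wait(1.0, 1.0, 2))
```

For k = 1 this reduces to the M/M/1 result ρ/(μ − λ); modeling setup would require tracking which servers are off, warming up, or busy, which is exactly the multiserver difficulty the abstract describes.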

2025-12-18T18:27:00Z Jalani Williams Weina Wang Mor Harchol-Balter http://arxiv.org/abs/2512.16512v1 XTC, A Research Platform for Optimizing AI Workload Operators 2025-12-18T13:24:44Z

Achieving high efficiency on AI operators demands precise control over computation and data movement. However, existing scheduling languages are locked into specific compiler ecosystems, preventing fair comparison, reuse, and evaluation across frameworks. No unified interface currently decouples scheduling specification from code generation and measurement. We introduce XTC, a platform that unifies scheduling and performance evaluation across compilers. With its common API and reproducible measurement framework, XTC enables portable experimentation and accelerates research on optimization strategies.
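XTC's actual API is not shown in the abstract. The sketch below only illustrates the general idea of keeping a schedule (here, a single tile size) as data separate from the computation, and measuring each schedule in a repeatable way (median of several runs). The dict key `"tile"`, and the functions `matmul` and `measure`, are hypothetical names for this illustration.

```python
import statistics
import time


def matmul(a, b, n, schedule):
    """C = A @ B for n x n lists; the loop tiling comes from a separate schedule dict."""
    t = schedule["tile"]  # hypothetical schedule parameter
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + t, n)):
                            c[i][j] += aik * b[k][j]
    return c


def measure(schedule, a, b, n, repeats=5):
    """Run one schedule several times and report the median wall-clock time."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        c = matmul(a, b, n, schedule)
        times.append(time.perf_counter() - start)
    return c, statistics.median(times)
```

Because the schedule only changes loop structure, every tile size must yield the same result; checking that is a cheap correctness gate before comparing timings.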

2025-12-18T13:24:44Z Pompougnac Hugo Guillon Christophe Noiry Sylvain Dutilleul Alban Iooss Guillaume Rastello Fabrice http://arxiv.org/abs/2512.16056v1 MultiPath Transfer Engine: Breaking GPU and Host-Memory Bandwidth Bottlenecks in LLM Services 2025-12-18T00:45:00Z

The limited bandwidth of PCIe has emerged as the critical bottleneck for large language model (LLM) performance in operations such as prefix cache fetching and model switching. Although intra-server multipath data transfer between GPU and host memory is theoretically possible, heterogeneous protocols such as PCIe and NVLink currently limit the bandwidth between host memory and GPUs to that of a single PCIe link. This limitation results in underutilized intra-server bandwidth. To address this issue, we propose Multipath Memory Access (MMA), a scheme that, to the best of our knowledge, is the first to enable efficient multipath data transfer between GPU and host memory. MMA supports seamless deployment via dynamic library injection, enabling LLM applications to benefit from MMA without requiring any code modification. In our testbed, MMA significantly improves the data transfer bandwidth between the GPU and memory, achieving a peak bandwidth of 245 GB/s, a 4.62x speedup over the native single-path bandwidth. End-to-end evaluations demonstrate that MMA reduces the time-to-first-token (TTFT) for LLM serving by 1.14x to 2.38x and decreases model-switching latency in vLLM's sleep mode by 1.12x to 2.48x.
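Two pieces of arithmetic help read these numbers. First, a 245 GB/s peak at a 4.62x speedup implies a native single-path bandwidth of roughly 245 / 4.62 ≈ 53 GB/s. Second, if a transfer is split across several paths in proportion to their bandwidths, all paths finish at once and the transfer time is size divided by aggregate bandwidth. The proportional split is a textbook illustration, not necessarily how MMA schedules transfers, and the path bandwidths below are hypothetical.

```python
def implied_baseline(peak_gbps, speedup):
    """Single-path bandwidth implied by a peak bandwidth and a reported speedup."""
    return peak_gbps / speedup


def split_by_bandwidth(total_gb, path_bw_gbps):
    """Split a transfer across paths in proportion to bandwidth so all paths finish together.

    Returns (per-path chunk sizes in GB, completion time in seconds).
    """
    agg = sum(path_bw_gbps)
    chunks = [total_gb * bw / agg for bw in path_bw_gbps]
    return chunks, total_gb / agg


# Hypothetical paths: two fast routes and one slower one.
print(round(implied_baseline(245.0, 4.62), 1), split_by_bandwidth(250.0, [50.0, 50.0, 25.0]))
```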

2025-12-18T00:45:00Z Lingfeng Tang Daoping Zhang Junjie Chen Peihao Huang Feng Jin Chengguang Xu Yuxin Chen Feiqiang Sun Guo Chen http://arxiv.org/abs/2512.15834v1 Optimizing Agentic Language Model Inference via Speculative Tool Calls 2025-12-17T18:22:44Z

Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms to provide insights into speculation configurations that will yield the best performance. Further, we recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.
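The core of tool-call speculation can be sketched with `asyncio`: launch a predicted tool call while the model is still decoding, and reuse its result if the actual call matches the prediction; otherwise cancel it and run the real call. This is a generic illustration under stated assumptions, not the paper's system. `run_tool`, `decode_until_tool_call`, and `agent_step` are hypothetical names, and `asyncio.sleep` stands in for both decoding and tool latency.

```python
import asyncio
import time


async def run_tool(name, arg):
    await asyncio.sleep(0.05)  # hypothetical stand-in for tool latency (search, code run, ...)
    return f"{name}({arg})"


async def decode_until_tool_call(actual_call):
    await asyncio.sleep(0.05)  # hypothetical stand-in for decoding the tool-call tokens
    return actual_call


async def agent_step(predicted_call, actual_call):
    """Start the predicted tool call during decoding; reuse it if the guess was right."""
    spec = asyncio.create_task(run_tool(*predicted_call))
    call = await decode_until_tool_call(actual_call)
    if call == predicted_call:
        return await spec, True  # hit: tool latency overlapped with decoding
    spec.cancel()                # miss: discard the speculative work
    return await run_tool(*call), False
```

On a hit the step costs about max(decode, tool) instead of decode + tool; on a miss it costs the same as no speculation plus the wasted speculative work.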

2025-12-17T18:22:44Z Daniel Nichols Prajwal Singhania Charles Jekel Abhinav Bhatele Harshitha Menon http://arxiv.org/abs/2601.11557v1 From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search 2025-12-16T23:24:37Z

Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting in escalating operational cost and inherent trade-offs between latency, throughput, and retrieval accuracy. This paper analyzes the architectural limitations of the dominant "HNSW + float32 + cosine similarity" stack and evaluates existing cost-reduction strategies, including storage disaggregation and lossy vector quantization, which inevitably sacrifice either performance or accuracy. We introduce and empirically evaluate an alternative information-theoretic architecture based on maximally informative binarization (MIB), efficient bitwise distance metrics, and an information-theoretic scoring (ITS) mechanism. Unlike conventional ANN systems, this approach enables exhaustive search over compact binary representations, allowing deterministic retrieval and eliminating accuracy degradation under high query concurrency. Using the MAIR benchmark across 14 datasets and 10,038 queries, we compare this architecture against Elasticsearch, Pinecone, PGVector, and Qdrant. Results demonstrate retrieval quality comparable to full-precision systems, while achieving substantially lower latency and maintaining constant throughput at high request rates. We show that this architectural shift enables a truly serverless, cost-per-query deployment model, challenging the necessity of large in-memory ANN indexes for high-quality semantic search.
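The sketch below shows the mechanics of exhaustive search over binary codes with a bitwise distance: vectors are packed into integers, and Hamming distance is the population count of an XOR. Simple sign thresholding is used here as a stand-in; it is not maximally informative binarization (MIB), and the ITS scoring mechanism is not modeled. All names are illustrative.

```python
def binarize(vec):
    """Pack a float vector into an int: bit i is set iff vec[i] > 0 (sign thresholding)."""
    code = 0
    for i, v in enumerate(vec):
        if v > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Bitwise distance: number of differing bits."""
    return bin(a ^ b).count("1")


def search(query, codes, k):
    """Exhaustive (non-approximate) top-k by Hamming distance; ties broken by index."""
    q = binarize(query)
    scored = sorted((hamming(q, c), idx) for idx, c in enumerate(codes))
    return [idx for _, idx in scored[:k]]


docs = [[0.9, -0.2, 0.4, -0.7], [-0.5, 0.3, -0.1, 0.8], [0.8, -0.1, 0.5, 0.2]]
codes = [binarize(d) for d in docs]
print(search([1.0, -0.3, 0.2, -0.5], codes, 2))
```

Because every code is scanned, the result is deterministic and does not depend on graph traversal order, which is the property the abstract contrasts with HNSW-style ANN indexes.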

2025-12-16T23:24:37Z 16 Pages, 5 Figures, 3 Tables Seyed Moein Abtahi Majid Fekri Tara Khani Akramul Azim http://arxiv.org/abs/2512.14445v1 Performance and Stability of Barrier Mode Parallel Systems with Heterogeneous and Redundant Jobs 2025-12-16T14:31:49Z

In some models of parallel computation, jobs are split into smaller tasks and can be executed completely asynchronously. In other situations the parallel tasks have constraints that require them to synchronize their start and possibly departure times. This is true of many parallelized machine learning workloads, and the popular Apache Spark processing engine has recently added support for Barrier Execution Mode, which allows users to add such barriers to their jobs. These barriers necessarily result in idle periods on some of the workers, which reduces their stability and performance, compared to equivalent workloads with no barriers. In this paper we will consider and analyze the stability and performance penalties resulting from barriers. We include an analysis of the stability of $(s,k,l)$ barrier systems that allow jobs to depart after $l$ out of $k$ of their tasks complete. We also derive and evaluate performance bounds for hybrid barrier systems servicing a mix of jobs, both with and without barriers, and with varying degrees of parallelism. For the purely 1-barrier case we compare the bounds and simulation results to benchmark data from a standalone Spark system. We study the overhead in the real system, and based on its distribution we attribute it to the dual event and polling-driven mechanism used to schedule barrier-mode jobs. We develop a model for this type of overhead and validate it against the real system through simulation.
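A barrier job whose k tasks start together and which departs after l of them complete has a service time equal to the l-th order statistic of its task times. Assuming i.i.d. Exp(μ) task times (an assumption of this sketch, not necessarily the paper's model), the mean is (1/μ) Σ_{i=0}^{l−1} 1/(k − i). The sketch computes that closed form and checks it by simulation; the names are illustrative.

```python
import random


def expected_lth_of_k(k, l, mu):
    """Mean time until l of k parallel Exp(mu) tasks finish (tasks start together)."""
    return sum(1.0 / (mu * (k - i)) for i in range(l))


def simulate_lth_of_k(k, l, mu, samples, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        times = sorted(rng.expovariate(mu) for _ in range(k))
        total += times[l - 1]  # job departs when its l-th task completes
    return total / samples


print(expected_lth_of_k(4, 4, 1.0), simulate_lth_of_k(4, 4, 1.0, 100000))
```

With l = k, each job holds all k workers until its slowest task finishes, so the expected idle worker-time per job is k·E[max] − k/μ. That idleness is the source of the stability and performance penalty the abstract describes.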

2025-12-16T14:31:49Z Brenton Walker Markus Fidler http://arxiv.org/abs/2512.14297v1 A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks 2025-12-16T11:11:37Z

Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial networks. These events violate IEC 61850-derived quality-of-service requirements and user-defined service-level agreements, hindering the reliable and timely delivery of control, monitoring, and best-effort traffic in IEC 61400-25-compliant wind power plants. Failure to maintain these requirements often results in delayed or lost control signals, reduced operational efficiency, and increased risk of wind turbine generator downtime. To address these challenges, this study proposes a threshold-triggered Deep Q-Network self-healing agent that autonomically detects, analyzes, and mitigates network disruptions while adapting routing behavior and resource allocation in real time. The proposed agent was trained, validated, and tested on an emulated tri-clustered switch network deployed in a cloud-based proof-of-concept testbed. Simulation results show that the proposed agent improves disruption recovery performance by 53.84% compared to a baseline shortest-path and load-balanced routing approach and outperforms state-of-the-art methods, including the Adaptive Network-based Fuzzy Inference System by 13.1% and the Deep Q-Network and traffic prediction-based routing optimization method by 21.5%, in a super-spine leaf data-plane architecture. Additionally, the agent maintains switch thermal stability by proactively initiating external rack cooling when required. These findings highlight the potential of deep reinforcement learning in building resilience in software-defined industrial networks deployed in mission-critical, time-sensitive application scenarios.
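The threshold-triggered pattern can be sketched as follows: the agent acts only when a monitored metric crosses a limit, selects an action epsilon-greedily, and updates its value estimate with the Q-learning target r + γ·max Q(s′, ·). A deep Q-network replaces the table with a neural network. The tabular version here is a simplified stand-in, and the metric names, thresholds, states, and actions are all hypothetical.

```python
import random

# Hypothetical SLA and thermal limits; not values from the paper.
THRESHOLDS = {"latency_ms": 10.0, "switch_temp_c": 70.0}


def triggered(metrics):
    """Invoke the agent only when some monitored metric exceeds its threshold."""
    return any(metrics[name] > limit for name, limit in THRESHOLDS.items())


def choose_action(q, state, actions, epsilon, rng):
    """Epsilon-greedy selection over a tabular Q (stand-in for a DQN's output layer)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step toward the target reward + gamma * max_a' Q(next_state, a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

Gating on thresholds keeps the learned policy out of the control loop during normal operation, so inference cost and the risk of needless rerouting are incurred only when a disruption is suspected.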

2025-12-16T11:11:37Z Agrippina Mwangi Utrecht University, The Netherlands León Navarro-Hilfiker Ørsted, USA Lukasz Brewka Ørsted, Denmark Mikkel Gryning Ørsted, Denmark Elena Fumagalli Utrecht University, The Netherlands Madeleine Gibescu Utrecht University, The Netherlands 10.1109/TNSM.2025.3647853 http://arxiv.org/abs/2512.14151v1 Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement 2025-12-16T07:16:10Z

Large Language Models (LLMs), such as GPT and LLaMA, introduce unique memory access characteristics during inference due to frequent token sequence lookups and embedding vector retrievals. These workloads generate highly irregular and bursty access patterns, causing traditional prefetching and replacement policies to mispredict and trigger severe cache pollution, thereby degrading system performance. To address this challenge, this paper proposes an Adaptive Cache Pollution Control (ACPC) mechanism tailored for LLM inference workloads, integrating Temporal Convolutional Network (TCN)-based access prediction with a priority-aware replacement strategy. The TCN module learns temporal dependencies in token access sequences to identify potential high-reuse cache lines, while the replacement policy dynamically adjusts eviction priorities based on predicted reuse likelihood and cache occupancy. The proposed framework is implemented and evaluated on representative transformer-based inference traces, including GPT-style autoregressive decoding and embedding retrieval workloads. Experimental results demonstrate that ACPC reduces cache pollution by 41.7 percent, improves cache hit rate by 8.9 percent, and achieves a 60.0 percent reduction in L2 miss penalty, compared with state-of-the-art machine-learning-based replacement baselines. Additionally, the proposed Temporal CNN-based ACPC framework increases token generation throughput by 15.9 percent and achieves the lowest final loss of 0.21, confirming its superior efficiency and stability under complex LLM inference workloads. These results highlight ACPC's effectiveness in recognizing useful cache lines and mitigating redundant prefetches under dynamic LLM access behaviors. The proposed approach provides a scalable, learning-driven solution for optimizing memory efficiency and latency in large-scale LLM serving and inference systems.
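Priority-aware replacement with pollution control can be illustrated independently of the predictor: each line carries a predicted reuse score, the lowest-scoring line is evicted, and a line predicted to be very unlikely to be reused bypasses a full cache instead of displacing useful data. The predictor here is a plain lookup standing in for the paper's TCN; the class name, `bypass_below` parameter, and scores are hypothetical.

```python
class PriorityCache:
    def __init__(self, capacity, predict_reuse, bypass_below=0.0):
        self.capacity = capacity
        self.predict_reuse = predict_reuse  # hypothetical stand-in for a learned reuse predictor
        self.bypass_below = bypass_below    # do not insert lines scored below this when full
        self.lines = {}                     # address -> predicted reuse score (eviction priority)
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        score = self.predict_reuse(addr)
        if addr in self.lines:
            self.hits += 1
            self.lines[addr] = score        # refresh priority on reuse
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            if score < self.bypass_below:
                return False                # bypass: avoid polluting a full cache
            victim = min(self.lines, key=self.lines.get)
            del self.lines[victim]          # evict the line least likely to be reused
        self.lines[addr] = score
        return False


scores = {"A": 0.9, "B": 0.8, "C": 0.1}    # "C" plays a one-off streaming access
cache = PriorityCache(2, scores.get, bypass_below=0.2)
for addr in "ABCAB":
    cache.access(addr)
print(cache.hits, cache.misses)
```

With bypass enabled, the low-reuse line `C` never displaces `A` or `B`; without it (`bypass_below=0.0`), the same trace loses a hit.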

2025-12-16T07:16:10Z Songze Liu Hongkun Du Shaowen Wang http://arxiv.org/abs/2512.07108v2 Scheduling in Quantum Satellite Networks: Fairness and Performance Optimization 2025-12-15T17:56:43Z

Quantum satellite networks offer a promising solution for achieving long-distance quantum communication by enabling entanglement distribution across global scales. This work formulates and solves the quantum satellite network scheduling problem by optimizing satellite-to-ground station pair assignments under realistic system and environmental constraints. Our framework accounts for limited satellite and ground station resources, fairness, entanglement fidelity thresholds, and real-world non-idealities including atmospheric losses, weather and background noise. In addition, we incorporate the complexities of multi-satellite relays enabled via inter-satellite links. We propose an integer linear programming (ILP) based optimization framework that supports multiple scheduling objectives, allowing us to analyze tradeoffs between maximizing total entanglement distribution rate and ensuring fairness across ground station pairs. Our framework can also be used as a benchmark tool to measure the performance of other potential transmission scheduling policies.
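The rate-versus-fairness tradeoff can be seen on a toy instance without an ILP solver: enumerate one-to-one assignments of satellites to ground-station pairs, drop links below a fidelity threshold, and score each assignment by total rate or by the minimum rate any pair receives. Exhaustive enumeration is a stand-in for the paper's ILP and only scales to tiny instances; relays, weather, and noise are not modeled, and the rates and fidelities are hypothetical.

```python
from itertools import permutations


def schedule(rates, fidelity, f_min, objective):
    """Best one-to-one satellite -> pair assignment by exhaustive search.

    rates[s][p], fidelity[s][p]: per-link entanglement rate and fidelity (hypothetical).
    objective: "sum" maximizes total rate, "maxmin" maximizes the worst-served pair.
    """
    n_sat, n_pair = len(rates), len(rates[0])
    best, best_val = None, float("-inf")
    for assign in permutations(range(n_pair), n_sat):  # satellite s serves pair assign[s]
        served = [0.0] * n_pair
        for s, p in enumerate(assign):
            if fidelity[s][p] >= f_min:                # enforce the fidelity threshold
                served[p] = rates[s][p]
        val = sum(served) if objective == "sum" else min(served)
        if val > best_val:
            best, best_val = assign, val
    return best, best_val


rates = [[10, 2], [3, 1]]
fid = [[0.9, 0.9], [0.9, 0.9]]
print(schedule(rates, fid, 0.8, "sum"), schedule(rates, fid, 0.8, "maxmin"))
```

Here maximizing total rate gives 11 but leaves one pair at rate 1, while the max-min objective accepts a total of 5 to raise the worst pair to 2.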

2025-12-08T02:44:14Z Ashutosh Jayant Dikshit Naga Lakshmi Anipeddi Prajit Dhara Saikat Guha Deirdre Kilbane Leandros Tassiulas Don Towsley Nitish K. Panigrahy http://arxiv.org/abs/2505.08091v2 LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping 2025-12-15T16:10:37Z

We describe LEGO, a new approach to optimizing data movement whereby code is expressed as a layout-independent computation and composed with layouts for data and computation. This code generator organization derives complex indexing expressions associated with hierarchical parallel code and data movement for GPUs. LEGO maps from layout specification to indexing expressions, and can be integrated into existing compilers and code templates. It facilitates the exploration of data layouts in combination with other optimizations. We demonstrate LEGO's integration with the Triton and MLIR compilers, and with CUDA templates. We show that LEGO achieves performance competitive with Triton, and its integrations with CUDA and MLIR demonstrate broad applicability to data and thread layout mapping optimizations.
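The separation between a layout-independent computation and a layout can be illustrated in plain Python: a layout is a function from logical indices (i, j) to a linear offset, and the computation (here a transpose) is written only in logical indices. This is not LEGO's syntax or API. `row_major`, `col_major`, `tiled`, and `transpose` are hypothetical helpers, and `tiled` assumes the tile sizes divide the matrix dimensions.

```python
def row_major(rows, cols):
    return lambda i, j: i * cols + j


def col_major(rows, cols):
    return lambda i, j: j * rows + i


def tiled(rows, cols, tr, tc):
    """Tiles stored row-major, elements row-major within a tile (assumes tr | rows, tc | cols)."""
    tiles_per_row = cols // tc

    def offset(i, j):
        tile = (i // tr) * tiles_per_row + (j // tc)
        return tile * (tr * tc) + (i % tr) * tc + (j % tc)

    return offset


def transpose(src, src_layout, dst_layout, rows, cols):
    """Layout-independent computation: dst[j, i] = src[i, j] in logical indices."""
    dst = [None] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            dst[dst_layout(j, i)] = src[src_layout(i, j)]
    return dst


src = list(range(8))  # a 2 x 4 matrix stored row-major
print(transpose(src, row_major(2, 4), row_major(4, 2), 2, 4))
```

Swapping `row_major(4, 2)` for `col_major(4, 2)` changes only the generated offsets, not the computation; in that case the transposed data ends up with the same memory order as the source.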

2025-05-12T21:53:09Z Amir Mohammad Tavakkoli Cosmin Oancea Mary Hall http://arxiv.org/abs/2306.04531v2 Comparison of SeDuMi and SDPT3 Solvers for Stability of Continuous-time Linear System 2025-12-15T11:08:07Z

SeDuMi and SDPT3 are two solvers for Semi-definite Programming (SDP) or Linear Matrix Inequality (LMI) problems. This paper compares their computational performance on the stability analysis of continuous-time linear systems. The comparison focuses mainly on computation time and memory requirements across problem scales. To implement and compare the two solvers on a set of well-posed problems, we employ YALMIP, a widely used toolbox for modeling and optimization in MATLAB. The primary goal of this study is to provide an empirical assessment of the relative computational efficiency of SeDuMi and SDPT3 under varying problem conditions. Our evaluation indicates that SDPT3 performs much better in large-scale, high-precision calculations.
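The stability question behind these LMI problems is Lyapunov's: ẋ = Ax is asymptotically stable if and only if there exists P ≻ 0 with AᵀP + PA ≺ 0. SDP solvers such as SeDuMi and SDPT3 search for such a P directly. A simpler equivalent check, used here instead of an SDP solver, fixes the right-hand side to −I, solves the linear Lyapunov equation AᵀP + PA = −I, and tests whether P is positive definite. This sketch uses NumPy and is not a comparison of the two solvers; `lyapunov_stable` is an illustrative name.

```python
import numpy as np


def lyapunov_stable(A):
    """Check Hurwitz stability of dx/dt = A x via A^T P + P A = -I and P > 0.

    Raises numpy.linalg.LinAlgError if two eigenvalues of A sum to zero
    (the Lyapunov equation then has no unique solution).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    I = np.eye(n)
    # Column-major vectorization: vec(A^T P + P A) = (I kron A^T + A^T kron I) vec(P)
    M = np.kron(I, A.T) + np.kron(A.T, I)
    p = np.linalg.solve(M, (-I).flatten(order="F"))
    P = p.reshape((n, n), order="F")
    P = (P + P.T) / 2  # symmetrize against round-off
    return bool(np.all(np.linalg.eigvalsh(P) > 0)), P


stable, P = lyapunov_stable([[0.0, 1.0], [-2.0, -3.0]])
print(stable, P)
```

The Kronecker formulation builds an n² × n² system, so it is only practical for small n; that growth is one reason dedicated solvers matter for the large-scale problems the comparison targets.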

2023-06-07T15:40:15Z This version will be incorporated into another work Guangda Xu http://arxiv.org/abs/2512.13176v1 EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC 2025-12-15T10:37:48Z

Resource disaggregation is a promising technique for improving the efficiency of large-scale computing systems. However, this comes at the cost of increased memory access latency due to the need to rely on the network fabric to transfer data between remote nodes. As such, it is crucial to ascertain an application's memory latency sensitivity to minimize the overall performance impact. Existing tools for measuring memory latency sensitivity often rely on custom ad-hoc hardware or cycle-accurate simulators, which can be inflexible and time-consuming. To address this, we present EDAN (Execution DAG Analyzer), a novel performance analysis tool that leverages an application's runtime instruction trace to generate its corresponding execution DAG. This approach allows us to estimate the latency sensitivity of sequential programs and investigate the impact of different hardware configurations. EDAN not only provides us with the capability of calculating the theoretical bounds for performance metrics, but it also helps us gain insight into the memory-level parallelism inherent to HPC applications. We apply EDAN to applications and benchmarks such as PolyBench, HPCG, and LULESH to unveil the characteristics of their intrinsic memory-level parallelism and latency sensitivity.
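The execution-DAG idea can be sketched as follows: each traced instruction becomes a node whose inputs are the instructions it depends on, and, assuming unlimited hardware parallelism, runtime is the longest (critical) path. Repeating that computation at two memory latencies gives a finite-difference latency sensitivity, equal to the number of loads on the critical path once memory latency dominates. The trace format, costs, and function names are hypothetical; EDAN's own DAG construction and metrics may differ.

```python
def critical_path(trace, mem_latency, alu_latency=1):
    """Longest path through an execution DAG built from a dependency-annotated trace.

    trace: list of (kind, deps) in program order; deps index earlier entries,
    so program order is already a topological order. Assumes unlimited parallelism.
    """
    finish = []
    for kind, deps in trace:
        start = max((finish[d] for d in deps), default=0)
        cost = mem_latency if kind == "load" else alu_latency
        finish.append(start + cost)
    return max(finish, default=0)


def latency_sensitivity(trace, lat_lo, lat_hi):
    """Extra runtime per extra cycle of memory latency (finite-difference estimate)."""
    return (critical_path(trace, lat_hi) - critical_path(trace, lat_lo)) / (lat_hi - lat_lo)


# Two independent loads feed an add, whose result addresses a dependent load.
trace = [("load", []), ("load", []), ("add", [0, 1]), ("load", [2])]
print(critical_path(trace, 100), latency_sensitivity(trace, 100, 200))
```

Three loads are issued but only two lie on the critical path, because the first two overlap; that overlap is the memory-level parallelism that makes a program less sensitive to added latency.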

2025-12-15T10:37:48Z Proc. 39th ACM International Conference on Supercomputing, 2025 Siyuan Shen Mikhail Khalilov Lukas Gianinazzi Timo Schneider Marcin Chrapek Jai Dayal Manisha Gajbe Robert Wisniewski Torsten Hoefler 10.1145/3721145.3734530