http://arxiv.org/abs/2603.23640v1 LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load 2026-03-24T18:28:38Z Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone. 2026-03-24T18:28:38Z 14 pages, 5 figures, 10 tables Pranay Tummalapalli Sahil Arayakandy Ritam Pal Kautuk Kundan http://arxiv.org/abs/2603.23458v1 SNARE: A TRAP for Rational Players to Solve Byzantine Consensus in the 5f+1 Model 2026-03-24T17:29:52Z The TRAP protocol solves rational agreement by combining accountable consensus with a one-shot BFTCR finalization phase. 
We present SNARE (Scalable Nash Agreement via Reward and Exclusion), the adaptation of TRAP to $n=5f{+}1$, and prove $ε$-$(k,t)$-robustness for rational agreement tolerating coalitions up to ${\approx}73\%$ with deposits under $0.5\%$ of the gain. A central finding is that appending a single all-to-all broadcast round with the $4f{+}1$ threshold after predecisions yields $ε$-$(k,t)$-robustness for coalitions up to $3f$ (${\approx}60\%$) without any deposit: we need not model or know the utility function of deviating players, only that they participate in the protocol. These players can be \emph{deceitful} (arbitrary unknown utility), not just rational, and the finalization structure prevents disagreement regardless of their motivation. This observation is protocol-agnostic, applies to any $5f{+}1$ protocol at the cost of one message delay that runs concurrently with the next view, and does not require commit-reveal mechanisms. Above $60\%$, the full baiting mechanism with deposits under $0.5\%$ extends tolerance to ${\approx}73\%$. A second finding is that valid-candidacy, the property preventing reward front-running, holds unconditionally regardless of the quorum threshold, removing both the $n>2(k{+}t)$ and $n>\frac{3}{2}k{+}3t$ constraints from the original TRAP. This retroactively extends the $3f{+}1$ bound from $C<n/2$ to $C<5n/9$. The binding constraint in both models is the winner consensus operating on $2f$ residual players after excluding $3f{+}1$ detected equivocators. We explore avenues for relaxing this limit. 2026-03-24T17:29:52Z WIP Alejandro Ranchal-Pedrosa Benjamin Marsh http://arxiv.org/abs/2603.28795v1 StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving 2026-03-24T17:19:26Z We address LLM serving workloads where repeated requests share a common solution structure but differ in localized constraints, such as output schema, variable names, or numeric constants. 
Prior caching approaches typically reuse either full responses (semantic caching) or model-internal KV/prefix states, which are respectively brittle under partial changes or tightly coupled to specific backends. We present StepCache, a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, verifies steps using lightweight task-aware checks, and regenerates only failing regions via selective patching. StepCache additionally supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, as well as conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction via a bounded repair loop with a deterministic fallback that guarantees correctness when the backend model fails. In a CPU-only perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse. 2026-03-24T17:19:26Z 9 pages, 1 figure Azam Nouri http://arxiv.org/abs/2603.21444v2 Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects 2026-03-24T15:54:50Z The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. The unstructured sparsity makes achieving high performance challenging because it limits both memory efficiency and scalability. 
In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, resulting in unnecessary data movement and synchronization and failing to fully exploit the fast intra-node interconnect. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. Here, we evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM with a corresponding geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies. 2026-03-22T23:18:49Z 2026 International Conference on Supercomputing (ICS '26), July 06--09, 2026, Belfast, United Kingdom Julian Bellavita Lorenzo Pichetti Thomas Pasquali Flavio Vella Giulia Guidi 10.1145/3797905.3800543 http://arxiv.org/abs/2411.03231v3 LOGSAFE: Logic-Guided Verification for Trustworthy Federated Time-Series Learning 2026-03-24T15:34:16Z This paper introduces LOGSAFE, a defense mechanism for federated learning in time series settings, particularly within cyber-physical systems. 
It addresses poisoning attacks by moving beyond traditional update-similarity methods and instead using logical reasoning to evaluate client reliability. LOGSAFE extracts client-specific temporal properties, infers global patterns, and verifies clients against them to detect and exclude malicious participants. Experiments show that it significantly outperforms existing methods, achieving up to 93.27% error reduction over the next best baseline. Our code is available at https://github.com/judydnguyen/LOGSAFE-Robust-FTS. 2024-11-05T16:23:19Z 17th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS) Dung Thuy Nguyen Ziyan An Taylor T. Johnson Meiyi Ma Kevin Leach http://arxiv.org/abs/2603.23329v1 Communication-Aware Diffusion Load Balancing for Persistently Interacting Objects 2026-03-24T15:31:17Z Parallel applications with irregular and time-varying workloads often suffer from load imbalance. Dynamic load balancing techniques address this challenge by redistributing work during execution. We present a new type of distributed diffusion-based load balancing targeted at communication-intensive applications with persistently communicating objects. Leveraging the application's communication graph, our strategy reduces across-node communication while simultaneously distributing load effectively. We also propose an algorithmic variant for cases where the communication patterns are not readily available. We explore optimizations to our algorithm, and comparisons with other related load balancing strategies in simulation and on a Particle-in-Cell benchmark on up to 8 nodes of Perlmutter at NERSC. 2026-03-24T15:31:17Z 8 pages, 6 figures. To appear in the Proceedings of PDSEC 2026 (workshop of the IEEE IPDPS 2026) Maya Taylor Kavitha Chandrasekar Laxmikant V. 
Kale http://arxiv.org/abs/2502.10001v3 EmbBERT: Attention Under 2 MB Memory 2026-03-24T15:21:45Z Transformer architectures based on the attention mechanism have revolutionized natural language processing (NLP), driving major breakthroughs across virtually every NLP task. However, their substantial memory and computational requirements still hinder deployment on ultra-constrained devices such as wearables and Internet-of-Things (IoT) units, where available memory is limited to just a few megabytes. To address this challenge, we introduce EmbBERT, a tiny language model (TLM) architecturally designed for extreme efficiency. The model integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism that together enable optimal performance under strict memory budgets. Through this redesign for the extreme edge, we demonstrate that highly simplified transformer architectures remain remarkably effective under tight resource constraints. EmbBERT requires only 2 MB of total memory, and achieves accuracy comparable to that of state-of-the-art (SotA) models that require a $\mathbf{10\times}$ larger memory budget. Extensive experiments on the curated TinyNLP benchmark and the GLUE suite confirm that EmbBERT achieves competitive accuracy, comparable to that of larger SotA models, and consistently outperforms downsized versions of BERT and MAMBA of similar size. Furthermore, we demonstrate the model's resilience to 8-bit quantization, which further reduces memory usage to just 781 kB, and the scalability of the EmbBERT architecture across the sub-megabyte to tens-of-megabytes range. Finally, we perform an ablation study demonstrating the positive contributions of all components and the pre-training procedure. All code, scripts, and checkpoints are publicly released to ensure reproducibility: https://github.com/RiccardoBravin/tiny-LLM. 
2025-02-14T08:33:31Z 24 pages, 4 figures, 14 tables Neural Networks, Volume 200, 2026, 108800, ISSN 0893-6080, https://www.sciencedirect.com/science/article/pii/S0893608026002625 Riccardo Bravin Massimo Pavan Hazem Hesham Yousef Shalby Fabrizio Pittorino Manuel Roveri 10.1016/j.neunet.2026.108800 http://arxiv.org/abs/2512.13591v2 astroCAMP: A Community Benchmark and Co-Design Framework for Sustainable SKA-Scale Radio Imaging 2026-03-24T14:23:39Z The Square Kilometre Array (SKA) will operate one of the world's largest continuous scientific data systems, sustaining petascale imaging under strict power envelopes. Current radio-interferometric pipelines typically achieve only 4--14\% of hardware peak utilization due to memory and I/O bottlenecks, incurring high energy, operational, and carbon costs, further compounded by the absence of standardised cross-layer metrics and fidelity tolerances for principled hardware--software co-design. We present astroCAMP, a reproducible benchmarking and co-design framework for SKA-scale imaging, contributing: (1) a unified metric suite spanning performance, utilisation, memory/data-movement, sustainability, economics, and scientific fidelity; (2) standardised SKA-representative datasets and benchmark configurations for reproducible cross-platform evaluation; (3) a multi-objective co-design formulation linking quality constraints to time-, energy-, carbon-, and cost-to-solution; and (4) a design-space exploration workflow to derive Pareto-optimal operating regions. We evaluate WSClean+IDG on an AMD EPYC 9334 CPU and NVIDIA H100 GPU, revealing orchestration and synchronization bottlenecks despite efficient kernels, limited CPU strong scaling, and location-dependent carbon/cost efficiency. We illustrate astroCAMP for heterogeneous CPU--FPGA exploration and call on the SKA community to define quantifiable fidelity thresholds to accelerate principled optimisation for SKA-scale imaging. 
2025-12-15T17:47:28Z 13 pages, 16 figures Denisa-Andreea Constantinescu Rubén Rodríguez Álvarez Jacques Morin Etienne Orliac Mickaël Dardaillon Sunrise Wang Hugo Miomandre Miguel Peón-Quirós Jean-François Nezan David Atienza http://arxiv.org/abs/2403.16125v2 Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design 2026-03-24T13:17:41Z Efficiently training large-scale models (LMs) in GPU clusters involves two separate avenues: inter-job dynamic scheduling and intra-job adaptive parallelism (AP). However, existing dynamic schedulers struggle with large-model scheduling due to the mismatch between static parallelism (SP)-aware scheduling and AP-based execution, leading to cluster inefficiencies such as degraded throughput and prolonged job queuing. This paper presents Arena, a large-model training system that co-designs dynamic scheduling and adaptive parallelism to achieve high cluster efficiency. To reduce scheduling costs while improving decision quality, Arena designs low-cost, disaggregated profiling and AP-tailored, load-aware performance estimation, while unifying them by sharding the joint scheduling-parallelism optimization space via a grid abstraction. Building on this, Arena dynamically schedules profiled jobs in elasticity and heterogeneity dimensions, and executes them using efficient AP with pruned search space. Evaluated on heterogeneous testbeds and production workloads, Arena reduces job completion time by up to $49.3\%$ and improves cluster throughput by up to $1.60\times$. 
2024-03-24T12:43:04Z Chunyu Xue Weihao Cui Quan Chen Chen Chen Han Zhao Shulai Zhang Linmei Wang Yan Li Limin Xiao Weifeng Zhang Jing Yang Bingsheng He Minyi Guo 10.1145/3767295.3803571 http://arxiv.org/abs/2604.03279v1 Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S 2026-03-24T13:02:58Z Text-to-Speech (TTS) models are significantly more numerically fragile than Large Language Models (LLMs) due to their continuous waveform generation and perceptual sensitivity to small numerical perturbations. While aggressive precision reduction techniques such as BlockFloat8 (BFP8) and low-fidelity (LoFi) compute have been widely adopted in language models, applying similar strategies to TTS systems often results in audible artifacts, phase instability, and spectral distortion. In this work, we present Lightning V2, a production-grade TTS model co-optimized for Tenstorrent hardware. Through precision-aware architectural design and hardware-software co-optimization, we achieve over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable degradation in audio quality. Leveraging Tenstorrent's Network-on-Chip (NoC), distributed SRAM, and deterministic execution model, we reduce memory movement and redundant weight fetches, enabling efficient low-precision inference. Compared to an NVIDIA L40S baseline, Lightning V2 achieves approximately 4x lower on-prem accelerator cost at equivalent throughput, while maintaining production audio fidelity. Our results demonstrate that precision co-design, combined with hardware-aware optimization, can fundamentally reshape the economics of real-time speech inference. 2026-03-24T13:02:58Z Ranjith M. S. 
Akshat Mandloi Sudarshan Kamath http://arxiv.org/abs/2304.04699v2 Efficient Distributed Decomposition and Routing Algorithms in Minor-Free Networks and Their Applications 2026-03-24T12:32:44Z In the LOCAL model, low-diameter decomposition is a useful tool in designing algorithms, as it allows us to shift from the general graph setting to the low-diameter graph setting, where brute-force information gathering can be done efficiently. Recently, Chang and Su [PODC 2022] showed that any high-conductance network excluding a fixed minor contains a high-degree vertex, so the entire graph topology can be gathered to one vertex efficiently in the CONGEST model using expander routing. Therefore, in networks excluding a fixed minor, many problems that can be solved efficiently in LOCAL via low-diameter decomposition can also be solved efficiently in CONGEST via expander decomposition. In this work, we show improved decomposition and routing algorithms for networks excluding a fixed minor in the CONGEST model. Our algorithms cost $\text{poly}(\log n, 1/ε)$ rounds deterministically. For bounded-degree graphs, our algorithms finish in $O(ε^{-1}\log n) + ε^{-O(1)}$ rounds. Our algorithms have a wide range of applications, including the following results in CONGEST. 1. A $(1-ε)$-approximate maximum independent set in a network excluding a fixed minor can be computed deterministically in $O(ε^{-1}\log^\ast n) + ε^{-O(1)}$ rounds, nearly matching the $Ω(ε^{-1}\log^\ast n)$ lower bound of Lenzen and Wattenhofer [DISC 2008]. 2. Property testing of any additive minor-closed property can be done deterministically in $O(\log n)$ rounds if $ε$ is a constant or $O(ε^{-1}\log n) + ε^{-O(1)}$ rounds if the maximum degree $Δ$ is a constant, nearly matching the $Ω(ε^{-1}\log n)$ lower bound of Levi, Medina, and Ron [PODC 2018]. 
2023-04-10T16:36:16Z Yi-Jun Chang http://arxiv.org/abs/2603.23049v1 PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving 2026-03-24T10:40:58Z Retrieval-Augmented Generation (RAG) systems enhance the performance of large language models (LLMs) by incorporating supplementary retrieved documents, enabling more accurate and context-aware responses. However, integrating these external documents often results in very long input sequences, which significantly increases computation costs during the prefill stage, where key-value (KV) representations for all input tokens are generated. This latency bottleneck becomes especially pronounced under high-throughput serving scenarios. KV-cache reuse offers a promising solution by storing previously computed KV states for shared input prefixes, thereby avoiding redundant computation across requests that contain overlapping context. Yet, the effectiveness of cache reuse is often limited by three practical challenges: low cache hit rates due to naive eviction policies, high CPU-GPU data transfer overhead, and slow SSD I/O when caches spill to storage. To address these issues, we propose PCR, a system designed to maximize KV-cache reuse efficiency through intelligent prefetching and pipelined data movement. Specifically, PCR introduces three key techniques: (1) a prefix-tree caching structure with a look-ahead LRU replacement policy that uses pending requests in the scheduler queue to improve cache hit ratios; (2) layer-wise overlapping that pipelines KV-cache loading and GPU computation across CUDA streams to hide communication latency; and (3) queue-based prefetching that proactively loads relevant KV caches from SSD into DRAM before they are needed. Extensive experiments show that PCR outperforms existing KV-cache reuse methods, achieving up to a 2.47x speedup in terms of average TTFT. 
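The look-ahead LRU idea in the PCR abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a flat key-value cache rather than a prefix tree, and the class name `LookAheadLRU` and the `pending` argument (keys that queued requests are expected to need) are hypothetical names chosen for illustration. The only point it shows is that eviction skips entries that a pending request will reuse, falling back to plain LRU when every entry is protected.

```python
from collections import OrderedDict


class LookAheadLRU:
    """Illustrative LRU cache that avoids evicting keys needed by queued requests."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> value, least recently used first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value, pending=()):
        """Insert a key; `pending` lists keys that queued requests will need."""
        if key in self.entries:
            self.entries.move_to_end(key)
            self.entries[key] = value
            return
        if len(self.entries) >= self.capacity:
            self._evict(set(pending))
        self.entries[key] = value

    def _evict(self, protected):
        # Prefer the least recently used key that no pending request needs.
        for candidate in self.entries:
            if candidate not in protected:
                del self.entries[candidate]
                return
        # Every cached key is protected: fall back to plain LRU eviction.
        self.entries.popitem(last=False)
```

For example, with a capacity of two holding keys `"a"` and `"b"`, calling `put("c", 3, pending=["a"])` evicts `"b"` even though `"a"` is older, because a queued request is known to need `"a"`. A real system would derive `pending` from the scheduler queue and would key the cache on shared token prefixes.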
2026-03-24T10:40:58Z Wenfeng Wang Xiaofeng Hou Peng Tang Hengyi Zhou Jing Wang Xinkai Wang Chao Li Minyi Guo http://arxiv.org/abs/2601.09166v2 DP-FedSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix 2026-03-24T06:40:36Z Differentially private federated learning (DP-FL) often suffers from slow convergence under tight privacy budgets because the noise required for privacy preservation degrades gradient quality. Although second-order optimization can accelerate training, existing approaches for DP-FL face significant scalability limitations: Newton-type methods require clients to compute Hessians, while feature covariance methods scale poorly with model dimension. We propose DP-FedSOFIM, a simple and scalable second-order optimization method for DP-FL. The method constructs an online regularized proxy for the Fisher information matrix at the server using only privatized aggregated gradients, capturing useful curvature information without requiring Hessian computations or feature covariance estimation. Efficient rank-one updates based on the Sherman-Morrison formula enable communication costs proportional to the model size and require only O(d) client-side memory. Because all curvature and preconditioning operations are performed at the server on already privatized gradients, DP-FedSOFIM introduces no additional privacy cost beyond the underlying privatized gradient release mechanism. Experiments on CIFAR-10 and PathMNIST show that DP-FedSOFIM converges faster and consistently achieves higher accuracy than DP-FedGD, DP-SCAFFOLD, and DP-FedFC across a range of privacy budgets, with particularly pronounced gains under stringent privacy constraints. 2026-01-14T05:11:28Z 40 pages, 4 figures, 3 tables. 
Submitted to TMLR Sidhant Nair Tanmay Sen Mrinmay Sen Sayantan Banerjee http://arxiv.org/abs/2603.22774v1 Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference 2026-03-24T04:06:27Z Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving workloads, we find that multi-GPU performance frequently degrades not because GPUs are saturated, but because CPUs fail to keep the GPUs busy. Under limited CPU allocations, systems exhibit symptoms such as delayed kernel launch, stalled communication, and increased tokenization latency, leading to severe GPU underutilization even when ample GPU resources are available. This work presents a systematic analysis of CPU-induced slowdowns in multi-GPU LLM inference. We show that these bottlenecks persist even in serving stacks that employ process-level separation and modern GPU-side optimizations such as CUDA Graphs. Since the marginal cost of additional CPU cores is small relative to GPU instance pricing, our evaluation indicates that increasing the number of CPU cores can substantially improve performance and stability at minimal additional cost. Under moderate serving load, we observe that CPU-starved configurations frequently time out, while providing adequate CPU resources restores responsiveness and reduces time-to-first-token (TTFT) latency by 1.36-5.40x across configurations, all without requiring additional GPUs. This work shows that CPU provisioning is a crucial factor in multi-GPU LLM inference configuration, helping prevent control-side bottlenecks. 
2026-03-24T04:06:27Z 13 pages, 13 figures, 1 table Euijun Chung Yuxiao Jia Aaron Jezghani Hyesoon Kim http://arxiv.org/abs/2603.20661v2 WWW.Serve: Interconnecting Global LLM Services through Decentralization 2026-03-24T03:29:51Z Large language model (LLM) services are mostly centralized, leading to scalability bottlenecks and underutilization of substantial scattered GPU resources. While decentralization offers a promising alternative, existing frameworks primarily focus on cooperation among GPU providers while overlooking their inherent competitive dynamics, imposing substantial constraints such as excessive platform-level oversight or rigid requirements to execute all assigned requests using fixed software stacks on fixed hardware configurations. We argue that such assumptions are unrealistic in real-world decentralized environments. To this end, we propose WWW$.$Serve, a decentralized framework for interconnecting LLM services worldwide. It allows participants to flexibly determine their participation policies and resource commitments, and supports self-organizing request dispatch, enabling the network to autonomously allocate requests without centralized coordination. Empirically, we show that WWW$.$Serve improves global SLO (service-level-objective) attainment by up to 1.5x and lowers latency by 27.6%. Its performance approaches, and in some cases surpasses, centralized scheduling, while fully preserving the benefits of decentralization. These results highlight WWW$.$Serve as a promising foundation for real-world, decentralized LLM serving. 2026-03-21T05:34:08Z Huanyu Wang Ziyu Xia Zhuoming Chen Beidi Chen