http://arxiv.org/abs/2602.03207v1
WebSplatter: Enabling Cross-Device Efficient Gaussian Splatting in Web Browsers via WebGPU
2026-02-03T07:18:40Z
We present WebSplatter, an end-to-end GPU rendering pipeline for the heterogeneous web ecosystem. Unlike naive ports, WebSplatter introduces a wait-free hierarchical radix sort that circumvents the lack of global atomics in WebGPU, ensuring deterministic execution across diverse hardware. Furthermore, we propose an opacity-aware geometry culling stage that dynamically prunes splats before rasterization, significantly reducing overdraw and peak memory footprint. Evaluation demonstrates that WebSplatter consistently achieves 1.2$\times$ to 4.5$\times$ speedups over state-of-the-art web viewers.
2026-02-03T07:18:40Z
Yudong Han, Chao Xu, Xiaodan Ye, Weichen Bi, Zilong Dong, Yun Ma

http://arxiv.org/abs/2511.15626v2
A Latency-Constrained, Gated Recurrent Unit (GRU) Implementation in the Versal AI Engine
2026-02-02T13:34:23Z
This work explores the use of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) to accelerate Gated Recurrent Unit (GRU) inference for latency constrained applications. We present a custom workload distribution framework across the AIE's vector processors and propose a hybrid AIE - Programmable Logic (PL) design to optimize computational efficiency. Our approach explores the parallelization over the rows of the matrices by utilizing as many of the AIE vectorized processors effectively computing all the elements of the resulting vector at the same time, an alternative to cascade stream pipelining.
2025-11-19T17:12:54Z
M. Sapkas, A. Triossi, M. Zanetti

http://arxiv.org/abs/2602.01500v1
Implementation Challenges in Quantum Key Distribution
2026-02-02T00:30:44Z
In recent years, quantum computing technologies have steadily matured and have begun to find practical applications across various domains.
One important area is network communication security, where Quantum Key Distribution (QKD) enables communicating parties to establish a shared secret that can then be used to generate symmetric keys for subsequent encryption and decryption. This study focuses on implementing and comparing two well-known QKD protocols, namely BB84 and E91, within an actual quantum computing environment. It also proposes the use of SX gate operations to generate uniform quantum superposition states. By leveraging the properties of quantum superposition and quantum entanglement, the study illustrates how communicating parties can securely obtain a shared secret while preventing adversaries from intercepting it. The experiments are conducted using the IBM Quantum Platform to demonstrate the feasibility of the BB84 and E91 protocols on actual quantum hardware. The evaluation considers several metrics, including entropy, Independent and Identically Distributed (IID), and error-rate verifications.
2026-02-02T00:30:44Z
in Chinese
Abel C. H. Chen

http://arxiv.org/abs/2506.02023v2
DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials
2026-01-31T22:36:41Z
Large-scale atomistic simulations are essential to bridge computational materials and chemistry to realistic materials and drug discovery applications. In the past few years, rapid developments of machine learning interatomic potentials (MLIPs) have offered a solution to scale up quantum mechanical calculations. Parallelizing these interatomic potentials across multiple devices poses a challenging, but promising approach to further extending simulation scales to real-world applications. In this work, we present DistMLIP, an efficient distributed inference platform for MLIPs based on zero-redundancy, graph-level parallelization.
In contrast to conventional spatial partitioning parallelization, DistMLIP enables efficient MLIP parallelization through graph partitioning, allowing multi-device inference on flexible MLIP model architectures like multi-layer graph neural networks. DistMLIP presents an easy-to-use, flexible, plug-in interface that enables distributed inference of pre-existing MLIPs. We demonstrate DistMLIP on four widely used and state-of-the-art MLIPs: CHGNet, MACE, TensorNet, and eSEN. We show that DistMLIP can simulate atomic systems 3.4x larger and up to 8x faster compared to previous multi-GPU methods. We show that existing foundation potentials can perform near-million-atom calculations at the scale of a few seconds on 8 GPUs with DistMLIP.
2025-05-28T23:23:36Z
ICLR 2026
Kevin Han, Bowen Deng, Amir Barati Farimani, Gerbrand Ceder

http://arxiv.org/abs/2602.02574v1
WritePolicyBench: Benchmarking Memory Write Policies under Byte Budgets
2026-01-31T07:26:12Z
We introduce WritePolicyBench, a benchmark for evaluating memory write policies: decision rules that choose what to store, merge, and evict under a strict byte budget while processing a stream with document/API drift. The benchmark provides (i) task generators with controlled non-stationarity, (ii) an explicit action interface for external memory, (iii) a byte-accurate cost model, and (iv) standardized metrics that measure both task success and budget efficiency.
2026-01-31T07:26:12Z
10 pages, 4 figures
Edgard El Cham

http://arxiv.org/abs/2603.08713v1
Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction
2026-01-30T23:24:17Z
Large Language Models (LLMs) have intensified the need for low-precision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive due to its favorable hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA's NVFP4 in accuracy, limiting adoption.
We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall errors by increasing effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest GEMM overhead (6.2% on average). These results re-establish MXFP4 as a practical alternative to NVFP4, enabling near-NVFP4 accuracy while retaining MX's hardware-efficiency advantages (e.g., 12% relative area savings in tensor cores).
2026-01-30T23:24:17Z
Jatin Chhugani, Geonhwa Jeong, Bor-Yiing Su, Yunjie Pan, Hanmei Yang, Aayush Ankit, Jiecao Yu, Summer Deng, Yunqing Chen, Nadathur Satish, Changkyu Kim

http://arxiv.org/abs/2602.00343v1
Standardized Methods and Recommendations for Green Federated Learning
2026-01-30T21:46:36Z
Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation.
In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.
2026-01-30T21:46:36Z
4 sections, 9 pages, 5 figures, 26 references, submission to acm e-energy,
Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru

http://arxiv.org/abs/2505.11480v3
SuperCoder: Assembly Program Superoptimization with Large Language Models
2026-01-30T18:27:24Z
Superoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 assembly programs averaging 130 lines, in contrast to prior datasets restricted to 2-15 straight-line, loop-free programs. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup.
Starting from Qwen2.5-Coder-7B-Instruct (61.4% correctness, 1.10x speedup), the fine-tuned model SuperCoder attains 95.0% correctness and 1.46x average speedup, with additional improvement enabled by Best-of-N sampling and iterative refinement. Our results demonstrate, for the first time, that LLMs can be applied as superoptimizers for assembly programs, establishing a foundation for future research in program performance optimization beyond compiler heuristics.
2025-05-16T17:40:45Z
Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, Alex Aiken

http://arxiv.org/abs/2601.22892v1
Assessing the Real-World Impact of Post-Quantum Cryptography on WPA-Enterprise Networks
2026-01-30T12:12:07Z
The advent of large-scale quantum computers poses a significant threat to contemporary network security protocols, including Wi-Fi Protected Access (WPA)-Enterprise authentication. To mitigate this threat, the adoption of Post-Quantum Cryptography (PQC) is critical. In this work, we investigate the performance impact of PQC algorithms on WPA-Enterprise-based authentication. To this end, we conduct an experimental evaluation of authentication latency using a testbed built with the open-source tools FreeRADIUS and hostapd, measuring the time spent at the client, access point, and RADIUS server. We evaluate multiple combinations of PQC algorithms and analyze their performance overhead in comparison to currently deployed cryptographic schemes. Beyond performance, we assess the security implications of these algorithm choices by relating authentication mechanisms to the quantum effort required for their exploitation. This perspective enables a systematic categorization of PQ-relevant weaknesses in WPA-Enterprise according to their practical urgency. The evaluation results show that, although PQC introduces additional authentication latency, combinations such as ML-DSA-65 and Falcon-1024 used in conjunction with ML-KEM provide a favorable trade-off between security and performance.
Furthermore, we demonstrate that the resulting overhead can be effectively mitigated through session resumption. Overall, this work presents a first real-world performance evaluation of PQC-enabled WPA-Enterprise authentication and demonstrates its practical feasibility for enterprise Wi-Fi deployments.
2026-01-30T12:12:07Z
Lukas Köder, Nils Lohmiller, Phil Schmieder, Bastian Buck, Michael Menth, Tobias Heer

http://arxiv.org/abs/2601.22760v1
AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation
2026-01-30T09:34:59Z
The performance of deep learning models critically depends on efficient kernel implementations, yet developing high-performance kernels for specialized accelerators remains time-consuming and expertise-intensive. While recent work demonstrates that large language models (LLMs) can generate correct and performant GPU kernels, kernel generation for neural processing units (NPUs) remains largely underexplored due to domain-specific programming models, limited public examples, and sparse documentation. Consequently, directly generating AscendC kernels with LLMs yields extremely low correctness, highlighting a substantial gap between GPU and NPU kernel generation.
We present AscendCraft, a DSL-guided approach for automatic AscendC kernel generation. AscendCraft introduces a lightweight DSL that abstracts non-essential complexity while explicitly modeling Ascend-specific execution semantics. Kernels are first generated in the DSL using category-specific expert examples and then transcompiled into AscendC through structured, constraint-driven LLM lowering passes. Evaluated on MultiKernelBench across seven operator categories, AscendCraft achieves 98.1% compilation success and 90.4% functional correctness. Moreover, 46.2% of generated kernels match or exceed PyTorch eager execution performance, demonstrating that DSL-guided transcompilation can enable LLMs to generate both correct and competitive NPU kernels. Beyond benchmarks, AscendCraft further demonstrates its generality by successfully generating two correct kernels for newly proposed mHC architecture, achieving performance that substantially surpasses PyTorch eager execution.
2026-01-30T09:34:59Z
Zhongzhen Wen, Shudi Shao, Zhong Li, Yu Ge, Tongtong Xu, Yuanyi Lin, Tian Zhang

http://arxiv.org/abs/2601.17356v2
Obfuscation as an Effective Signal for Prioritizing Cross-Chain Smart Contract Audits: Large-Scale Measurement and Risk Profiling
2026-01-30T05:43:21Z
Obfuscation raises the interpretation cost of smart-contract auditing, yet its signals are hard to transfer across chains. We present HOBFNET, a fast surrogate of OBFPROBE, enabling million-scale cross-chain scoring. The model aligns with tool outputs on Ethereum (PCC 0.9158, MAPE 8.20 percent) and achieves 8-9 ms per contract, yielding a 2.3k-5.2k times speedup. Across BSC, Polygon, and Avalanche, we observe systematic score drift, motivating within-chain percentile queues (p99 as the main queue, p99.9 as an emergency queue). The high-score tail is characterized by rare selectors, external-call enrichment, and low signature density, supporting secondary triage.
Cross-chain reuse is tail-enriched and directionally biased from smaller to larger ecosystems. On two publicly alignable cross-chain spillover cases, both fall into the p99 queue, indicating real-world hit value. We deliver a two-tier audit queue and a cross-chain linkage workflow for practical security operations.
2026-01-24T08:05:39Z
Yao Zhao, Zhang Sheng, Shengchen Duan, Shen Wang, Daoyuan Wu, Zhiyuan Wan

http://arxiv.org/abs/2601.20653v1
The Multiserver-Job Stochastic Recurrence Equation for Cloud Computing Performance Evaluation
2026-01-28T14:36:54Z
We study the Multiserver-Job Queuing Model (MJQM) with general independent arrivals and service times under FCFS scheduling, using stochastic recurrence equations (SREs) and ergodic theory. We prove the monotonicity and separability properties of the MJQM SRE, enabling the application of the monotone-separable extension of Loynes' theorem and the formal definition of the MJQM stability condition. Based on these results, we introduce and implement two algorithms: one for drawing sub-perfect samples (SPS) of the system's workload and the second one to estimate the system's stability condition given the statistics of the jobs' input stream. The SPS algorithm allows for a massive GPU parallelization, greatly improving the efficiency of performance metrics evaluation. We also show that this approach extends to more complex systems, including MJQMs with typed resources.
2026-01-28T14:36:54Z
Francois Baccelli, Diletta Olliaro, Marco Ajmone Marsan, Andrea Marin

http://arxiv.org/abs/2601.20537v1
Colored Markov Modulated Fluid Queues
2026-01-28T12:27:31Z
Markov-modulated fluid queues (MMFQs) are a powerful modeling framework for analyzing the performance of computer and communication systems. Their distinguishing feature is that the underlying Markov process evolves on a continuous state space, making them well suited to capture the dynamics of workloads, energy levels, and other performance-related quantities.
Although classical MMFQs do not permit jumps in the fluid level, they can still be applied to analyze a wide range of jump processes.
In this paper, we generalize the MMFQ framework in a new direction by introducing {\bf colored MMFQs} and {\bf colored MMFQs with fluid jumps}. This enriched framework provides an additional form of memory: the color of incoming fluid can be used to keep track of the fluid level when certain events took place. This capability greatly enhances modeling flexibility and enables the analysis of queueing systems that would otherwise be intractable due to the curse of dimensionality or state-space explosion.
2026-01-28T12:27:31Z
Benny Van Houdt

http://arxiv.org/abs/2501.09144v4
Rule-Based Graph Programs Matching the Time Complexity of Imperative Algorithms
2026-01-27T12:26:46Z
We report on recent advances in rule-based graph programming, which allow us to match the time complexity of some fundamental imperative graph algorithms. In general, achieving the time complexity of graph algorithms implemented in conventional languages using a rule-based graph-transformation language is challenging due to the cost of graph matching. Previous work demonstrated that with rooted rules, certain algorithms can be implemented in the graph programming language GP 2 such that their runtime matches the time complexity of imperative implementations. However, this required input graphs to have a bounded node degree and (for some algorithms) to be connected. In this paper, we overcome these limitations by enhancing the graph data structure generated by the GP 2 compiler and exploiting the new structure in programs. We present three case studies: the first program checks whether input graphs are connected, the second program checks whether input graphs are acyclic, and the third program solves the single-source shortest-paths problem for graphs with integer edge-weights. The first two programs run in linear time on (possibly disconnected) input graphs with arbitrary node degrees.
The third program runs in time $O(nm)$ on arbitrary input graphs, matching the time complexity of imperative implementations of the Bellman-Ford algorithm. For each program, we formally prove its correctness and time complexity, and provide runtime experiments on various graph classes.
2025-01-15T20:52:37Z
LMCS
Ziad Ismaili Alaoui, Detlef Plump

http://arxiv.org/abs/2509.01999v3
Non-Asymptotic Performance Analysis of DOA Estimation Based on Real-Valued Root-MUSIC
2026-01-27T07:11:15Z
This paper presents a systematic theoretical performance analysis of the Real-Valued root-MUSIC (RV-root-MUSIC) algorithm under non-asymptotic conditions. A well-known limitation of RV-root-MUSIC is the estimation ambiguity caused by mirror roots, which are typically suppressed using conventional beamforming (CBF). By leveraging the equivalent subspace constructed through the conjugate extension method and exploiting the equivalence of perturbations for true and mirror roots, this work provides a comprehensive study of three key aspects: noise subspace perturbation, true-root perturbation, and mirror-root perturbation. A statistical model is established, and generalized perturbation expressions are derived. Monte Carlo simulations confirm the correctness and effectiveness of the theoretical results. The analysis provides a rigorous foundation for parameter optimization in Direction-of-Arrival (DOA) estimation, with applications in radar, wireless communications, and intelligent sensing.
2025-09-02T06:26:21Z
Accepted to ICASSP 2026
Junyang Liu, Weicheng Zhao, Qingping Wang, Xiangtian Meng, Maria Greco, Fulvio Gini