Wasure: A Modular Toolkit for Comprehensive WebAssembly Benchmarking

2026-02-05T09:49:58Z

WebAssembly (Wasm) has become a key compilation target for portable and efficient execution across diverse platforms. Benchmarking its performance, however, is a multi-dimensional challenge: it depends not only on the choice of runtime engines, but also on hardware architectures, application domains, source languages, benchmark suites, and runtime configurations. This paper introduces Wasure, a modular and extensible command-line toolkit that simplifies the execution and comparison of WebAssembly benchmarks. To complement performance evaluation, we also conducted a dynamic analysis of the benchmark suites included with Wasure. Our analysis reveals substantial differences in code coverage, control flow, and execution patterns, emphasizing the need for benchmark diversity. Wasure aims to support researchers and developers in conducting more systematic, transparent, and insightful evaluations of WebAssembly engines.

Emergence-as-Code for Self-Governing Reliable Systems

2026-02-05T09:04:33Z

SLO-as-code has made per-service} reliability declarative, but user experience is defined by journeys whose reliability is an emergent property of microservice topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. As a result, journey objectives (e.g., "checkout p99 < 400 ms") are often maintained outside code and drift as the system evolves, forcing teams to either miss user expectations or over-provision and gate releases with ad-hoc heuristics. We propose Emergence-as-Code (EmaC), a vision for making journey reliability computable and governable via intent plus evidence. An EmaC spec declares journey intent (objective, control-flow operators, allowed actions) and binds it to atomic SLOs and telemetry. A runtime inference component consumes operational artifacts (e.g., tracing and traffic configuration) to synthesize a candidate journey model with provenance and confidence. From the last accepted model, the EmaC compiler/controller derives bounded journey SLOs and budgets under explicit correlation assumptions (optimistic independence vs. pessimistic shared fate), and emits control-plane artifacts (burn-rate alerts, rollout gates, action guards) that are reviewable in a Git workflow. An anonymized artifact repository provides a runnable example specification and generated outputs.

Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms

2026-02-05T05:37:04Z

The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges to achieving high inference efficiency. To address these efficiency bottlenecks across diverse platforms, this paper proposes Opt4GPTQ, a practical optimization method designed for 4-bit GPTQ quantized LLMs inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering Optimization (SMB-Opt), which caches frequently accessed data in shared memory and employs single-threaded writes; Vectorized Memory Loading Optimization (VML-Opt), which utilizes vectorized memory operations for efficient data loading; and Inline Assembly Optimization (ILA-Opt), which directly leverages hardwarenative vector half-precision addition and fused multiply-accumulate instructions. Experimental results show that Opt4GPTQ effectively improves performance across various models while maintaining original model accuracy, achieving throughput gains of up to 84.42%. This work highlights the critical role of platformlevel engineering in enabling efficient LLMs inference on emerging architectures and provides valuable methodologies for future heterogeneous platform adaptation.

Accelerating the Tesseract Decoder for Quantum Error Correction

2026-02-04T21:52:54Z

Quantum Error Correction (QEC) is essential for building robust, fault-tolerant quantum computers; however, the decoding process often presents a significant computational bottleneck. Tesseract is a novel Most-Likely-Error (MLE) decoder for QEC that employs the A* search algorithm to explore an exponentially large graph of error hypotheses, achieving high decoding speed and accuracy. This paper presents a systematic approach to optimizing the Tesseract decoder through low-level performance enhancements. Based on extensive profiling, we implemented four targeted optimization strategies, including the replacement of inefficient data structures, reorganization of memory layouts to improve cache hit rates, and the use of hardware-accelerated bit-wise operations. We achieved significant decoding speedups across a wide range of code families and configurations, including Color Codes, Bivariate-Bicycle Codes, Surface Codes, and Transversal CNOT Protocols. Our results demonstrate consistent speedups of approximately 2x for most code families, often exceeding 2.5x. Notably, we achieved a peak performance gain of over 5x for the most computationally demanding configurations of Bivariate-Bicycle Codes. These improvements make the Tesseract decoder more efficient and scalable, serving as a practical case study that highlights the importance of high-performance software engineering in QEC and providing a strong foundation for future research.

A-Graph: A Unified Graph Representation for At-Will Simulation across System Stacks

2026-02-04T18:37:05Z

As computer systems continue to diversify across technologies, architectures, applications, and beyond, the relevant design space has become larger and more complex. Given such trends, design space exploration (DSE) at early stages is critical to ensure agile development towards optimal performance and cost. Industry-grade EDA tools directly take in RTL code and report accurate results, but do not perform DSE. Recent works have attempted to explore the design space via simulation. However, most of these works are domain-specific and constrain the space that users are allowed to explore, offering limited flexibility between technologies, architecture, and applications. Moreover, they often demand high domain expertise to ensure high accuracy. To enable simulation that is agnostic to technology, architecture, and application at any granularity, we introduce Architecture-Graph (Agraph), a graph that unifies the system representation surrounding any arbitrary application, software, architecture, and circuit. Such a unified representation distinguishes Agraph from prior works, which focus on a single stack, allowing users to freely explore the design space across system stacks. To fully unleash the potential of Agraph, we further present Archx, a framework that implements Agraph. Archx is user-friendly in two ways. First, Archx has an easy-to-use programming interface to automatically generate and sweep design points under user constraints, boosting the programmability. Second, Archx adopts scope-based metric retrieval to analyze and understand each design point at any user-preferred hierarchy, enhancing the explainability. We conduct case studies that demonstrate Agraph's generalization across technologies, architecture, and applications with high simulation accuracy. Overall, we argue that Agraph and Archx serve as a foundation to simulate both performance and cost at will.

LoPace: A Lossless Optimized Prompt Accurate Compression Engine for Large Language Model Applications

2026-02-04T05:35:35Z

Large Language Models (LLMs) have changed the way natural language processing works, but it is still hard to store and manage prompts efficiently in production environments. This paper presents LoPace (Lossless Optimized Prompt Accurate Compression Engine), a novel compression framework designed specifically for prompt storage in LLM applications. LoPace uses three different ways to compress data: Zstandard-based compression, Byte-Pair Encoding (BPE) tokenization with binary packing, and a hybrid method that combines the two. We show that LoPace saves an average of 72.2\% of space while still allowing for 100\% lossless reconstruction by testing it on 386 different prompts, such as code snippets, markdown documentation, and structured content. The hybrid method always works better than each technique on its own. It gets mean compression ratios of 4.89x (range: 1.22--19.09x) and speeds of 3.3--10.7 MB/s. Our findings show that LoPace is ready for production, with a small memory footprint (0.35 MB on average) and great scalability for big databases and real-time LLM apps.

Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

2026-02-04T02:52:27Z

Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific meanings. Experiments on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.

WebSplatter: Enabling Cross-Device Efficient Gaussian Splatting in Web Browsers via WebGPU

2026-02-03T07:18:40Z

We present WebSplatter, an end-to-end GPU rendering pipeline for the heterogeneous web ecosystem. Unlike naive ports, WebSplatter introduces a wait-free hierarchical radix sort that circumvents the lack of global atomics in WebGPU, ensuring deterministic execution across diverse hardware. Furthermore, we propose an opacity-aware geometry culling stage that dynamically prunes splats before rasterization, significantly reducing overdraw and peak memory footprint. Evaluation demonstrates that WebSplatter consistently achieves 1.2$\times$ to 4.5$\times$ speedups over state-of-the-art web viewers.

A Latency-Constrained, Gated Recurrent Unit (GRU) Implementation in the Versal AI Engine

2026-02-02T13:34:23Z

This work explores the use of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) to accelerate Gated Recurrent Unit (GRU) inference for latency constrained applications. We present a custom workload distribution framework across the AIE's vector processors and propose a hybrid AIE - Programmable Logic (PL) design to optimize computational efficiency. Our approach explores the parallelization over the rows of the matrices by utilizing as many of the AIE vectorized processors effectively computing all the elements of the resulting vector at the same time, an alternative to cascade stream pipelining.

Implementation Challenges in Quantum Key Distribution

2026-02-02T00:30:44Z

In recent years, quantum computing technologies have steadily matured and have begun to find practical applications across various domains. One important area is network communication security, where Quantum Key Distribution (QKD) enables communicating parties to establish a shared secret that can then be used to generate symmetric keys for subsequent encryption and decryption. This study focuses on implementing and comparing two well-known QKD protocols, namely BB84 and E91, within an actual quantum computing environment. It also proposes the use of SX gate operations to generate uniform quantum superposition states. By leveraging the properties of quantum superposition and quantum entanglement, the study illustrates how communicating parties can securely obtain a shared secret while preventing adversaries from intercepting it. The experiments are conducted using the IBM Quantum Platform to demonstrate the feasibility of the BB84 and E91 protocols on actual quantum hardware. The evaluation considers several metrics, including entropy, Independent and Identically Distributed (IID), and error-rate verifications.

DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials

2026-01-31T22:36:41Z

Large-scale atomistic simulations are essential to bridge computational materials and chemistry to realistic materials and drug discovery applications. In the past few years, rapid developments of machine learning interatomic potentials (MLIPs) have offered a solution to scale up quantum mechanical calculations. Parallelizing these interatomic potentials across multiple devices poses a challenging, but promising approach to further extending simulation scales to real-world applications. In this work, we present DistMLIP, an efficient distributed inference platform for MLIPs based on zero-redundancy, graph-level parallelization. In contrast to conventional spatial partitioning parallelization, DistMLIP enables efficient MLIP parallelization through graph partitioning, allowing multi-device inference on flexible MLIP model architectures like multi-layer graph neural networks. DistMLIP presents an easy-to-use, flexible, plug-in interface that enables distributed inference of pre-existing MLIPs. We demonstrate DistMLIP on four widely used and state-of-the-art MLIPs: CHGNet, MACE, TensorNet, and eSEN. We show that DistMLIP can simulate atomic systems 3.4x larger and up to 8x faster compared to previous multi-GPU methods. We show that existing foundation potentials can perform near-million-atom calculations at the scale of a few seconds on 8 GPUs with DistMLIP.

WritePolicyBench: Benchmarking Memory Write Policies under Byte Budgets

2026-01-31T07:26:12Z

We introduce WritePolicyBench, a benchmark for evaluating memory write policies: decision rules that choose what to store, merge, and evict under a strict byte budget while processing a stream with document/API drift. The benchmark provides (i) task generators with controlled non-stationarity, (ii) an explicit action interface for external memory, (iii) a byte-accurate cost model, and (iv) standardized metrics that measure both task success and budget efficiency.

Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

2026-01-30T23:24:17Z

Large Language Models (LLMs) have intensified the need for low-precision formats that enable efficient, large-scale inference. The Open Compute Project (OCP) Microscaling (MX) standard is attractive due to its favorable hardware efficiency, but its 4-bit variant (MXFP4) lags behind NVIDIA's NVFP4 in accuracy, limiting adoption. We introduce two software-only techniques, Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS), that improve MXFP4 quantization fidelity without requiring hardware changes. OAS reduces overall errors by increasing effective dynamic range under power-of-two block scaling, while MBS allocates higher-precision scaling at a coarser granularity to better preserve outliers. Across multiple LLMs and standard downstream benchmarks, OAS and MBS reduce the end-to-end accuracy gap between MXFP4 and NVFP4 from about 10% to below 1% on average, while incurring modest GEMM overhead (6.2% on average). These results re-establish MXFP4 as a practical alternative to NVFP4, enabling near-NVFP4 accuracy while retaining MX's hardware-efficiency advantages (e.g., 12% relative area savings in tensor cores).

Standardized Methods and Recommendations for Green Federated Learning

2026-01-30T21:46:36Z

Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs.\ V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.

SuperCoder: Assembly Program Superoptimization with Large Language Models

2026-01-30T18:27:24Z

Superoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 assembly programs averaging 130 lines, in contrast to prior datasets restricted to 2-15 straight-line, loop-free programs. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup. Starting from Qwen2.5-Coder-7B-Instruct (61.4% correctness, 1.10x speedup), the fine-tuned model SuperCoder attains 95.0% correctness and 1.46x average speedup, with additional improvement enabled by Best-of-N sampling and iterative refinement. Our results demonstrate, for the first time, that LLMs can be applied as superoptimizers for assembly programs, establishing a foundation for future research in program performance optimization beyond compiler heuristics.