https://arxiv.org/api/mI/RyzEla92LX494+h0H5orDLpA 2026-06-27T02:55:27Z 9951 1470 15 http://arxiv.org/abs/2506.22169v1 MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators 2025-06-27T12:31:24Z Operator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the dynamicity of tensor dimension sizes could potentially lead to these operators becoming memory-bound, necessitating the generation of fused kernels, a task hindered by limited search spaces for fusion strategies, redundant memory access, and prolonged tuning time, leading to sub-optimal performance and inefficient deployment. We introduce MCFuser, a pioneering framework designed to overcome these obstacles by generating high-performance fused kernels for what we define as memory-bound compute-intensive (MBCI) operator chains. Leveraging high-level tiling expressions to delineate a comprehensive search space, coupled with Directed Acyclic Graph (DAG) analysis to eliminate redundant memory accesses, MCFuser streamlines kernel optimization. By implementing guidelines to prune the search space and incorporating an analytical performance model with a heuristic search, MCFuser not only significantly accelerates the tuning process but also demonstrates superior performance. Benchmarked against leading compilers like Ansor on NVIDIA A100 and RTX3080 GPUs, MCFuser achieves up to a 5.9x speedup in kernel performance and outpaces other baselines while reducing tuning time by over 70-fold, showcasing its agility. 2025-06-27T12:31:24Z 12 pages, accepted at SC 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024 Zheng Zhang Donglin Yang Xiaobo Zhou Dazhao Cheng 10.1109/SC41406.2024.00040 http://arxiv.org/abs/2008.09253v2 Describing Console I/O Behavior for Testing Student Submissions in Haskell 2025-06-26T13:12:32Z We present a small, formal language for specifying the behavior of simple console I/O programs. The design is driven by the concrete application case of testing interactive Haskell programs written by students. Specifications are structurally similar to lexical analysis regular expressions, but are augmented with features like global variables that track state and history of program runs, enabling expression of an interesting range of dynamic behavior. We give a semantics for our specification language based on acceptance of execution traces. From this semantics we derive a definition of the set of all traces valid for a given specification. Sampling that set enables us to mechanically check program behavior against specifications in a probabilistic fashion. Beyond testing, other possible uses of the specification language in an education context include related activities like providing more helpful feedback, generating sample solutions, and even generating random exercise tasks. 2020-08-21T01:21:59Z In Proceedings TFPIE 2019 and 2020, arXiv:2008.08923 EPTCS 321, 2020, pp. 19-36 Oliver Westphal Janis Voigtländer 10.4204/EPTCS.321.2 http://arxiv.org/abs/2506.20624v1 PhasePoly: An Optimization Framework forPhase Polynomials in Quantum Circuits 2025-06-25T17:13:16Z Quantum computing has transformative computational power to make classically intractable computing feasible. As the algorithms that achieve practical quantum advantage are beyond manual tuning, quantum circuit optimization has become extremely important and integrated into today's quantum software stack. This paper focuses on a critical type of quantum circuit optimization -- phase-polynomial optimization. Phase polynomials represents a class of building-block circuits that appear frequently in quantum modular exponentials (the most time-consuming component in Shor's factoring algorithm), in quantum approximation optimization algorithms (QAOA), and in Hamiltonian simulations. Compared to prior work on phase polynomials, we focus more on the impact of phase polynomial synthesis in the context of whole-circuit optimization, from single-block phase polynomials to multiple block phase polynomials, from greedy equivalent sub-circuit replacement strategies to a systematic parity matrix optimization approach, and from hardware-oblivious logical circuit optimization to hardware-friendly logical circuit optimization. We also provide a utility of our phase polynomial optimization framework to generate hardware-friendly building blocks. Our experiments demonstrate improvements of up to 50%-with an average total gate reduction of 34.92%-and reductions in the CNOT gate count of up to 48.57%, averaging 28.53%, for logical circuits. Additionally, for physical circuits, we achieve up to 47.65% CNOT gate reduction with an average reduction of 25.47% across a representative set of important benchmarks. 2025-06-25T17:13:16Z 14 pages, 12 figures Zihan Chen Henry Chen Yuwei Jin Minghao Guo Enhyeok Jang Jiakang Li Caitlin Chan Won Woo Ro Eddy Z. Zhang http://arxiv.org/abs/2506.20310v1 Unfolding Iterators: Specification and Verification of Higher-Order Iterators, in OCaml 2025-06-25T10:48:38Z Albeit being a central notion of every programming language, formally and modularly reasoning about iteration proves itself to be a non-trivial feat, specially in the context of higher-order iteration. In this paper, we present a generic approach to the specification and deductive verification of higher-order iterators, written in the OCaml language. Our methodology follows two key principles: first, the usage of the Gospel specification language to describe the general behaviour of any iteration schema; second, the usage of the Cameleer framework to deductively verify that every iteration client is correct with respect to its logical specification. To validate our approach we develop a set of verified case studies, ranging from classic list iterators to graph algorithms implemented in the widely used OCamlGraph library. 2025-06-25T10:48:38Z Ion Chirica Mário Pereira http://arxiv.org/abs/2506.20127v1 Dynamic Race Detection With O(1) Samples 2025-06-25T04:46:35Z Happens before-based dynamic analysis is the go-to technique for detecting data races in large scale software projects due to the absence of false positive reports. However, such analyses are expensive since they employ expensive vector clock updates at each event, rendering them usable only for in-house testing. In this paper, we present a sampling-based, randomized race detector that processes only constantly many events of the input trace even in the worst case. This is the first sub-linear time (i.e., running in o(n) time where n is the length of the trace) dynamic race detection algorithm; previous sampling based approaches like Pacer run in linear time (i.e., O(n)). Our algorithm is a property tester for HB-race detection -- it is sound in that it never reports any false positive, and on traces that are far, with respect to hamming distance, from any race-free trace, the algorithm detects an HB-race with high probability. Our experimental evaluation of the algorithm and its comparison with state-of-the-art deterministic and sampling based race detectors shows that the algorithm does indeed have significantly low running time, and detects races quite often. 2025-06-25T04:46:35Z Preprint version of the paper published in PACMPL, Volume 7, Issue POPL (POPL 2023). Official version available at https://doi.org/10.1145/3571238 In Proceedings of the ACM on Programming Languages (PACMPL), Volume 7, Issue POPL, January 2023, Article 45, 28 pages. ACM Mosaad Al Thokair Minjian Zhang Umang Mathur Mahesh Viswanathan 10.1145/3571238 http://arxiv.org/abs/2405.10611v2 A Certified Proof Checker for Deep Neural Network Verification in Imandra 2025-06-24T15:45:04Z Recent advances in the verification of deep neural networks (DNNs) have opened the way for a broader usage of DNN verification technology in many application areas, including safety-critical ones. However, DNN verifiers are themselves complex programs that have been shown to be susceptible to errors and numerical imprecision; this, in turn, has raised the question of trust in DNN verifiers. One prominent attempt to address this issue is enhancing DNN verifiers with the capability of producing certificates of their results that are subject to independent algorithmic checking. While formulations of Marabou certificate checking already exist on top of the state-of-the-art DNN verifier Marabou, they are implemented in C++, and that code itself raises the question of trust (e.g., in the precision of floating point calculations or guarantees for implementation soundness). Here, we present an alternative implementation of the Marabou certificate checking in Imandra -- an industrial functional programming language and an interactive theorem prover (ITP) -- that allows us to obtain full proof of certificate correctness. The significance of the result is two-fold. Firstly, it gives stronger independent guarantees for Marabou proofs. Secondly, it opens the way for the wider adoption of DNN verifiers in interactive theorem proving in the same way as many ITPs already incorporate SMT solvers. 2024-05-17T08:16:32Z Accepted at ITP 2025, Interactive Theorem Proving Remi Desmartin Omri Isac Grant Passmore Ekaterina Komendantskaya Kathrin Stark Guy Katz http://arxiv.org/abs/2405.12976v2 A Sound Type System for Secure Currency Flow 2025-06-24T13:10:41Z In this paper we focus on TinySol, a minimal calculus for Solidity smart contracts, introduced by Bartoletti et al. We start by rephrasing its syntax (to emphasise its object-oriented flavour) and give a new big-step operational semantics. We then use it to define two security properties, namely call integrity and noninterference. These two properties have some similarities in their definition, in that they both require that some part of a program is not influenced by the other part. However, we show that the two properties are actually incomparable. Nevertheless, we provide a type system for noninterference and show that well-typed programs satisfy call integrity as well; hence, programs that are accepted by our type system satisfy both properties. We finally discuss the practical usability of the type system and its limitations by means of some simple examples. 2024-05-21T17:57:33Z This is an extended and revised version of the paper appeared in ECOOP 2024 (see v1 of this arXiv paper) Luca Aceto Daniele Gorla Stian Lybech http://arxiv.org/abs/2506.19457v1 The Autonomous Data Language -- Concepts, Design and Formal Verification 2025-06-24T09:38:13Z Nowadays, the main advances in computational power are due to parallelism. However, most parallel languages have been designed with a focus on processors and threads. This makes dealing with data and memory in programs hard, which distances the implementation from its original algorithm. We propose a new paradigm for parallel programming, the data-autonomous paradigm, where computation is performed by autonomous data elements. Programs in this paradigm are focused on making the data collaborate in a highly parallel fashion. We furthermore present AuDaLa, the first data autonomous programming language, and provide a full formalisation that includes a type system and operational semantics. Programming in AuDaLa is very natural, as illustrated by examples, albeit in a style very different from sequential and contemporary parallel programming. Additionally, it lends itself for the formal verification of parallel programs, which we demonstrate. 2025-06-24T09:38:13Z 48 pages, preprint submitted to Elsevier Tom T. P. Franken Thomas Neele Jan Friso Groote 10.1016/j.tcs.2025.115560 http://arxiv.org/abs/2403.14606v3 The Elements of Differentiable Programming 2025-06-24T08:38:16Z Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs. 2024-03-21T17:55:16Z Draft version 3 Mathieu Blondel Vincent Roulet http://arxiv.org/abs/2210.14699v3 Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? 2025-06-23T11:31:33Z Language models are promising solutions for tackling increasing complex problems. In software engineering, they recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, limiting their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Actioning this potential in practice is challenging due to the complex interplay in our study - the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement. 2022-10-26T13:28:14Z 53 pages, 3 Figures (not counted the subfigures), 16 Tables Journal of Systems and Software, Volume 230, December 2025, 112562 Jean-Baptiste Döderlein Nguessan Hermann Kouadio Mathieu Acher Djamel Eddine Khelladi Benoit Combemale 10.1016/j.jss.2025.112562 http://arxiv.org/abs/2410.15179v3 HPVM-HDC: A Heterogeneous Programming System for Accelerating Hyperdimensional Computing 2025-06-21T12:30:05Z Hyperdimensional Computing (HDC), a technique inspired by cognitive models of computation, has been proposed as an efficient and robust alternative basis for machine learning. HDC programs are often manually written in low-level and target specific languages targeting CPUs, GPUs, and FPGAs -- these codes cannot be easily retargeted onto HDC-specific accelerators. No previous programming system enables productive development of HDC programs and generates efficient code for several hardware targets. We propose a heterogeneous programming system for HDC: a novel programming language, HDC++, for writing applications using a unified programming model, including HDC-specific primitives to improve programmability, and a heterogeneous compiler, HPVM-HDC, that provides an intermediate representation for compiling HDC programs to many hardware targets. We implement two tuning optimizations, automatic binarization and reduction perforation, that exploit the error resilient nature of HDC. Our evaluation shows that HPVM-HDC generates performance-competitive code for CPUs and GPUs, achieving a geomean speed-up of 1.17x over optimized baseline CUDA implementations with a geomean reduction in total lines of code of 1.6x across CPUs and GPUs. Additionally, HPVM-HDC targets an HDC Digital ASIC and an HDC ReRAM accelerator simulator, enabling the first execution of HDC applications on these devices. 2024-10-19T18:48:52Z Russel Arbore Xavier Routh Abdul Rafae Noor Akash Kothari Haichao Yang Weihong Xu Sumukh Pinge Minxuan Zhou Tajana Rosing Vikram Adve http://arxiv.org/abs/2412.05784v4 ASC-Hook: fast and transparent system call hook for Arm 2025-06-21T00:50:48Z Intercepting system calls is crucial for tools that aim to modify or monitor application behavior. However, existing system call interception tools on the ARM platform still suffer from limitations in terms of performance and completeness. This paper presents an efficient and comprehensive binary rewriting framework, ASC-Hook, specifically designed for intercepting system calls on the ARM platform. ASC-Hook addresses two key challenges on the ARM architecture: the misalignment of the target address caused by directly replacing the SVC instruction with br x8, and the return to the original control flow after system call interception. This is achieved through a hybrid replacement strategy and our specially designed trampoline mechanism. By implementing multiple completeness strategies specifically for system calls, we ensured comprehensive and thorough interception. Experimental results show that ASC-Hook reduces overhead to at least 1/29 of that of existing system call interception tools. We conducted extensive performance evaluations of ASC-Hook, and the average performance loss for system call-intensive applications is 3.7\% . 2024-12-08T02:30:23Z Accepted to LCTES 2025 (26th ACM SIGPLAN/SIGBED Int. Conf. on Languages, Compilers & Tools for Embedded Systems); 11 pages (incl. appendix), 6 figures. DOI: 10.1145/3735452.3735524 Yang Shen National University of Defense Technology Min Xie National University of Defense Technology Wenzhe Zhang National University of Defense Technology Tao Wu Changsha University of Science and Technology 10.1145/3735452.3735524 http://arxiv.org/abs/2505.20314v3 $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction 2025-06-20T19:06:56Z I present a model of universal parallel computation called $Δ$-Nets, and a method to translate $λ$-terms into $Δ$-nets and back. Together, the model and the method constitute an algorithm for optimal parallel $λ$-reduction, solving the longstanding enigma with groundbreaking clarity. I show that the $λ$-calculus can be understood as a projection of $Δ$-Nets$-$one that severely restricts the structure of sharing, among other drawbacks. Unhindered by these restrictions, the $Δ$-Nets model opens the door to new parallel programming language implementations and computer architectures that are more efficient and performant than previously possible. 2025-05-22T22:23:59Z Daniel Augusto Rizzi Salvadori http://arxiv.org/abs/2506.16883v1 Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine 2025-06-20T10:11:03Z Compared to the more commonly used time-based profiling, allocation profiling provides an alternate view of the execution of allocation heavy dynamically typed languages. However, profiling every single allocation in a program is very inefficient. We present a sampling allocation profiler that is deeply integrated into the garbage collector of PyPy, a Python virtual machine. This integration ensures tunable low overhead for the allocation profiler, which we measure and quantify. Enabling allocation sampling profiling with a sampling period of 4 MB leads to a maximum time overhead of 25% in our benchmarks, over un-profiled regular execution. 2025-06-20T10:11:03Z Christoph Jung C. F. Bolz-Tereick http://arxiv.org/abs/2504.15678v2 Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP) 2025-06-20T02:48:24Z Vector processing is crucial for boosting processor performance and efficiency, particularly with data-parallel tasks. The RISC-V "V" Vector Extension (RVV) enhances algorithm efficiency by supporting vector registers of dynamic sizes and their grouping. Nevertheless, for very long vectors, the static number of RVV vector registers and its power-of-two grouping can lead to performance restrictions. To counteract this limitation, this work introduces Zoozve, a RISC-V vector instruction extension that eliminates the need for strip-mining. Zoozve allows for flexible vector register length and count configurations to boost data computation parallelism. With a data-adaptive register allocation approach, Zoozve permits any register groupings and accurately aligns vector lengths, cutting down register overhead and alleviating performance declines from strip-mining. Additionally, the paper details Zoozve's compiler and hardware implementations using LLVM and SystemVerilog. Initial results indicate Zoozve yields a minimum 10.10$\times$ reduction in dynamic instruction count for fast Fourier transform (FFT), with a mere 5.2\% increase in overall silicon area. 2025-04-22T08:00:52Z 6 pages, 4 figures, LCTES'25 Proceedings of the 26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. (2025), 51-56 Siyi Xu Limin Jiang Yintao Liu Yihao Shen Yi Shi Shan Cao Zhiyuan Jiang 10.1145/3735452.3735526