https://arxiv.org/api/mI/RyzEla92LX494+h0H5orDLpA2026-06-27T02:55:27Z9951147015http://arxiv.org/abs/2506.22169v1MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators2025-06-27T12:31:24ZOperator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the dynamicity of tensor dimension sizes could potentially lead to these operators becoming memory-bound, necessitating the generation of fused kernels, a task hindered by limited search spaces for fusion strategies, redundant memory access, and prolonged tuning time, leading to sub-optimal performance and inefficient deployment.
We introduce MCFuser, a pioneering framework designed to overcome these obstacles by generating high-performance fused kernels for what we define as memory-bound compute-intensive (MBCI) operator chains. Leveraging high-level tiling expressions to delineate a comprehensive search space, coupled with Directed Acyclic Graph (DAG) analysis to eliminate redundant memory accesses, MCFuser streamlines kernel optimization. By implementing guidelines to prune the search space and incorporating an analytical performance model with a heuristic search, MCFuser not only significantly accelerates the tuning process but also demonstrates superior performance. Benchmarked against leading compilers like Ansor on NVIDIA A100 and RTX3080 GPUs, MCFuser achieves up to a 5.9x speedup in kernel performance and outpaces other baselines while reducing tuning time by over 70-fold, showcasing its agility.2025-06-27T12:31:24Z12 pages, accepted at SC 2024SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024Zheng ZhangDonglin YangXiaobo ZhouDazhao Cheng10.1109/SC41406.2024.00040http://arxiv.org/abs/2008.09253v2Describing Console I/O Behavior for Testing Student Submissions in Haskell2025-06-26T13:12:32ZWe present a small, formal language for specifying the behavior of simple console I/O programs. The design is driven by the concrete application case of testing interactive Haskell programs written by students. Specifications are structurally similar to lexical analysis regular expressions, but are augmented with features like global variables that track state and history of program runs, enabling expression of an interesting range of dynamic behavior. We give a semantics for our specification language based on acceptance of execution traces. From this semantics we derive a definition of the set of all traces valid for a given specification. Sampling that set enables us to mechanically check program behavior against specifications in a probabilistic fashion. Beyond testing, other possible uses of the specification language in an education context include related activities like providing more helpful feedback, generating sample solutions, and even generating random exercise tasks.2020-08-21T01:21:59ZIn Proceedings TFPIE 2019 and 2020, arXiv:2008.08923EPTCS 321, 2020, pp. 19-36Oliver WestphalJanis Voigtländer10.4204/EPTCS.321.2http://arxiv.org/abs/2506.20624v1PhasePoly: An Optimization Framework forPhase Polynomials in Quantum Circuits2025-06-25T17:13:16ZQuantum computing has transformative computational power to make classically intractable computing feasible. As the algorithms that achieve practical quantum advantage are beyond manual tuning, quantum circuit optimization has become extremely important and integrated into today's quantum software stack. This paper focuses on a critical type of quantum circuit optimization -- phase-polynomial optimization. Phase polynomials represents a class of building-block circuits that appear frequently in quantum modular exponentials (the most time-consuming component in Shor's factoring algorithm), in quantum approximation optimization algorithms (QAOA), and in Hamiltonian simulations. Compared to prior work on phase polynomials, we focus more on the impact of phase polynomial synthesis in the context of whole-circuit optimization, from single-block phase polynomials to multiple block phase polynomials, from greedy equivalent sub-circuit replacement strategies to a systematic parity matrix optimization approach, and from hardware-oblivious logical circuit optimization to hardware-friendly logical circuit optimization. We also provide a utility of our phase polynomial optimization framework to generate hardware-friendly building blocks. Our experiments demonstrate improvements of up to 50%-with an average total gate reduction of 34.92%-and reductions in the CNOT gate count of up to 48.57%, averaging 28.53%, for logical circuits. Additionally, for physical circuits, we achieve up to 47.65% CNOT gate reduction with an average reduction of 25.47% across a representative set of important benchmarks.2025-06-25T17:13:16Z14 pages, 12 figuresZihan ChenHenry ChenYuwei JinMinghao GuoEnhyeok JangJiakang LiCaitlin ChanWon Woo RoEddy Z. Zhanghttp://arxiv.org/abs/2506.20310v1Unfolding Iterators: Specification and Verification of Higher-Order Iterators, in OCaml2025-06-25T10:48:38ZAlbeit being a central notion of every programming language, formally and modularly reasoning about iteration proves itself to be a non-trivial feat, specially in the context of higher-order iteration. In this paper, we present a generic approach to the specification and deductive verification of higher-order iterators, written in the OCaml language. Our methodology follows two key principles: first, the usage of the Gospel specification language to describe the general behaviour of any iteration schema; second, the usage of the Cameleer framework to deductively verify that every iteration client is correct with respect to its logical specification. To validate our approach we develop a set of verified case studies, ranging from classic list iterators to graph algorithms implemented in the widely used OCamlGraph library.2025-06-25T10:48:38ZIon ChiricaMário Pereirahttp://arxiv.org/abs/2506.20127v1Dynamic Race Detection With O(1) Samples2025-06-25T04:46:35ZHappens before-based dynamic analysis is the go-to technique for detecting data races in large scale software projects due to the absence of false positive reports. However, such analyses are expensive since they employ expensive vector clock updates at each event, rendering them usable only for in-house testing. In this paper, we present a sampling-based, randomized race detector that processes only constantly many events of the input trace even in the worst case. This is the first sub-linear time (i.e., running in o(n) time where n is the length of the trace) dynamic race detection algorithm; previous sampling based approaches like Pacer run in linear time (i.e., O(n)). Our algorithm is a property tester for HB-race detection -- it is sound in that it never reports any false positive, and on traces that are far, with respect to hamming distance, from any race-free trace, the algorithm detects an HB-race with high probability. Our experimental evaluation of the algorithm and its comparison with state-of-the-art deterministic and sampling based race detectors shows that the algorithm does indeed have significantly low running time, and detects races quite often.2025-06-25T04:46:35ZPreprint version of the paper published in PACMPL, Volume 7, Issue POPL (POPL 2023). Official version available at https://doi.org/10.1145/3571238In Proceedings of the ACM on Programming Languages (PACMPL), Volume 7, Issue POPL, January 2023, Article 45, 28 pages. ACMMosaad Al ThokairMinjian ZhangUmang MathurMahesh Viswanathan10.1145/3571238http://arxiv.org/abs/2405.10611v2A Certified Proof Checker for Deep Neural Network Verification in Imandra2025-06-24T15:45:04ZRecent advances in the verification of deep neural networks (DNNs) have opened the way for a broader usage of DNN verification technology in many application areas, including safety-critical ones. However, DNN verifiers are themselves complex programs that have been shown to be susceptible to errors and numerical imprecision; this, in turn, has raised the question of trust in DNN verifiers. One prominent attempt to address this issue is enhancing DNN verifiers with the capability of producing certificates of their results that are subject to independent algorithmic checking. While formulations of Marabou certificate checking already exist on top of the state-of-the-art DNN verifier Marabou, they are implemented in C++, and that code itself raises the question of trust (e.g., in the precision of floating point calculations or guarantees for implementation soundness). Here, we present an alternative implementation of the Marabou certificate checking in Imandra -- an industrial functional programming language and an interactive theorem prover (ITP) -- that allows us to obtain full proof of certificate correctness. The significance of the result is two-fold. Firstly, it gives stronger independent guarantees for Marabou proofs. Secondly, it opens the way for the wider adoption of DNN verifiers in interactive theorem proving in the same way as many ITPs already incorporate SMT solvers.2024-05-17T08:16:32ZAccepted at ITP 2025, Interactive Theorem ProvingRemi DesmartinOmri IsacGrant PassmoreEkaterina KomendantskayaKathrin StarkGuy Katzhttp://arxiv.org/abs/2405.12976v2A Sound Type System for Secure Currency Flow2025-06-24T13:10:41ZIn this paper we focus on TinySol, a minimal calculus for Solidity smart contracts, introduced by Bartoletti et al. We start by rephrasing its syntax (to emphasise its object-oriented flavour) and give a new big-step operational semantics. We then use it to define two security properties, namely call integrity and noninterference. These two properties have some similarities in their definition, in that they both require that some part of a program is not influenced by the other part. However, we show that the two properties are actually incomparable. Nevertheless, we provide a type system for noninterference and show that well-typed programs satisfy call integrity as well; hence, programs that are accepted by our type system satisfy both properties. We finally discuss the practical usability of the type system and its limitations by means of some simple examples.2024-05-21T17:57:33ZThis is an extended and revised version of the paper appeared in ECOOP 2024 (see v1 of this arXiv paper)Luca AcetoDaniele GorlaStian Lybechhttp://arxiv.org/abs/2506.19457v1The Autonomous Data Language -- Concepts, Design and Formal Verification2025-06-24T09:38:13ZNowadays, the main advances in computational power are due to parallelism. However, most parallel languages have been designed with a focus on processors and threads. This makes dealing with data and memory in programs hard, which distances the implementation from its original algorithm. We propose a new paradigm for parallel programming, the data-autonomous paradigm, where computation is performed by autonomous data elements. Programs in this paradigm are focused on making the data collaborate in a highly parallel fashion. We furthermore present AuDaLa, the first data autonomous programming language, and provide a full formalisation that includes a type system and operational semantics. Programming in AuDaLa is very natural, as illustrated by examples, albeit in a style very different from sequential and contemporary parallel programming. Additionally, it lends itself for the formal verification of parallel programs, which we demonstrate.2025-06-24T09:38:13Z48 pages, preprint submitted to ElsevierTom T. P. FrankenThomas NeeleJan Friso Groote10.1016/j.tcs.2025.115560http://arxiv.org/abs/2403.14606v3The Elements of Differentiable Programming2025-06-24T08:38:16ZArtificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.2024-03-21T17:55:16ZDraft version 3Mathieu BlondelVincent Roulethttp://arxiv.org/abs/2210.14699v3Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?2025-06-23T11:31:33ZLanguage models are promising solutions for tackling increasing complex problems. In software engineering, they recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, limiting their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Actioning this potential in practice is challenging due to the complex interplay in our study - the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.2022-10-26T13:28:14Z53 pages, 3 Figures (not counted the subfigures), 16 TablesJournal of Systems and Software, Volume 230, December 2025, 112562Jean-Baptiste DöderleinNguessan Hermann KouadioMathieu AcherDjamel Eddine KhelladiBenoit Combemale10.1016/j.jss.2025.112562http://arxiv.org/abs/2410.15179v3HPVM-HDC: A Heterogeneous Programming System for Accelerating Hyperdimensional Computing2025-06-21T12:30:05ZHyperdimensional Computing (HDC), a technique inspired by cognitive models of computation, has been proposed as an efficient and robust alternative basis for machine learning. HDC programs are often manually written in low-level and target specific languages targeting CPUs, GPUs, and FPGAs -- these codes cannot be easily retargeted onto HDC-specific accelerators. No previous programming system enables productive development of HDC programs and generates efficient code for several hardware targets.
We propose a heterogeneous programming system for HDC: a novel programming language, HDC++, for writing applications using a unified programming model, including HDC-specific primitives to improve programmability, and a heterogeneous compiler, HPVM-HDC, that provides an intermediate representation for compiling HDC programs to many hardware targets. We implement two tuning optimizations, automatic binarization and reduction perforation, that exploit the error resilient nature of HDC. Our evaluation shows that HPVM-HDC generates performance-competitive code for CPUs and GPUs, achieving a geomean speed-up of 1.17x over optimized baseline CUDA implementations with a geomean reduction in total lines of code of 1.6x across CPUs and GPUs. Additionally, HPVM-HDC targets an HDC Digital ASIC and an HDC ReRAM accelerator simulator, enabling the first execution of HDC applications on these devices.2024-10-19T18:48:52ZRussel ArboreXavier RouthAbdul Rafae NoorAkash KothariHaichao YangWeihong XuSumukh PingeMinxuan ZhouTajana RosingVikram Advehttp://arxiv.org/abs/2412.05784v4ASC-Hook: fast and transparent system call hook for Arm2025-06-21T00:50:48ZIntercepting system calls is crucial for tools that aim to modify or monitor application behavior. However, existing system call interception tools on the ARM platform still suffer from limitations in terms of performance and completeness. This paper presents an efficient and comprehensive binary rewriting framework, ASC-Hook, specifically designed for intercepting system calls on the ARM platform. ASC-Hook addresses two key challenges on the ARM architecture: the misalignment of the target address caused by directly replacing the SVC instruction with br x8, and the return to the original control flow after system call interception. This is achieved through a hybrid replacement strategy and our specially designed trampoline mechanism. By implementing multiple completeness strategies specifically for system calls, we ensured comprehensive and thorough interception. Experimental results show that ASC-Hook reduces overhead to at least 1/29 of that of existing system call interception tools. We conducted extensive performance evaluations of ASC-Hook, and the average performance loss for system call-intensive applications is 3.7\% .2024-12-08T02:30:23ZAccepted to LCTES 2025 (26th ACM SIGPLAN/SIGBED Int. Conf. on Languages, Compilers & Tools for Embedded Systems); 11 pages (incl. appendix), 6 figures. DOI: 10.1145/3735452.3735524Yang ShenNational University of Defense TechnologyMin XieNational University of Defense TechnologyWenzhe ZhangNational University of Defense TechnologyTao WuChangsha University of Science and Technology10.1145/3735452.3735524http://arxiv.org/abs/2505.20314v3$Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction2025-06-20T19:06:56ZI present a model of universal parallel computation called $Δ$-Nets, and a method to translate $λ$-terms into $Δ$-nets and back. Together, the model and the method constitute an algorithm for optimal parallel $λ$-reduction, solving the longstanding enigma with groundbreaking clarity. I show that the $λ$-calculus can be understood as a projection of $Δ$-Nets$-$one that severely restricts the structure of sharing, among other drawbacks. Unhindered by these restrictions, the $Δ$-Nets model opens the door to new parallel programming language implementations and computer architectures that are more efficient and performant than previously possible.2025-05-22T22:23:59ZDaniel Augusto Rizzi Salvadorihttp://arxiv.org/abs/2506.16883v1Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine2025-06-20T10:11:03ZCompared to the more commonly used time-based profiling, allocation profiling provides an alternate view of the execution of allocation heavy dynamically typed languages. However, profiling every single allocation in a program is very inefficient. We present a sampling allocation profiler that is deeply integrated into the garbage collector of PyPy, a Python virtual machine. This integration ensures tunable low overhead for the allocation profiler, which we measure and quantify. Enabling allocation sampling profiling with a sampling period of 4 MB leads to a maximum time overhead of 25% in our benchmarks, over un-profiled regular execution.2025-06-20T10:11:03ZChristoph JungC. F. Bolz-Tereickhttp://arxiv.org/abs/2504.15678v2Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP)2025-06-20T02:48:24ZVector processing is crucial for boosting processor performance and efficiency, particularly with data-parallel tasks. The RISC-V "V" Vector Extension (RVV) enhances algorithm efficiency by supporting vector registers of dynamic sizes and their grouping. Nevertheless, for very long vectors, the static number of RVV vector registers and its power-of-two grouping can lead to performance restrictions. To counteract this limitation, this work introduces Zoozve, a RISC-V vector instruction extension that eliminates the need for strip-mining. Zoozve allows for flexible vector register length and count configurations to boost data computation parallelism. With a data-adaptive register allocation approach, Zoozve permits any register groupings and accurately aligns vector lengths, cutting down register overhead and alleviating performance declines from strip-mining. Additionally, the paper details Zoozve's compiler and hardware implementations using LLVM and SystemVerilog. Initial results indicate Zoozve yields a minimum 10.10$\times$ reduction in dynamic instruction count for fast Fourier transform (FFT), with a mere 5.2\% increase in overall silicon area.2025-04-22T08:00:52Z6 pages, 4 figures, LCTES'25Proceedings of the 26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. (2025), 51-56Siyi XuLimin JiangYintao LiuYihao ShenYi ShiShan CaoZhiyuan Jiang10.1145/3735452.3735526