https://arxiv.org/api/zDKvO2AP5s5pvTQBoTUF0AFcZbg 2026-06-28T19:55:12Z 9951 1665 15 http://arxiv.org/abs/2504.10813v1 Enhanced Data Race Prediction Through Modular Reasoning 2025-04-15T02:22:58Z

There are two orthogonal methodologies for efficient prediction of data races from concurrent program runs: commutativity and prefix reasoning. There are several instances of each methodology in the literature, with the goal of predicting data races using a streaming algorithm where the required memory does not grow proportional to the length of the observed run, but these instances were mostly created in an ad hoc manner, without much attention to their unifying underlying principles. In this paper, we identify and formalize these principles for each category with the ultimate goal of paving the way for combining them into a new algorithm which shares their efficiency characteristics but offers strictly more prediction power. In particular, we formalize three distinct classes of races predictable using commutativity reasoning, and compare them. We identify three different styles of prefix reasoning, and prove that they predict the same class of races, which provably contains all races predictable by any commutativity reasoning technique. Our key contribution is combining prefix reasoning and commutativity reasoning in a modular way to introduce a new class of races, granular prefix races, that are predictable in constant-space and linear time, in a streaming fashion. This class of races includes all races predictable using commutativity and prefix reasoning techniques. We present an improved constant-space algorithm for prefix reasoning alone based on the idea of antichains (from language theory). This improved algorithm is the stepping stone that is required to devise an efficient algorithm for prediction of granular prefix races. We present experimental results to demonstrate the expressive power and performance of our new algorithm.

2025-04-15T02:22:58Z Zhendong Ang Azadeh Farzan Umang Mathur http://arxiv.org/abs/2504.10314v1 Universal Algebra and Effectful Computation 2025-04-14T15:23:06Z

Abstract clones serve as an algebraic presentation of the syntax of a simple type theory. From the perspective of universal algebra, they define algebraic theories like those of groups, monoids and rings. This link allows one to study the language of simple type theory from the viewpoint of universal algebra. Programming languages, however, are much more complicated than simple type theory. Many useful features like reading, writing, and exception handling involve interacting with the environment; these are called side-effects. Algebraic presentations for languages with the appropriate syntax for handling effects are given by premulticategories and effectful multicategories. We study these structures with the aim of defining a suitable notion of an algebra. To achieve this goal, we proceed in two steps. First, we define a tensor on $[\to,\category{Set}]$, and show that this tensor along with the cartesian product gives the category a duoidal structure. Secondly, we introduce the novel notion of a multicategory enriched in a duoidal category which generalize the traditional notion of a multicategory. Further, we prove that an effectful multicategory is the same as a multicategory enriched in the duoidal category $[\to,\category{Set}]$. This result places multicategories and effectful multicategories on a similar footing, and provides a mechanism for transporting concepts from the theory of multicategories (which model pure computation) to the theory of effectful multicategories (which model effectful computation). As an example of this, we generalize the definition of a 2-morphism for multicategories to the duoidally enriched case. Our equivalence result then gives a natural definition of a 2-morphism for effectful multicategories, which we then use to define the notion of an algebra.

2025-04-14T15:23:06Z 98 pages, dissertation submitted towards the degree of MSc in Mathematics and Foundations of Computer Science, University of Oxford 2023 Nayan Rajesh http://arxiv.org/abs/2504.10159v1 Monadic type-and-effect soundness 2025-04-14T12:13:44Z

We introduce the abstract notions of "monadic operational semantics", a small-step semantics where computational effects are modularly modeled by a monad, and "type-and-effect system", including "effect types" whose interpretation lifts well-typedness to its monadic version. In this meta-theory, as usually done in the non-monadic case, we can express progress and subject reduction properties and provide a proof, given once and for all, that they imply soundness. The approach is illustrated on a lambda calculus with generic effects. We equip the calculus with an expressive type-and-effect system, and provide proofs of progress and subject reduction which are parametric on the interpretation of effect types. In this way, we obtain as instances many significant examples, such as checking exceptions, preventing/limiting non-determinism, constraining order/fairness of outputs on different locations. We also provide an extension with constructs to raise and handle computational effects, which can be instantiated to model different policies.

2025-04-14T12:13:44Z Francesco Dagnino Paola Giannini Elena Zucca http://arxiv.org/abs/2504.09870v1 Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures 2025-04-14T04:29:46Z

Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Conversely from other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.

2025-04-14T04:29:46Z 14 pages, 19 figures, under review Marco Siracusa Barcelona Supercomputing Center Olivia Hsu Stanford University Victor Soria-Pardos Barcelona Supercomputing Center Joshua Randall Arm Arnaud Grasset Arm Eric Biscondi Arm Doug Joseph Arm Randy Allen Barcelona Supercomputing Center Fredrik Kjolstad Stanford University Miquel Moretó Planas Barcelona Supercomputing Center Universitat Politècnica de Catalunya Adrià Armejach Barcelona Supercomputing Center Universitat Politècnica de Catalunya http://arxiv.org/abs/2502.11901v2 Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity 2025-04-13T23:38:44Z

Existing LMs struggle with proof-oriented programming due to data scarcity, which manifest in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process when performing proof-oriented programming. We present the first on synthetic data augmentation for project level proof oriented programming for both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language; incorporating diverse coding data for reasoning capability elicitation and creating new proofs and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperforms GPT-4o in project-level proof-oriented programming by 64% relative margin, and can improve GPT-4o's performance by 54% by repairing its outputs over GPT-4o's self-repair.

2025-02-17T15:24:11Z Dylan Zhang Justin Wang Tianran Sun http://arxiv.org/abs/2504.09234v1 Unleashing Optimizations in Dynamic Circuits through Branch Expansion 2025-04-12T14:18:27Z

Dynamic quantum circuits enable adaptive operations through intermediate measurements and classical feedback. Current transpilation toolchains, such as Qiskit and T$\ket{\text{ket}}$, however, fail to fully exploit branch-specific simplifications. In this work, we propose recursive branch expansion as a novel technique which systematically expands and refines conditional branches. Our method complements existing transpilers by creating additional opportunities for branch-specific simplifications without altering the overall circuit functionality. Using randomly generated circuits with varying patterns and scales, we demonstrate that our method consistently reduces the depth and gate count of execution paths of dynamic circuits. We also showcase the potential of our method to enable optimizations on error-corrected circuits.

2025-04-12T14:18:27Z Accepted by Computing Frontiers 2025 Yanbin Chen 10.1145/3719276.3725187 http://arxiv.org/abs/2504.08946v1 Incremental Bidirectional Typing via Order Maintenance 2025-04-11T19:57:58Z

Live programming environments provide various semantic services, including type checking and evaluation, continuously as the user is editing the program. The live paradigm promises to improve the developer experience, but liveness is an implementation challenge particularly when working with large programs. This paper specifies and efficiently implements a system the is able to incrementally update type information for a live program in response to fine-grained program edits. This information includes type error marks and information about the expected and actual type on every expression. The system is specified type-theoretically as a small-step dynamics that propagates updates through the marked and annotated program. Most updates flow according to a base bidirectional type system. Additional pointers are maintained to connect bound variables to their binding locations, with edits traversing these pointers directly. Order maintenance data structures are employed to efficiently maintain these pointers and to prioritize the order of update propagation. We prove this system is equivalent to naive re-analysis in the Agda theorem prover, along with other important metatheoretic properties. We then implement it efficiently in OCaml, detailing a number of impactful optimizations. We evaluate this implementation's performance with a large stress-test and find that it is able to achieve dramatic speed-ups of 275.96$\times$ compared to from-scratch reanalysis.

2025-04-11T19:57:58Z 35 pages, 16 figures Thomas J. Porter Marisa Kirisame Ivan Wei Pavel Panchekha Cyrus Omar http://arxiv.org/abs/2410.13262v2 Membership Testing for Semantic Regular Expressions 2025-04-11T03:43:11Z

SMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions that extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem: First, We present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a $\approx 2 \times$ overhead over the time needed for interaction with the oracle. Next, We establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms which are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $Ω(|w|^2)$ lower bound on the number of oracle queries necessary to make this determination.

2024-10-17T06:33:16Z Yifei Huang Matin Amini Alexis Le Glaunec Konstantinos Mamouras Mukund Raghothaman 10.1145/3729300 http://arxiv.org/abs/2504.08126v1 The nature of loops in programming 2025-04-10T20:58:55Z

In program semantics and verification, reasoning about loops is complicated by the need to produce two separate mathematical arguments: an invariant, for functional properties (ignoring termination); and a variant, for termination (ignoring functional properties). A single and simple definition is possible, removing this split. A loop is just the limit (a variant of the reflexive transitive closure) of a Noetherian (well-founded) relation. To prove the loop correct there is no need to devise an invariant and a variant; it suffices to identify the relation, yielding both partial correctness and termination. The present note develops the (small) theory and applies it to standard loop examples and proofs of their correctness.

2025-04-10T20:58:55Z Bertrand Meyer http://arxiv.org/abs/2503.22832v2 L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution 2025-04-10T18:45:37Z

Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.

2025-03-28T18:54:56Z Simeng Sun Cheng-Ping Hsieh Faisal Ladhak Erik Arakelyan Santiago Akle Serano Boris Ginsburg http://arxiv.org/abs/2502.18728v2 Scaling Optimization Over Uncertainty via Compilation 2025-04-10T17:57:26Z

Probabilistic inference is fundamentally hard, yet many tasks require optimization on top of inference, which is even harder. We present a new optimization-via-compilation strategy to scalably solve a certain class of such problems. In particular, we introduce a new intermediate representation (IR), binary decision diagrams weighted by a novel notion of branch-and-bound semiring, that enables a scalable branch-and-bound based optimization procedure. This IR automatically factorizes problems through program structure and prunes suboptimal values via a straightforward branch-and-bound style algorithm to find optima. Additionally, the IR is naturally amenable to staged compilation, allowing the programmer to query for optima mid-compilation to inform further executions of the program. We showcase the effectiveness and flexibility of the IR by implementing two performant languages that both compile to it: dappl and pineappl. dappl is a functional language that solves maximum expected utility problems with first-class support for rewards, decision making, and conditioning. pineappl is an imperative language that performs exact probabilistic inference with support for nested marginal maximum a posteriori (MMAP) optimization via staging.

2025-02-26T00:45:13Z 51 pages, 23 Figures, Accepted to OOPSLA R1 Minsung Cho John Gouwar Steven Holtzen 10.1145/3720500 http://arxiv.org/abs/2403.15122v3 A Hybrid Approach to Semi-automated Rust Verification 2025-04-10T11:04:50Z

We propose a hybrid approach to end-to-end Rust verification where the proof effort is split into powerful automated verification of safe Rust and targeted semi-automated verification of unsafe Rust. To this end, we present Gillian-Rust, a proof-of-concept semi-automated verification tool built on top of the Gillian platform that can reason about type safety and functional correctness of unsafe code. Gillian-Rust automates a rich separation logic for real-world Rust, embedding the lifetime logic of RustBelt and the parametric prophecies of RustHornBelt, and is able to verify real-world Rust standard library code with only minor annotations and with verification times orders of magnitude faster than those of comparable tools. We link Gillian-Rust with Creusot, a state-of-the-art verifier for safe Rust, by providing a systematic encoding of unsafe code specifications that Creusot can use but cannot verify, demonstrating the feasibility of our hybrid approach.

2024-03-22T11:24:31Z 28 pages, extended version of a PLDI'25 paper Sacha-Élie Ayoun Xavier Denis Petar Maksimović Philippa Gardner http://arxiv.org/abs/2311.02769v3 COGNAC: Circuit Optimization via Gradients and Noise-Aware Compilation 2025-04-09T19:39:13Z

We present COGNAC, a novel strategy for compiling quantum circuits based on numerical optimization algorithms from scientific computing. Observing that shorter-duration "partially entangling" gates tend to be less noisy than the typical "maximally entangling" gates, we use a simple and versatile noise model to construct a differentiable cost function. Standard gradient-based optimization algorithms running on a GPU can then quickly converge to a local optimum that closely approximates the target unitary. By reducing rotation angles to zero, COGNAC removes gates from a circuit, producing smaller quantum circuits. We have implemented this technique as a general-purpose Qiskit compiler plugin and compared performance with state-of-the-art optimizers on a variety of standard benchmarks. Testing our compiled circuits on superconducting quantum hardware, we find that COGNAC's optimizations produce circuits that are substantially less noisy than those produced by existing optimizers. These runtime performance gains come without a major compile-time cost, as COGNAC's parallelism allows it to retain a competitive optimization speed.

2023-11-05T20:59:27Z 17 pages, 10 figures Finn Voichick Leonidas Lampropoulos Robert Rand http://arxiv.org/abs/2504.07004v1 Task-Based Tensor Computations on Modern GPUs 2025-04-09T16:24:15Z

Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called \emph{tasks} that operate on \emph{tensors} and are free of communication and synchronization. Cypress programs are bound to the target machine through a \emph{mapping} specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x-0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.

2025-04-09T16:24:15Z Rohan Yadav Michael Garland Alex Aiken Michael Bauer http://arxiv.org/abs/2504.06975v1 AWDIT: An Optimal Weak Database Isolation Tester 2025-04-09T15:30:09Z

In order to achieve low latency, high throughput, and partition tolerance, modern databases forgo strong transaction isolation for weak isolation guarantees. However, several production databases have been found to suffer from isolation bugs, breaking their data-consistency contract. Black-box testing is a prominent technique for detecting isolation bugs, by checking whether histories of database transactions adhere to a prescribed isolation level. Testing databases on realistic workloads of large size requires isolation testers to be as efficient as possible, a requirement that has initiated a study of the complexity of isolation testing. Although testing strong isolation has been known to be NP-complete, weak isolation levels were recently shown to be testable in polynomial time, which has propelled the scalability of testing tools. However, existing testers have a large polynomial complexity, restricting testing to workloads of only moderate size, which is not typical of large-scale databases. In this work, we develop AWDIT, a highly-efficient and provably optimal tester for weak database isolation. Given a history $H$ of size $n$ and $k$ sessions, AWDIT tests whether H satisfies the most common weak isolation levels of Read Committed (RC), Read Atomic (RA), and Causal Consistency (CC) in time $O(n^{3/2})$, $O(n^{3/2})$, and $O(n \cdot k)$, respectively, improving significantly over the state of the art. Moreover, we prove that AWDIT is essentially optimal, in the sense that there is a conditional lower bound of $n^{3/2}$ for any weak isolation level between RC and CC. Our experiments show that AWDIT is significantly faster than existing, highly optimized testers; e.g., for the $\sim$20% largest histories, AWDIT obtains an average speedup of $245\times$, $193\times$, and $62\times$ for RC, RA, and CC, respectively, over the best baseline.

2025-04-09T15:30:09Z 31 pages, 9 figures, 1 table; accepted for PLDI 2025 Lasse Møldrup Andreas Pavlogiannis 10.1145/3729339