https://arxiv.org/api/zDKvO2AP5s5pvTQBoTUF0AFcZbg2026-06-28T19:55:12Z9951166515http://arxiv.org/abs/2504.10813v1Enhanced Data Race Prediction Through Modular Reasoning2025-04-15T02:22:58ZThere are two orthogonal methodologies for efficient prediction of data races from concurrent program runs: commutativity and prefix reasoning. There are several instances of each methodology in the literature, with the goal of predicting data races using a streaming algorithm where the required memory does not grow proportional to the length of the observed run, but these instances were mostly created in an ad hoc manner, without much attention to their unifying underlying principles. In this paper, we identify and formalize these principles for each category with the ultimate goal of paving the way for combining them into a new algorithm which shares their efficiency characteristics but offers strictly more prediction power. In particular, we formalize three distinct classes of races predictable using commutativity reasoning, and compare them. We identify three different styles of prefix reasoning, and prove that they predict the same class of races, which provably contains all races predictable by any commutativity reasoning technique.
Our key contribution is combining prefix reasoning and commutativity reasoning in a modular way to introduce a new class of races, granular prefix races, that are predictable in constant-space and linear time, in a streaming fashion. This class of races includes all races predictable using commutativity and prefix reasoning techniques. We present an improved constant-space algorithm for prefix reasoning alone based on the idea of antichains (from language theory). This improved algorithm is the stepping stone that is required to devise an efficient algorithm for prediction of granular prefix races. We present experimental results to demonstrate the expressive power and performance of our new algorithm.2025-04-15T02:22:58ZZhendong AngAzadeh FarzanUmang Mathurhttp://arxiv.org/abs/2504.10314v1Universal Algebra and Effectful Computation2025-04-14T15:23:06ZAbstract clones serve as an algebraic presentation of the syntax of a simple type theory. From the perspective of universal algebra, they define algebraic theories like those of groups, monoids and rings. This link allows one to study the language of simple type theory from the viewpoint of universal algebra.
Programming languages, however, are much more complicated than simple type theory. Many useful features like reading, writing, and exception handling involve interacting with the environment; these are called side-effects. Algebraic presentations for languages with the appropriate syntax for handling effects are given by premulticategories and effectful multicategories. We study these structures with the aim of defining a suitable notion of an algebra.
To achieve this goal, we proceed in two steps. First, we define a tensor on $[\to,\category{Set}]$, and show that this tensor along with the cartesian product gives the category a duoidal structure. Secondly, we introduce the novel notion of a multicategory enriched in a duoidal category which generalize the traditional notion of a multicategory. Further, we prove that an effectful multicategory is the same as a multicategory enriched in the duoidal category $[\to,\category{Set}]$. This result places multicategories and effectful multicategories on a similar footing, and provides a mechanism for transporting concepts from the theory of multicategories (which model pure computation) to the theory of effectful multicategories (which model effectful computation). As an example of this, we generalize the definition of a 2-morphism for multicategories to the duoidally enriched case. Our equivalence result then gives a natural definition of a 2-morphism for effectful multicategories, which we then use to define the notion of an algebra.2025-04-14T15:23:06Z98 pages, dissertation submitted towards the degree of MSc in Mathematics and Foundations of Computer Science, University of Oxford 2023Nayan Rajeshhttp://arxiv.org/abs/2504.10159v1Monadic type-and-effect soundness2025-04-14T12:13:44ZWe introduce the abstract notions of "monadic operational semantics", a small-step semantics where computational effects are modularly modeled by a monad, and "type-and-effect system", including "effect types" whose interpretation lifts well-typedness to its monadic version. In this meta-theory, as usually done in the non-monadic case, we can express progress and subject reduction properties and provide a proof, given once and for all, that they imply soundness. The approach is illustrated on a lambda calculus with generic effects. We equip the calculus with an expressive type-and-effect system, and provide proofs of progress and subject reduction which are parametric on the interpretation of effect types. In this way, we obtain as instances many significant examples, such as checking exceptions, preventing/limiting non-determinism, constraining order/fairness of outputs on different locations. We also provide an extension with constructs to raise and handle computational effects, which can be instantiated to model different policies.2025-04-14T12:13:44ZFrancesco DagninoPaola GianniniElena Zuccahttp://arxiv.org/abs/2504.09870v1Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures2025-04-14T04:29:46ZIrregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Conversely from other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.2025-04-14T04:29:46Z14 pages, 19 figures, under reviewMarco SiracusaBarcelona Supercomputing CenterOlivia HsuStanford UniversityVictor Soria-PardosBarcelona Supercomputing CenterJoshua RandallArmArnaud GrassetArmEric BiscondiArmDoug JosephArmRandy AllenBarcelona Supercomputing CenterFredrik KjolstadStanford UniversityMiquel Moretó PlanasBarcelona Supercomputing CenterUniversitat Politècnica de CatalunyaAdrià ArmejachBarcelona Supercomputing CenterUniversitat Politècnica de Catalunyahttp://arxiv.org/abs/2502.11901v2Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity2025-04-13T23:38:44ZExisting LMs struggle with proof-oriented programming due to data scarcity, which manifest in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process when performing proof-oriented programming. We present the first on synthetic data augmentation for project level proof oriented programming for both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language; incorporating diverse coding data for reasoning capability elicitation and creating new proofs and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperforms GPT-4o in project-level proof-oriented programming by 64% relative margin, and can improve GPT-4o's performance by 54% by repairing its outputs over GPT-4o's self-repair.2025-02-17T15:24:11ZDylan ZhangJustin WangTianran Sunhttp://arxiv.org/abs/2504.09234v1Unleashing Optimizations in Dynamic Circuits through Branch Expansion2025-04-12T14:18:27ZDynamic quantum circuits enable adaptive operations through intermediate measurements and classical feedback. Current transpilation toolchains, such as Qiskit and T$\ket{\text{ket}}$, however, fail to fully exploit branch-specific simplifications. In this work, we propose recursive branch expansion as a novel technique which systematically expands and refines conditional branches. Our method complements existing transpilers by creating additional opportunities for branch-specific simplifications without altering the overall circuit functionality. Using randomly generated circuits with varying patterns and scales, we demonstrate that our method consistently reduces the depth and gate count of execution paths of dynamic circuits. We also showcase the potential of our method to enable optimizations on error-corrected circuits.2025-04-12T14:18:27ZAccepted by Computing Frontiers 2025Yanbin Chen10.1145/3719276.3725187http://arxiv.org/abs/2504.08946v1Incremental Bidirectional Typing via Order Maintenance2025-04-11T19:57:58ZLive programming environments provide various semantic services, including type checking and evaluation, continuously as the user is editing the program. The live paradigm promises to improve the developer experience, but liveness is an implementation challenge particularly when working with large programs. This paper specifies and efficiently implements a system the is able to incrementally update type information for a live program in response to fine-grained program edits. This information includes type error marks and information about the expected and actual type on every expression. The system is specified type-theoretically as a small-step dynamics that propagates updates through the marked and annotated program. Most updates flow according to a base bidirectional type system. Additional pointers are maintained to connect bound variables to their binding locations, with edits traversing these pointers directly. Order maintenance data structures are employed to efficiently maintain these pointers and to prioritize the order of update propagation. We prove this system is equivalent to naive re-analysis in the Agda theorem prover, along with other important metatheoretic properties. We then implement it efficiently in OCaml, detailing a number of impactful optimizations. We evaluate this implementation's performance with a large stress-test and find that it is able to achieve dramatic speed-ups of 275.96$\times$ compared to from-scratch reanalysis.2025-04-11T19:57:58Z35 pages, 16 figuresThomas J. PorterMarisa KirisameIvan WeiPavel PanchekhaCyrus Omarhttp://arxiv.org/abs/2410.13262v2Membership Testing for Semantic Regular Expressions2025-04-11T03:43:11ZSMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions that extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem:
First, We present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a $\approx 2 \times$ overhead over the time needed for interaction with the oracle.
Next, We establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms which are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $Ω(|w|^2)$ lower bound on the number of oracle queries necessary to make this determination.2024-10-17T06:33:16ZYifei HuangMatin AminiAlexis Le GlaunecKonstantinos MamourasMukund Raghothaman10.1145/3729300http://arxiv.org/abs/2504.08126v1The nature of loops in programming2025-04-10T20:58:55ZIn program semantics and verification, reasoning about loops is complicated by the need to produce two separate mathematical arguments: an invariant, for functional properties (ignoring termination); and a variant, for termination (ignoring functional properties). A single and simple definition is possible, removing this split. A loop is just the limit (a variant of the reflexive transitive closure) of a Noetherian (well-founded) relation. To prove the loop correct there is no need to devise an invariant and a variant; it suffices to identify the relation, yielding both partial correctness and termination. The present note develops the (small) theory and applies it to standard loop examples and proofs of their correctness.2025-04-10T20:58:55ZBertrand Meyerhttp://arxiv.org/abs/2503.22832v2L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution2025-04-10T18:45:37ZComplex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.2025-03-28T18:54:56ZSimeng SunCheng-Ping HsiehFaisal LadhakErik ArakelyanSantiago Akle SeranoBoris Ginsburghttp://arxiv.org/abs/2502.18728v2Scaling Optimization Over Uncertainty via Compilation2025-04-10T17:57:26ZProbabilistic inference is fundamentally hard, yet many tasks require optimization on top of inference, which is even harder. We present a new optimization-via-compilation strategy to scalably solve a certain class of such problems. In particular, we introduce a new intermediate representation (IR), binary decision diagrams weighted by a novel notion of branch-and-bound semiring, that enables a scalable branch-and-bound based optimization procedure. This IR automatically factorizes problems through program structure and prunes suboptimal values via a straightforward branch-and-bound style algorithm to find optima. Additionally, the IR is naturally amenable to staged compilation, allowing the programmer to query for optima mid-compilation to inform further executions of the program. We showcase the effectiveness and flexibility of the IR by implementing two performant languages that both compile to it: dappl and pineappl. dappl is a functional language that solves maximum expected utility problems with first-class support for rewards, decision making, and conditioning. pineappl is an imperative language that performs exact probabilistic inference with support for nested marginal maximum a posteriori (MMAP) optimization via staging.2025-02-26T00:45:13Z51 pages, 23 Figures, Accepted to OOPSLA R1Minsung ChoJohn GouwarSteven Holtzen10.1145/3720500http://arxiv.org/abs/2403.15122v3A Hybrid Approach to Semi-automated Rust Verification2025-04-10T11:04:50ZWe propose a hybrid approach to end-to-end Rust verification where the proof effort is split into powerful automated verification of safe Rust and targeted semi-automated verification of unsafe Rust. To this end, we present Gillian-Rust, a proof-of-concept semi-automated verification tool built on top of the Gillian platform that can reason about type safety and functional correctness of unsafe code. Gillian-Rust automates a rich separation logic for real-world Rust, embedding the lifetime logic of RustBelt and the parametric prophecies of RustHornBelt, and is able to verify real-world Rust standard library code with only minor annotations and with verification times orders of magnitude faster than those of comparable tools. We link Gillian-Rust with Creusot, a state-of-the-art verifier for safe Rust, by providing a systematic encoding of unsafe code specifications that Creusot can use but cannot verify, demonstrating the feasibility of our hybrid approach.2024-03-22T11:24:31Z28 pages, extended version of a PLDI'25 paperSacha-Élie AyounXavier DenisPetar MaksimovićPhilippa Gardnerhttp://arxiv.org/abs/2311.02769v3COGNAC: Circuit Optimization via Gradients and Noise-Aware Compilation2025-04-09T19:39:13ZWe present COGNAC, a novel strategy for compiling quantum circuits based on numerical optimization algorithms from scientific computing. Observing that shorter-duration "partially entangling" gates tend to be less noisy than the typical "maximally entangling" gates, we use a simple and versatile noise model to construct a differentiable cost function. Standard gradient-based optimization algorithms running on a GPU can then quickly converge to a local optimum that closely approximates the target unitary. By reducing rotation angles to zero, COGNAC removes gates from a circuit, producing smaller quantum circuits. We have implemented this technique as a general-purpose Qiskit compiler plugin and compared performance with state-of-the-art optimizers on a variety of standard benchmarks. Testing our compiled circuits on superconducting quantum hardware, we find that COGNAC's optimizations produce circuits that are substantially less noisy than those produced by existing optimizers. These runtime performance gains come without a major compile-time cost, as COGNAC's parallelism allows it to retain a competitive optimization speed.2023-11-05T20:59:27Z17 pages, 10 figuresFinn VoichickLeonidas LampropoulosRobert Randhttp://arxiv.org/abs/2504.07004v1Task-Based Tensor Computations on Modern GPUs2025-04-09T16:24:15ZDomain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called \emph{tasks} that operate on \emph{tensors} and are free of communication and synchronization. Cypress programs are bound to the target machine through a \emph{mapping} specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x-0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.2025-04-09T16:24:15ZRohan YadavMichael GarlandAlex AikenMichael Bauerhttp://arxiv.org/abs/2504.06975v1AWDIT: An Optimal Weak Database Isolation Tester2025-04-09T15:30:09ZIn order to achieve low latency, high throughput, and partition tolerance, modern databases forgo strong transaction isolation for weak isolation guarantees. However, several production databases have been found to suffer from isolation bugs, breaking their data-consistency contract. Black-box testing is a prominent technique for detecting isolation bugs, by checking whether histories of database transactions adhere to a prescribed isolation level.
Testing databases on realistic workloads of large size requires isolation testers to be as efficient as possible, a requirement that has initiated a study of the complexity of isolation testing. Although testing strong isolation has been known to be NP-complete, weak isolation levels were recently shown to be testable in polynomial time, which has propelled the scalability of testing tools. However, existing testers have a large polynomial complexity, restricting testing to workloads of only moderate size, which is not typical of large-scale databases.
In this work, we develop AWDIT, a highly-efficient and provably optimal tester for weak database isolation. Given a history $H$ of size $n$ and $k$ sessions, AWDIT tests whether H satisfies the most common weak isolation levels of Read Committed (RC), Read Atomic (RA), and Causal Consistency (CC) in time $O(n^{3/2})$, $O(n^{3/2})$, and $O(n \cdot k)$, respectively, improving significantly over the state of the art. Moreover, we prove that AWDIT is essentially optimal, in the sense that there is a conditional lower bound of $n^{3/2}$ for any weak isolation level between RC and CC. Our experiments show that AWDIT is significantly faster than existing, highly optimized testers; e.g., for the $\sim$20% largest histories, AWDIT obtains an average speedup of $245\times$, $193\times$, and $62\times$ for RC, RA, and CC, respectively, over the best baseline.2025-04-09T15:30:09Z31 pages, 9 figures, 1 table; accepted for PLDI 2025Lasse MøldrupAndreas Pavlogiannis10.1145/3729339