Likelihood hacking in probabilistic program synthesis

2026-03-25T09:39:15Z

When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}_{\text{safe}}$'s conditions as $\texttt{SafeStan}$, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.

ReSyn: A Generalized Recursive Regular Expression Synthesis Framework

2026-03-25T01:29:42Z

Existing Programming-By-Example (PBE) systems often rely on simplified benchmarks that fail to capture the high structural complexity-such as deeper nesting and frequent Unions-of real-world regexes. To overcome the resulting performance drop, we propose ReSyn, a synthesizer-agnostic divide-and-conquer framework that decomposes complex synthesis problems into manageable sub-problems. We also introduce Set2Regex, a parameter-efficient synthesizer capturing the permutation invariance of examples. Experimental results demonstrate that ReSyn significantly boosts accuracy across various synthesizers, and its combination with Set2Regex establishes a new state-of-the-art on challenging real-world benchmark.

Automated Discovery of Test Oracles for Database Management Systems Using LLMs

2026-03-24T22:49:57Z

Since 2020, automated testing for Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely-used systems. A cornerstone of these techniques is test oracle, which typically implements a mechanism to generate equivalent query pairs, thereby identifying bugs by checking the consistency between their results. However, while applying these oracles can be automated, their design remains a fundamentally manual endeavor. This paper explores the use of large language models (LLMs) to automate the discovery and instantiation of test oracles, addressing a long-standing bottleneck towards fully automated DBMS testing. Although LLMs demonstrate impressive creativity, they are prone to hallucinations that can produce numerous false positive bug reports. Furthermore, their significant monetary cost and latency mean that LLM invocations should be limited to ensure that bug detection is efficient and economical. To this end, we introduce Argus, a novel framework built upon the core concept of the Constrained Abstract Query - a SQL skeleton containing placeholders and their associated instantiation conditions (e.g., requiring a placeholder to be filled by a boolean column). Argus uses LLMs to generate pairs of these skeletons that are asserted to be semantically equivalent. This equivalence is then formally proven using a SQL equivalence solver to ensure soundness. Finally, the placeholders within the verified skeletons are instantiated with concrete, reusable SQL snippets that are also synthesized by LLMs to efficiently produce complex test cases. We implemented Argus and evaluated it on five extensively tested DBMSs, discovering 40 previously unknown bugs, 35 of which are logic bugs, with 36 confirmed and 26 already fixed by the developers.

Lipschitz-Based Robustness Certification Under Floating-Point Execution

2026-03-24T22:18:52Z

Sensitivity-based robustness certification has emerged as a practical approach for certifying neural network robustness, including in settings that require verifiable guarantees. A key advantage of these methods is that certification is performed by concrete numerical computation (rather than symbolic reasoning) and scales efficiently with network size. However, as with the vast majority of prior work on robustness certification and verification, the soundness of these methods is typically proved with respect to a semantic model that assumes exact real arithmetic. In reality deployed neural network implementations execute using floating-point arithmetic. This mismatch creates a semantic gap between certified robustness properties and the behaviour of the executed system. As motivating evidence, we exhibit concrete counterexamples showing that real arithmetic robustness guarantees can fail under floating-point execution, even for previously verified certifiers. Discrepancies become pronounced at lower-precision formats such as float16, and under adversarially constructed models reach semantically meaningful perturbation radii at float32. We then develop a formal, compositional theory relating real arithmetic Lipschitz-based sensitivity bounds to the sensitivity of floating-point execution under standard rounding-error models, specialised to feed-forward neural networks with ReLU activations. We derive sound conditions for robustness under floating-point execution, including bounds on certificate degradation and sufficient conditions for the absence of overflow. We formalize the theory and its main soundness results, and implement an executable certifier based on these principles, which we empirically evaluate to demonstrate its practicality.

Semantics for 2D Rasterization

2026-03-24T20:14:34Z

Rasterization is the process of determining the color of every pixel drawn by an application. Powerful rasterization libraries like Skia, CoreGraphics, and Direct2D put exceptional effort into drawing, blending, and rendering efficiently. Yet applications are still hindered by the inefficient sequences of operations that they ask these libraries to perform. Even Google Chrome, a highly optimized program co-developed with the Skia rasterization library, still produces inefficient instruction sequences even on the top 100 most visited websites. The underlying reason for this inefficiency is that rasterization libraries have complex semantics and opaque and non-obvious execution models. To address this issue, we introduce $μ$Skia, a formal semantics for the Skia 2D graphics library, and mechanize this semantics in Lean. $μ$Skia covers language and graphics features like canvas state, the layer stack, blending, and color filters, and the semantics itself is split into three strata to separate concerns and enable extensibility. We then identify four patterns of sub-optimal Skia code produced by Google Chrome, and then write replacements for each pattern. $μ$Skia allows us to verify the replacements are correct, including identifying numerous tricky side conditions. We then develop a high-performance Skia optimizer that applies these patterns to speed up rasterization. On 99 Skia programs gathered from the top 100 websites, this optimizer yields a speedup of 18.7% over Skia's most modern GPU backend, while taking at most 32 $μ$s for optimization. The speedups persist across a variety of websites, Skia backends, and GPUs. To provide true, end-to-end verification, optimization traces produced by the optimizer are loaded back into the $μ$Skia semantics and translation validated in Lean.

A formalization of System I with type Top in Agda

2026-03-24T18:55:01Z

System I is a recently introduced simply-typed lambda calculus with pairs where isomorphic types are considered equal. In this work we propose a variant of System I with the type Top, and present a complete formalization of this calculus in Agda, which includes the proofs of progress and strong normalization.

Let Functions Speak: Lightweight Parametric Polymorphism via Domain and Range Types

2026-03-24T15:57:08Z

Dynamic languages such as Python and JavaScript widely use function decorators to extend behavior. In TypeScript, a common way to type such patterns uses Parameters and ReturnType. In practice, this idiom relies on a function-type bound for T that is expressed using the unsafe type any, which weakens static guarantees. At the core is a standard typing principle: application is justified only when the callee is exposed as an arrow type. We present F<:DR, a calculus that adds domain and range projection types, Dom(T) and Range(T), for arbitrary types T. These projections permit typing applications through abstract function types: an argument of type Dom(T) witnesses callability, and the result is typed as Range(T). This design complements, rather than replaces, standard arrow-based application, which remains admissible via subtyping in System F<:. We mechanize F<:DR in Rocq and prove semantic type soundness using logical relations with path selection, which delays projection interpretation until function structure is resolved. The same technique extends to additional projection types, illustrated for primitive pairs, i.e., product types.

Towards Industrial-scale Product Configuration

2026-03-24T11:54:52Z

We address the challenge of product configuration in the context of increasing customer demand for diverse and complex products. We propose a solution through a curated selection of product model benchmarks formulated in the COOM language, divided into three fragments of increasing complexity. Each fragment is accompanied by a corresponding bike model example, and additional scalable product models are included in the COOM suite, along with relevant resources. We outline an ASP-based workflow for solving COOM-based configuration problems, highlighting its adaptability to different paradigms and alternative ASP solutions. The COOM Suite aims to provide a comprehensive, accessible, and representative set of examples that can serve as a common ground for stakeholders in the field of product configuration.

Cypher is Turing-Complete: A Formal Proof via 2-Counter Machine Simulation

2026-03-24T09:07:16Z

We prove that Cypher 25, the graph query language of Neo4j, is Turing-complete. The proof shows that a single RETURN statement using reduce(), CASE expressions, and list comprehensions can simulate any 2-counter machine (Minsky 1967). We address the bounded-step objection via two complementary resolutions and present a third graph-native simulation using quantified path patterns (QPP) with allReduce. All three constructions were verified on a live Neo4j Aura instance.

Parallelizable Feynman-Kac Models for Universal Probabilistic Programming

2026-03-23T18:29:26Z

We study provably correct and efficient instantiations of Sequential Monte Carlo (SMC) inference in the context of formal operational semantics of Probabilistic Programs (PPs). We focus on universal PPs featuring sampling from arbitrary measures and conditioning/reweighting in unbounded loops. We first equip Probabilistic Program Graphs (PPGs), an automata-theoretic description format of PPs, with an expectation-based semantics over infinite execution traces, which also incorporates trace weights. We then prove a finite approximation theorem that provides bounds to this semantics based on expectations taken over finite, fixed-length traces. This enables us to frame our semantics within a Feynman-Kac (FK) model, and ensures the consistency of the Particle Filtering (PF) algorithm, an instance of SMC, with respect to our semantics. Building on these results, we introduce VPF, a vectorized version of the PF algorithm tailored to PPGs and our semantics. Experiments conducted with a proof-of-concept implementation of VPF show very promising results compared to state-of-the-art PP inference tools.

Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment

2026-03-23T17:23:32Z

Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce. We release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.

Set-Theoretic Types for Erlang: Theory, Implementation, and Evaluation

2026-03-23T14:36:41Z

Erlang's dynamic typing discipline can lead to runtime errors that persist even after process restarts. Some of these runtime errors could be prevented through static type checking. While Erlang provides a type specification language, the compiler does not enforce these types, thereby limiting their role to documentation purposes. Type checking Erlang code is challenging due to language features such as dynamic type tests, subtyping, equi-recursive types, polymorphism, intersection types in signatures, and untagged union types. This work presents a set-theoretic type system for Erlang which captures the core features of Erlang's existing type language. The formal type system guarantees type soundness, and ensures that type checking remains decidable. Additionally, an implementation of a type checker is provided, supporting all features of the Erlang type language and most term-level language constructs. A case study with modules from Erlang's standard library, an external project, and the type checker itself demonstrates its effectiveness in verifying real-world Erlang code.

A Simple and Efficient Implementation of Strong Call by Need by an Abstract Machine

2026-03-23T13:08:06Z

Strong call-by-need combines full normalization with the sharing discipline of lazy evaluation, yet no prior implementation achieved both simplicity and efficiency. We introduce RKNL, an abstract machine that realizes strong call-by-need with bilinear overhead. The machine has been derived automatically from a higher-order evaluator that uses the technique of memothunks to implement laziness. By employing an off-the-shelf transformation tool implementing the ``functional correspondence'' between higher-order interpreters and abstract machines, we obtained a simple and concise description of the machine. We prove that the resulting machine conservatively extends the lazy version of Krivine machine for the weak call-by-need strategy, and that it simulates the normal-order strategy in a bilinear number of steps, i.e., linear in both the number of beta-reductions and the size of the input term.

Instruction Set and Language for Symbolic Regression

2026-03-23T11:21:53Z

A fundamental but largely unaddressed obstacle in Symbolic regression (SR) is structural redundancy: every expression DAG with admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string -- a complete labeled-DAG isomorphism invariant -- that collapses all the equivalent representations into a single canonical form.

Multi types and reasonable space

2026-03-22T17:18:45Z

Accattoli, Dal Lago, and Vanoni have recently proved that the space used by the Space KAM, a variant of the Krivine abstract machine, is a reasonable space cost model for the lambda-calculus accounting for logarithmic space, solving a longstanding open problem. In this paper, we provide a new system of multi types (a variant of intersection types) and extract from multi type derivations the space used by the Space KAM, capturing into a type system the space complexity of the abstract machine. Additionally, we show how to capture also the time of the Space KAM, which is a reasonable time cost model, via minor changes to the type system.