https://arxiv.org/api/wk0IB+cWs6xOgX/0ZijMuD14Kpk 2026-06-14T02:19:03Z 9885 210 15 http://arxiv.org/abs/2605.06184v1 Teaching LLMs Program Semantics via Symbolic Execution Traces 2026-05-07T13:01:06Z

We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symbolic execution engine on generic open-source C code and using the resulting traces for continued pretraining of Qwen3-8B. Just ${\sim}$3,000 bug traces combined with chain-of-thought reasoning at inference time improve violation detection by over 17 percentage points, producing one of the most balanced accuracy profiles among evaluated models. On violation detection, the trained 8B model outperforms the 4$\times$ larger Qwen3-32B without thinking and approaches it in overall accuracy. The interaction between trace training and chain-of-thought is superadditive: neither alone provides meaningful gains, but their combination does. Improvements transfer across all five property types, including ones the training traces do not target. Our 28 configurations confirm the gains stem from trace semantics, not code volume, and that trace curation and format matter.

2026-05-07T13:01:06Z Jonas Bayer Stefan Zetzsche Olivier Bouissou Remi Delmas Michael Tautschnig Soonho Kong http://arxiv.org/abs/2605.08243v1 GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching 2026-05-07T09:10:17Z

Synthesizing Mixed-Boolean Arithmetic (MBA) expressions from input-output examples is central to program deobfuscation and also useful for compiler optimization, reverse engineering, and cryptanalysis. Existing MBA synthesizers are typically CPU-based and scale poorly on large specifications or complex targets. Recent GPU-accelerated synthesis methods achieve large speedups in qualitative settings, but they depend on caching observationally equivalent candidates; this strategy breaks down for MBA because candidate outputs are quantitative bitvectors and the behavioral space is enormous. We present SIMBA (Synthesis of Mixed-Boolean Arithmetic), a GPU-accelerated MBA synthesizer built around cache-free bottom-up enumeration. SIMBA avoids language caches entirely and uses a GPU-oriented enumeration design that keeps work local and highly parallel. In experiments, SIMBA is substantially faster than prior MBA synthesis tools, handles larger specifications, and reaches expression sizes that existing methods fail to solve. These results establish cache-free GPU synthesis as a practical and scalable approach for quantitative domains, and identify it as a strong alternative to cache-centric designs.

2026-05-07T09:10:17Z Gabriel Bathie Baptiste Mouillon Nathanaël Fijalkow http://arxiv.org/abs/2305.08486v3 Rely-Guarantee Reasoning for Causally Consistent Shared Memory (Extended Version) 2026-05-07T09:02:11Z

Rely-guarantee (RG) is a highly influential compositional proof technique for concurrent programs, which was originally developed assuming a sequentially consistent shared memory. In this paper, we first generalize RG to make it parametric with respect to the underlying memory model by introducing an RG framework that is applicable to any model axiomatically characterized by Hoare triples. Second, we instantiate this framework for reasoning about concurrent programs under causally consistent memory, which is formulated using a recently proposed potential-based operational semantics, thereby providing the first reasoning technique for such semantics. The proposed program logic, which we call Piccolo, employs a novel assertion language allowing one to specify ordered sequences of states that each thread may reach. We employ Piccolo for multiple litmus tests, as well as for an adaptation of Peterson's algorithm for mutual exclusion to causally consistent memory.

2023-05-15T09:43:11Z Extended version of paper to appear in CAV 2023 Ori Lahav Brijesh Dongol Heike Wehrheim http://arxiv.org/abs/2604.27644v2 ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning 2026-05-07T08:46:02Z

We propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question: generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions, anchored by three load-bearing mechanisms: a two-level group-relative update coupling Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT projecting the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play by 15.8 points despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.

2026-04-30T09:35:57Z v2: Updated abstract; strengthened the proof of Proposition 4.1; corrected minor typos; corrected author list Chengcao Yang http://arxiv.org/abs/2605.05282v1 Beyond BLEU: A Semantic Evaluation Method for Code Translation 2026-05-06T17:14:33Z

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores show negligible correlation with semantic correctness (r = -0.127 to 0.354), demonstrating that syntactic metrics fail to predict functional accuracy.

2026-05-06T17:14:33Z Julius Näumann Sven Keidel Amir Molzam Sharifloo Mira Mezini http://arxiv.org/abs/2604.23088v2 Code Broker: A Multi-Agent System for Automated Code Quality Assessment 2026-05-06T16:26:44Z

We present Code Broker, a multi agent system built on Google s Agent Development Kit ADK that analyses Python source code from individual files, local directory trees, or remote GitHub repositories and generates structured, actionable quality assessment reports. The system realises a hierarchical five agent architecture in which a root orchestrator coordinates a sequential pipeline agent that, in turn, dispatches three specialised agents concurrently a Correctness Assessor, a Style Assessor, and a Description Generator before synthesising their findings through an Improvement Recommender. Reports quantify four quality dimensions correctness, security, style, and maintainability on a normalised scale and are rendered in both Markdown and HTML for integration into diverse developer workflows. Code Broker fuses LLM based semantic reasoning with deterministic static analysis signals from Pylint, employs asynchronous execution with exponential backoff retry logic to improve robustness under transient API failures, and explores lightweight session memory for retaining and querying prior assessment context across runs. We frame this paper as a technical report on system design, prompt engineering, and tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases of varying scale. The results indicate that parallel specialised agents produce readable, developer oriented feedback that complements traditional linting, while also foregrounding current limitations in evaluation depth, security tooling, large repository handling, and the exclusive reliance on in memory persistence. All code and reproducibility materials are publicly available: https://github.com/Samir-atra/agents_intensive_dev.

2026-04-25T00:53:59Z 9 pages, 2 figures, 2 tables, 33 references Samer Attrah http://arxiv.org/abs/2605.04933v1 Interaction Tree Semantics for RISC-V: Bridging Compiler and Hardware Verification 2026-05-06T13:58:30Z

The Instruction Set Architecture (ISA) is the contract between compilers and processors; proving this contract formally demands cross-level connection to existing mechanized compilers and hardware implementations. As an open, modular ISA gaining adoption across embedded, mobile, and cloud platforms, RISC-V makes a formally verified ISA specification particularly valuable. However, existing formal RISC-V specifications focus on hardware tooling rather than cross-level verification: they provide no machine-checked instruction-level properties and lack support for verifying this contract across levels. We address these limitations with a formal semantics of the RISC-V ISA in Rocq, built on Interaction Trees (ITrees). By leveraging ITree bisimulation and refinement, our semantics enables cross-level verification from compiler IR to hardware within a single framework. Our formalization covers a wide spectrum of RISC-V extensions. The correctness of individual instruction semantics is backed by machine-checked lemmas in Rocq. We further validate it by extracting an executable simulator that passes all standard RISC-V test suites. Three case studies demonstrate the effectiveness of our semantics for cross-level verification: first, we prove semantic equivalence via bisimulation between LLVM IR and RISC-V code on an array access pattern via Vellvm (LLVM ITree semantics); second, we apply translation validation to a specific instruction reordering for macro-operation fusion, distinguishing safe reorderings from those that break program-counter-relative addressing; third, we prove that a Kôika hardware ALU correctly implements all R-type integer operations (e.g., ADD, SUB, AND) against our ISA contract.

2026-05-06T13:58:30Z Shuanglong Kan Sebastian Ertel http://arxiv.org/abs/2101.06757v9 Higher Order Automatic Differentiation of Higher Order Functions 2026-05-06T09:27:49Z

We present semantic correctness proofs of automatic differentiation (AD). We consider a forward-mode AD method on a higher-order language with algebraic data types and we characterise it as the unique structure-preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming based on diffeological spaces. We show that it interprets our language and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is in essence a logical relations argument. Throughout we show how the analysis extends to AD methods for computing higher-order derivatives using a Taylor approximation.

2021-01-17T19:24:46Z arXiv admin note: substantial text overlap with arXiv:2001.02209 Logical Methods in Computer Science, Volume 18, Issue 1 (March 22, 2022) lmcs:7106 Mathieu Huot Sam Staton Matthijs Vákár 10.46298/lmcs-18(1:41)2022 http://arxiv.org/abs/1405.0033v5 Syntax and Semantics of Linear Dependent Types 2026-05-06T09:16:19Z

A type theory is presented that combines (intuitionistic) linear types with type dependency, thus properly generalising both intuitionistic dependent type theory and full linear logic. A syntax and complete categorical semantics are developed, the latter in terms of (strict) indexed symmetric monoidal categories with comprehension. Various optional type formers are treated in a modular way. In particular, we will see that the historically much-debated multiplicative quantifiers and identity types arise naturally from categorical considerations. These new multiplicative connectives are further characterised by several identities relating them to the usual connectives from dependent type theory and linear logic. Finally, one important class of models, given by families with values in some symmetric monoidal category, is investigated in detail.

2014-04-30T20:58:03Z Matthijs Vákár http://arxiv.org/abs/2605.04377v1 Towards Formal Verification of Hybrid Synchronous Programs with Refinement Types 2026-05-06T01:05:43Z

Cyber-physical systems (CPS) such as autonomous cars, aircraft, and robots are often also safety-critical; thus it is imperative that they operate as intended with a high degree of certainty. Formal verification has been employed to verify the software controlling these systems, but due to their complexity, is usually performed on an abstract model rather than the executable code. Synchronous programming languages extended with differential equations promise both rigorous modeling and sufficient expressiveness to implement executable controller code, and recent developments have introduced formal verification of strictly discrete-time programs. Extending these verification techniques to hybrid systems enables precise modeling of the environment for a wider variety of programs to be both verified and executed. We formalize the operational semantics of initial value problems and zero-crossing detection expressed in a synchronous programming language, extend its type system for verification thereof, and prove its soundness.

2026-05-06T01:05:43Z NASA Formal Methods Serra Z. Dane Jiawei Chen Marc Pouzet Jean-Baptiste Jeannin http://arxiv.org/abs/2605.04232v1 Probabilistic Floating-Point Round-Off Analysis via Concentration Inequalities 2026-05-05T19:19:23Z

Floating-point round-off errors are ubiquitous in numerically intensive programs arising in fields such as scientific computing and optimization. As floating-point errors potentially lead to unexpected and catastrophic program failures, one must derive guaranteed round-off thresholds to ensure the correctness of these programs. However, deterministic round-off thresholds tend to be too conservative to be usable in practice, since they often involve large round-off errors that occur with small probability. Probabilistic thresholds relax deterministic ones by specifying that the probability of the round-off error exceeding a threshold is below a given confidence. In this work, we propose a novel approach to probabilistic round-off analysis, by applying concentration inequalities over the Taylor expansion from FPTaylor (TOPLAS 2018). A major obstacle in applying concentration inequalities is that the Taylor expansion involves absolute value operators that make the calculation of the expected values of the first order partial differential terms difficult. Our first step to overcome this obstacle is a sound over-approximation that removes the absolute value operators in polynomial expressions. Then, we show how to handle fractional expressions by a transformation into polynomial case. Finally, we show how to improve our approach with range partitioning. Our approach is scalable since the key computational part is the calculation of expected values of polynomial expressions with independent variables, for which the linear and independence properties of expectation boost the computation. Experimental results show that our approach is orders of magnitude more time efficient, while producing thresholds with comparable precision against the state of the art.

2026-05-05T19:19:23Z Long version of the eponymous OOPSLA 2026 paper Yichen Tao Hongfei Fu Jiawei Chen Jean-Baptiste Jeannin http://arxiv.org/abs/2605.02790v2 Compositional Neural-Cyber-Physical System Verification in the Interactive Theorem Prover of Your Choice 2026-05-05T08:20:52Z

Formal verification of neuro-symbolic cyber-physical systems, such as drones, medical devices and robots, is complicated. Neural components must be trained to be optimal with respect to the available data as well as the safety specifications, and then verified using specialised solvers. Symbolic models of the "cyber" and "physical" behaviour of the system must be constructed and verified in interactive theorem provers (ITPs), often requiring mature mathematical libraries to reason about the interplay of discrete and continuous dynamics, preferably obtaining infinite time-horizon guarantees. Finally, the results of the two already challenging verification tasks need to be integrated into a single proof in a coherent and consistent way, whilst preserving deployability of the resulting model. In this paper we present a compositional methodology for constructing such proofs. The Vehicle framework provides a functional, domain-specific language for specifying, training, and verifying neural components. We extend Vehicle to allow integration with any ITP with minimal effort. First, we describe how Vehicle's standard bidirectional type checker can be reused to transpile neural specifications into an intermediate representation targeting multiple theorem provers. Second, we integrate Vehicle with Rocq, Isabelle/HOL, Agda and the industrial prover Imandra; and showcase a generic infinite time-horizon safety proof of a discrete cyber-physical system with a neural network controller in each ITP. Finally, we use the Mathematical Components libraries in Rocq to verify infinite time-horizon safety of a medical device, modelled as a continuous cyber-physical system with a neural controller. To our knowledge, this is the first result of this kind in a general purpose ITP; and a result that was only feasible thanks to the compositionality provided by Vehicle's functional interface.

2026-05-04T16:33:49Z Matthew L. Daggitt Ekaterina Komendantskaya Alistair Sirman Alessandro Bruni Samuel Teuber Josh Smart Grant Passmore http://arxiv.org/abs/2604.24797v2 The Network Structure of Mathlib 2026-05-04T23:21:56Z

The ongoing development of Lean 4's Mathlib has produced a macroscopic structural complexity that interweaves logical, mathematical, and infrastructural dependencies. We present a network analysis of this library, extracting its dependency structure into a multilayer graph of 308,129 declarations, 8.4 million edges, and 7,563 modules. By introducing graph decompositions that isolate explicit edges from those synthesized by the compiler or driven by proofs, we quantify the structural properties of formalized mathematics. Our analysis reveals three findings. First, taxonomies designed by humans diverge from logical structures, exhibiting a 50.9% coupling across namespaces. Second, developers utilize a median of 1.6% of the imported scope. Third, formalization compresses semantic hierarchies, with network centrality capturing language infrastructure rather than mathematical relevance.

2026-04-26T19:46:06Z Xinze Li Nanyun Peng Simone Severini Patrick Shafto http://arxiv.org/abs/2605.03143v1 Pact: A Choreographic Language for Agentic Ecosystems 2026-05-04T20:32:40Z

Recent advances in large language models have led to the rise of software systems (i.e. agents) that execute with increasing autonomy on behalf of users in open, multi-party settings, interacting with untrusted counterparts and managing private information. Choreographic programming offers correct-by-construction protocol-design for such settings, but assumes cooperative participants -- it has no notion of agent self-interest, that is, why an agent will follow a protocol. In this talk we introduce Pact, a choreographic language extended with operations to describe agent choices and preferences, drawing from the rich literature of game theory. Every Pact protocol maps to a formal game, allowing protocol designers to reason about game-theoretic properties of their protocols, such as solving for decision policies. We present Pact's design and a preliminary implementation -- a bounded-rational solver that computes decision policies over Pact protocols -- and findings from applying this language to multi-party coordination with self-interested agentic participants.

2026-05-04T20:32:40Z To be presented at the 2nd International Workshop on Choreographic Programming (CP 2026) Kiran Gopinathan Jack Feser Michelangelo Naim Zenna Tavares Eli Bingham http://arxiv.org/abs/2302.01979v4 Hybrid Multiparty Session Types -- Full Version 2026-05-04T18:45:28Z

Multiparty session types (MPST) are a specification and verification framework for distributed message-passing systems. The communication protocol of the system is specified as a global type, from which a collection of local types (local process implementations) is obtained by endpoint projection. A global type is a single disciplining entity for the whole system, specified by one designer that has full knowledge of the communication protocol. On the other hand, distributed systems are often described in terms of their components: a different designer is in charge of providing a subprotocol for each component. The problem of modular specification of global protocols has been addressed in the literature, but the state of the art focuses only on dual input/output compatibility. Our work overcomes this limitation. We propose the first MPST theory of multiparty compositionality for distributed protocol specification that is semantics-preserving, allows the composition of two or more components, and retains full MPST expressiveness. We introduce hybrid types for describing subprotocols interacting with each other, define a novel compatibility relation, explicitly describe an algorithm for composing multiple subprotocols into a well-formed global type, and prove that compositionality preserves projection, thus retaining semantic guarantees, such as liveness and deadlock freedom. Finally, we test our work against real-world case studies and we smoothly extend our novel compatibility to MPST with delegation and explicit connections.

2023-02-03T19:54:17Z Lorenzo Gheri Nobuko Yoshida