https://arxiv.org/api/wk0IB+cWs6xOgX/0ZijMuD14Kpk2026-06-14T02:19:03Z988521015http://arxiv.org/abs/2605.06184v1Teaching LLMs Program Semantics via Symbolic Execution Traces2026-05-07T13:01:06ZWe introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symbolic execution engine on generic open-source C code and using the resulting traces for continued pretraining of Qwen3-8B. Just ${\sim}$3,000 bug traces combined with chain-of-thought reasoning at inference time improve violation detection by over 17 percentage points, producing one of the most balanced accuracy profiles among evaluated models. On violation detection, the trained 8B model outperforms the 4$\times$ larger Qwen3-32B without thinking and approaches it in overall accuracy. The interaction between trace training and chain-of-thought is superadditive: neither alone provides meaningful gains, but their combination does. Improvements transfer across all five property types, including ones the training traces do not target. Our 28 configurations confirm the gains stem from trace semantics, not code volume, and that trace curation and format matter.2026-05-07T13:01:06ZJonas BayerStefan ZetzscheOlivier BouissouRemi DelmasMichael TautschnigSoonho Konghttp://arxiv.org/abs/2605.08243v1GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching2026-05-07T09:10:17ZSynthesizing Mixed-Boolean Arithmetic (MBA) expressions from input-output examples is central to program deobfuscation and also useful for compiler optimization, reverse engineering, and cryptanalysis. Existing MBA synthesizers are typically CPU-based and scale poorly on large specifications or complex targets. Recent GPU-accelerated synthesis methods achieve large speedups in qualitative settings, but they depend on caching observationally equivalent candidates; this strategy breaks down for MBA because candidate outputs are quantitative bitvectors and the behavioral space is enormous. We present SIMBA (Synthesis of Mixed-Boolean Arithmetic), a GPU-accelerated MBA synthesizer built around cache-free bottom-up enumeration. SIMBA avoids language caches entirely and uses a GPU-oriented enumeration design that keeps work local and highly parallel. In experiments, SIMBA is substantially faster than prior MBA synthesis tools, handles larger specifications, and reaches expression sizes that existing methods fail to solve. These results establish cache-free GPU synthesis as a practical and scalable approach for quantitative domains, and identify it as a strong alternative to cache-centric designs.2026-05-07T09:10:17ZGabriel BathieBaptiste MouillonNathanaël Fijalkowhttp://arxiv.org/abs/2305.08486v3Rely-Guarantee Reasoning for Causally Consistent Shared Memory (Extended Version)2026-05-07T09:02:11ZRely-guarantee (RG) is a highly influential compositional proof technique for concurrent programs, which was originally developed assuming a sequentially consistent shared memory. In this paper, we first generalize RG to make it parametric with respect to the underlying memory model by introducing an RG framework that is applicable to any model axiomatically characterized by Hoare triples. Second, we instantiate this framework for reasoning about concurrent programs under causally consistent memory, which is formulated using a recently proposed potential-based operational semantics, thereby providing the first reasoning technique for such semantics. The proposed program logic, which we call Piccolo, employs a novel assertion language allowing one to specify ordered sequences of states that each thread may reach. We employ Piccolo for multiple litmus tests, as well as for an adaptation of Peterson's algorithm for mutual exclusion to causally consistent memory.2023-05-15T09:43:11ZExtended version of paper to appear in CAV 2023Ori LahavBrijesh DongolHeike Wehrheimhttp://arxiv.org/abs/2604.27644v2ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning2026-05-07T08:46:02ZWe propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question: generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions, anchored by three load-bearing mechanisms: a two-level group-relative update coupling Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT projecting the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play by 15.8 points despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.2026-04-30T09:35:57Zv2: Updated abstract; strengthened the proof of Proposition 4.1; corrected minor typos; corrected author listChengcao Yanghttp://arxiv.org/abs/2605.05282v1Beyond BLEU: A Semantic Evaluation Method for Code Translation2026-05-06T17:14:33ZCode translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores show negligible correlation with semantic correctness (r = -0.127 to 0.354), demonstrating that syntactic metrics fail to predict functional accuracy.2026-05-06T17:14:33ZJulius NäumannSven KeidelAmir Molzam ShariflooMira Mezinihttp://arxiv.org/abs/2604.23088v2Code Broker: A Multi-Agent System for Automated Code Quality Assessment2026-05-06T16:26:44ZWe present Code Broker, a multi agent system built on Google s Agent Development Kit ADK that analyses Python source code from individual files, local directory trees, or remote GitHub repositories and generates structured, actionable quality assessment reports. The system realises a hierarchical five agent architecture in which a root orchestrator coordinates a sequential pipeline agent that, in turn, dispatches three specialised agents concurrently a Correctness Assessor, a Style Assessor, and a Description Generator before synthesising their findings through an Improvement Recommender. Reports quantify four quality dimensions correctness, security, style, and maintainability on a normalised scale and are rendered in both Markdown and HTML for integration into diverse developer workflows. Code Broker fuses LLM based semantic reasoning with deterministic static analysis signals from Pylint, employs asynchronous execution with exponential backoff retry logic to improve robustness under transient API failures, and explores lightweight session memory for retaining and querying prior assessment context across runs. We frame this paper as a technical report on system design, prompt engineering, and tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases of varying scale. The results indicate that parallel specialised agents produce readable, developer oriented feedback that complements traditional linting, while also foregrounding current limitations in evaluation depth, security tooling, large repository handling, and the exclusive reliance on in memory persistence. All code and reproducibility materials are publicly available: https://github.com/Samir-atra/agents_intensive_dev.2026-04-25T00:53:59Z9 pages, 2 figures, 2 tables, 33 referencesSamer Attrahhttp://arxiv.org/abs/2605.04933v1Interaction Tree Semantics for RISC-V: Bridging Compiler and Hardware Verification2026-05-06T13:58:30ZThe Instruction Set Architecture (ISA) is the contract between compilers and processors; proving this contract formally demands cross-level connection to existing mechanized compilers and hardware implementations. As an open, modular ISA gaining adoption across embedded, mobile, and cloud platforms, RISC-V makes a formally verified ISA specification particularly valuable. However, existing formal RISC-V specifications focus on hardware tooling rather than cross-level verification: they provide no machine-checked instruction-level properties and lack support for verifying this contract across levels.
We address these limitations with a formal semantics of the RISC-V ISA in Rocq, built on Interaction Trees (ITrees). By leveraging ITree bisimulation and refinement, our semantics enables cross-level verification from compiler IR to hardware within a single framework. Our formalization covers a wide spectrum of RISC-V extensions. The correctness of individual instruction semantics is backed by machine-checked lemmas in Rocq. We further validate it by extracting an executable simulator that passes all standard RISC-V test suites. Three case studies demonstrate the effectiveness of our semantics for cross-level verification: first, we prove semantic equivalence via bisimulation between LLVM IR and RISC-V code on an array access pattern via Vellvm (LLVM ITree semantics); second, we apply translation validation to a specific instruction reordering for macro-operation fusion, distinguishing safe reorderings from those that break program-counter-relative addressing; third, we prove that a Kôika hardware ALU correctly implements all R-type integer operations (e.g., ADD, SUB, AND) against our ISA contract.2026-05-06T13:58:30ZShuanglong KanSebastian Ertelhttp://arxiv.org/abs/2101.06757v9Higher Order Automatic Differentiation of Higher Order Functions2026-05-06T09:27:49ZWe present semantic correctness proofs of automatic differentiation (AD). We consider a forward-mode AD method on a higher-order language with algebraic data types and we characterise it as the unique structure-preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming based on diffeological spaces. We show that it interprets our language and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is in essence a logical relations argument. Throughout we show how the analysis extends to AD methods for computing higher-order derivatives using a Taylor approximation.2021-01-17T19:24:46ZarXiv admin note: substantial text overlap with arXiv:2001.02209Logical Methods in Computer Science, Volume 18, Issue 1 (March 22, 2022) lmcs:7106Mathieu HuotSam StatonMatthijs Vákár10.46298/lmcs-18(1:41)2022http://arxiv.org/abs/1405.0033v5Syntax and Semantics of Linear Dependent Types2026-05-06T09:16:19ZA type theory is presented that combines (intuitionistic) linear types with type dependency, thus properly generalising both intuitionistic dependent type theory and full linear logic. A syntax and complete categorical semantics are developed, the latter in terms of (strict) indexed symmetric monoidal categories with comprehension. Various optional type formers are treated in a modular way. In particular, we will see that the historically much-debated multiplicative quantifiers and identity types arise naturally from categorical considerations. These new multiplicative connectives are further characterised by several identities relating them to the usual connectives from dependent type theory and linear logic. Finally, one important class of models, given by families with values in some symmetric monoidal category, is investigated in detail.2014-04-30T20:58:03ZMatthijs Vákárhttp://arxiv.org/abs/2605.04377v1Towards Formal Verification of Hybrid Synchronous Programs with Refinement Types2026-05-06T01:05:43ZCyber-physical systems (CPS) such as autonomous cars, aircraft, and robots are often also safety-critical; thus it is imperative that they operate as intended with a high degree of certainty. Formal verification has been employed to verify the software controlling these systems, but due to their complexity, is usually performed on an abstract model rather than the executable code. Synchronous programming languages extended with differential equations promise both rigorous modeling and sufficient expressiveness to implement executable controller code, and recent developments have introduced formal verification of strictly discrete-time programs. Extending these verification techniques to hybrid systems enables precise modeling of the environment for a wider variety of programs to be both verified and executed. We formalize the operational semantics of initial value problems and zero-crossing detection expressed in a synchronous programming language, extend its type system for verification thereof, and prove its soundness.2026-05-06T01:05:43ZNASA Formal MethodsSerra Z. DaneJiawei ChenMarc PouzetJean-Baptiste Jeanninhttp://arxiv.org/abs/2605.04232v1Probabilistic Floating-Point Round-Off Analysis via Concentration Inequalities2026-05-05T19:19:23ZFloating-point round-off errors are ubiquitous in numerically intensive programs arising in fields such as scientific computing and optimization. As floating-point errors potentially lead to unexpected and catastrophic program failures, one must derive guaranteed round-off thresholds to ensure the correctness of these programs. However, deterministic round-off thresholds tend to be too conservative to be usable in practice, since they often involve large round-off errors that occur with small probability. Probabilistic thresholds relax deterministic ones by specifying that the probability of the round-off error exceeding a threshold is below a given confidence.
In this work, we propose a novel approach to probabilistic round-off analysis, by applying concentration inequalities over the Taylor expansion from FPTaylor (TOPLAS 2018). A major obstacle in applying concentration inequalities is that the Taylor expansion involves absolute value operators that make the calculation of the expected values of the first order partial differential terms difficult. Our first step to overcome this obstacle is a sound over-approximation that removes the absolute value operators in polynomial expressions. Then, we show how to handle fractional expressions by a transformation into polynomial case. Finally, we show how to improve our approach with range partitioning. Our approach is scalable since the key computational part is the calculation of expected values of polynomial expressions with independent variables, for which the linear and independence properties of expectation boost the computation. Experimental results show that our approach is orders of magnitude more time efficient, while producing thresholds with comparable precision against the state of the art.2026-05-05T19:19:23ZLong version of the eponymous OOPSLA 2026 paperYichen TaoHongfei FuJiawei ChenJean-Baptiste Jeanninhttp://arxiv.org/abs/2605.02790v2Compositional Neural-Cyber-Physical System Verification in the Interactive Theorem Prover of Your Choice2026-05-05T08:20:52ZFormal verification of neuro-symbolic cyber-physical systems, such as drones, medical devices and robots, is complicated. Neural components must be trained to be optimal with respect to the available data as well as the safety specifications, and then verified using specialised solvers. Symbolic models of the "cyber" and "physical" behaviour of the system must be constructed and verified in interactive theorem provers (ITPs), often requiring mature mathematical libraries to reason about the interplay of discrete and continuous dynamics, preferably obtaining infinite time-horizon guarantees. Finally, the results of the two already challenging verification tasks need to be integrated into a single proof in a coherent and consistent way, whilst preserving deployability of the resulting model.
In this paper we present a compositional methodology for constructing such proofs. The Vehicle framework provides a functional, domain-specific language for specifying, training, and verifying neural components. We extend Vehicle to allow integration with any ITP with minimal effort. First, we describe how Vehicle's standard bidirectional type checker can be reused to transpile neural specifications into an intermediate representation targeting multiple theorem provers. Second, we integrate Vehicle with Rocq, Isabelle/HOL, Agda and the industrial prover Imandra; and showcase a generic infinite time-horizon safety proof of a discrete cyber-physical system with a neural network controller in each ITP. Finally, we use the Mathematical Components libraries in Rocq to verify infinite time-horizon safety of a medical device, modelled as a continuous cyber-physical system with a neural controller. To our knowledge, this is the first result of this kind in a general purpose ITP; and a result that was only feasible thanks to the compositionality provided by Vehicle's functional interface.2026-05-04T16:33:49ZMatthew L. DaggittEkaterina KomendantskayaAlistair SirmanAlessandro BruniSamuel TeuberJosh SmartGrant Passmorehttp://arxiv.org/abs/2604.24797v2The Network Structure of Mathlib2026-05-04T23:21:56ZThe ongoing development of Lean 4's Mathlib has produced a macroscopic structural complexity that interweaves logical, mathematical, and infrastructural dependencies. We present a network analysis of this library, extracting its dependency structure into a multilayer graph of 308,129 declarations, 8.4 million edges, and 7,563 modules. By introducing graph decompositions that isolate explicit edges from those synthesized by the compiler or driven by proofs, we quantify the structural properties of formalized mathematics. Our analysis reveals three findings. First, taxonomies designed by humans diverge from logical structures, exhibiting a 50.9% coupling across namespaces. Second, developers utilize a median of 1.6% of the imported scope. Third, formalization compresses semantic hierarchies, with network centrality capturing language infrastructure rather than mathematical relevance.2026-04-26T19:46:06ZXinze LiNanyun PengSimone SeveriniPatrick Shaftohttp://arxiv.org/abs/2605.03143v1Pact: A Choreographic Language for Agentic Ecosystems2026-05-04T20:32:40ZRecent advances in large language models have led to the rise of software systems (i.e. agents) that execute with increasing autonomy on behalf of users in open, multi-party settings, interacting with untrusted counterparts and managing private information. Choreographic programming offers correct-by-construction protocol-design for such settings, but assumes cooperative participants -- it has no notion of agent self-interest, that is, why an agent will follow a protocol.
In this talk we introduce Pact, a choreographic language extended with operations to describe agent choices and preferences, drawing from the rich literature of game theory. Every Pact protocol maps to a formal game, allowing protocol designers to reason about game-theoretic properties of their protocols, such as solving for decision policies. We present Pact's design and a preliminary implementation -- a bounded-rational solver that computes decision policies over Pact protocols -- and findings from applying this language to multi-party coordination with self-interested agentic participants.2026-05-04T20:32:40ZTo be presented at the 2nd International Workshop on Choreographic Programming (CP 2026)Kiran GopinathanJack FeserMichelangelo NaimZenna TavaresEli Binghamhttp://arxiv.org/abs/2302.01979v4Hybrid Multiparty Session Types -- Full Version2026-05-04T18:45:28ZMultiparty session types (MPST) are a specification and verification framework for distributed message-passing systems. The communication protocol of the system is specified as a global type, from which a collection of local types (local process implementations) is obtained by endpoint projection. A global type is a single disciplining entity for the whole system, specified by one designer that has full knowledge of the communication protocol. On the other hand, distributed systems are often described in terms of their components: a different designer is in charge of providing a subprotocol for each component. The problem of modular specification of global protocols has been addressed in the literature, but the state of the art focuses only on dual input/output compatibility. Our work overcomes this limitation. We propose the first MPST theory of multiparty compositionality for distributed protocol specification that is semantics-preserving, allows the composition of two or more components, and retains full MPST expressiveness. We introduce hybrid types for describing subprotocols interacting with each other, define a novel compatibility relation, explicitly describe an algorithm for composing multiple subprotocols into a well-formed global type, and prove that compositionality preserves projection, thus retaining semantic guarantees, such as liveness and deadlock freedom. Finally, we test our work against real-world case studies and we smoothly extend our novel compatibility to MPST with delegation and explicit connections.2023-02-03T19:54:17ZLorenzo GheriNobuko Yoshida