https://arxiv.org/api/Rn1GLNQZ1s0sWZY0bVqMnnhiLIs2026-06-13T14:23:57Z98856015http://arxiv.org/abs/2606.01928v1Teaching Synchronous Dataflow Modelling with Learn-Heptagon2026-06-01T08:58:49ZLustre is a synchronous dataflow language designed to implement safety-critical embedded software. In addition to writing executable programs, the language doubles as a program logic, used for writing specification as synchronous observers or assume-guarantee contracts that specify properties of these programs. These specifications may be used during testing or proved exhaustively using model-checking tools. We taught a course on Lustre to last year engineering students. To streamline the learning experience and avoid technical issues, we developped an online application, Learn-Heptagon, which allows for writing, simulating, and proving properties of Lustre programs. This paper presents the application and the associated lesson plan.2026-06-01T08:58:49ZPierre-Loïc GarocheBasile Pesinhttp://arxiv.org/abs/2602.16291v10A Calculus of Inheritance2026-06-01T07:26:27ZJust as the $λ$-calculus uses three primitives (abstraction, application, variable) as the foundation of functional programming, inheritance-calculus uses three primitives (record, definition, inheritance) as the foundation of declarative programming. By unifying modules, classes, objects, methods, fields, and locals under a single record abstraction, the calculus models inheritance simply as set union. Consequently, composition is inherently commutative, idempotent, and associative, structurally eliminating the multiple-inheritance linearization problem. Its semantics is first-order, denotational, and computable by tabling, even for cyclic inheritance hierarchies. These three properties extend to the $λ$-calculus, since Böhm tree equivalence is fully abstract for the first-iteration approximation of a sublanguage of inheritance-calculus. As a corollary, this establishes a convergence hierarchy $\text{eager} \subsetneq \text{lazy} \subsetneq \text{fixpoint}$ among $λ$-calculi sharing the same $λ$-syntax.
Inheritance-calculus is distilled from MIXINv2, a practical implementation in which the same code acts as different function colors; ordinary arithmetic yields the relational semantics of logic programming; $\mathtt{this}$ resolves to multiple targets; and programs are immune to nonextensibility in the sense of the Expression Problem. This makes inheritance-calculus strictly more expressive than the $λ$-calculus in both common sense and Felleisen's sense.2026-02-18T09:17:20ZBo Yanghttp://arxiv.org/abs/2606.01522v1Type-Error Ablation and AI Coding Agents2026-06-01T01:09:13ZProgramming language implementors have designed error messages with one consumer in mind: the human programmer. Human-factors research has consistently found that programmers engage with error messages poorly -- they skim, miss key information, and are easily overwhelmed. The practical consequence has been a strong design pressure toward brevity: messages should be terse enough that programmers will actually read them.
AI coding agents are now a second, fundamentally different consumer of error messages. Unlike humans, agents do not tire, lose attention, or find length cognitively overwhelming. This raises a question the programming-language community has not previously had reason to ask: should error-message detail be calibrated differently for AI agents than for humans?
We investigate this question through a controlled experiment using Shplait, an ML-style statically typed language. We construct a suite of programs containing a single deliberate type error each, and measure how often an AI agent repairs them under ablation: a detailed error context using the unification stack; a proximate error location; a minimal type error; and a dynamic (test suite) error only. An automated oracle uses a test suite to classify each repair attempt as a type error, semantically incorrect, or semantically correct.
We find concrete evidence that more detailed error messages improve an agent's ability to fix type errors. We also find that the presence of a type system appears to help more than only test suite failure reports. As a secondary finding, in cases where an agent successfully fixes the type error, the resulting program passes all semantic tests most of the time -- lending empirical support to a widely held folk belief about typed languages. We also see evidence that leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated.2026-06-01T01:09:13ZShriram KrishnamurthiMatthew Flatthttp://arxiv.org/abs/2606.02651v1From Rocq to Metal: A Pipeline for Formally Verified Microcontroller Firmware2026-05-31T21:14:54ZEnforcing invariants in safety-critical systems is increasingly urgent as AI-generated code becomes widespread. Unfortunately, the runtimes required to support high-level specification languages are too large for most embedded targets. In this article, we show how formally verified firmware is achievable today. We built Encore!, a bare-metal Continuation Passing Style (CPS) virtual machine (VM) that runs Rocq-extracted Scheme on microcontrollers. We also show how to structure firmware as a pure state-transition function, making its core fully provable in Rocq while keeping the unverified host layer constant regardless of firmware complexity. Large Language Model (LLM)-assisted tactic synthesis fits naturally into this workflow: formal theorem statements replace manual code review, allowing AI-generated firmware to prove itself.2026-05-31T21:14:54ZValentin BergeronKarolina Gornahttp://arxiv.org/abs/2504.21291v3Efficiency of Analysis of Transitive Relations using Query-Driven, Ground-and-Solve, and Fact-Driven Inference2026-05-31T08:24:25ZLogic rules allow analysis of complex relationships to be expressed easily, especially for transitive relations in critical applications. However, understanding and predicting the efficiency of different inference methods remain challenging, even for simplest rules given different kinds of input data.
This paper analyzes the efficiency of all three types of well-known inference methods -- query-driven, ground-and-solve, and fact-driven -- along with their respective optimizations, and compares with optimal complexities for the first time, for analyzing transitive graph relations. We also experiment with rule systems widely considered to have the best performance. We analyze all well-known rule variants and widely varying input graphs. The results include precisely calculated optimal time complexities; comparative analysis across different inference methods, rule variants, and graph types; confirmation with performance experiments; as well as discovery of a performance bug.2025-04-30T03:55:48ZYanhong A. LiuJohn IdogunScott D. StollerYi Tonghttp://arxiv.org/abs/2606.01076v1Exploiting Multiple Abstract Call Patterns for Optimizing Run-Time Checks2026-05-31T07:44:59ZIn strongly-typed languages, types are verified at compile time, while dynamically typed languages, such as Prolog, perform type consistency checks entirely at run-time. Extending dynamic languages with assertions allows expressing both classical types and more general properties, providing high expressiveness, but at the cost of run-time overhead. Abstract interpretation allows safely approximating such program properties at compile time, which has been used to reduce the number of properties that require run-time checks, while still reporting unverified properties that can guide further static analyses, testing, or domain refinement. In this work, we first study how to selectively integrate the run-time semantics of assertion properties into a multivariant, top-down, goal-directed abstract interpretation algorithm. We then show how multiple inferred calling patterns can be exploited to reduce the number of properties that must be checked at run-time, thus minimizing the overhead. Finally, we report on an implementation of our approach in the Ciao system and provide performance results supporting that better results can be obtained than with the previously reported techniques.2026-05-31T07:44:59ZTo appear in Theory and Practice of Logic ProgrammingDaniela FerreiroDaniel Jurjo-RivasMarco CiccalèJose F. MoralesPedro López-GarcíaManuel Hermenegildohttp://arxiv.org/abs/2606.00692v1Grid Programs: A Two-Dimensional, Variable-Free Model of Computation2026-05-30T12:02:07ZWe introduce Grid Programs, a novel model of computation in which programs are finite two-dimensional arrangements of instructions on an integer grid rather than linear sequences of statements. Three properties distinguish this model fundamentally from classical frameworks: (i) programs are planar structures through which an instruction pointer moves in the four cardinal directions; (ii) there are no syntax constraints, so any assignment of instructions to grid cells constitutes a valid program; and (iii) the model uses no named variables or explicit memory addresses. Program state is maintained through a data stack, an address stack, and a circularly doubly linked list accessed via three named pointers. Control flow is achieved spatially, with branching encoded as perpendicular turns of the instruction pointer. The address stack stores triplets (cell row, cell column, direction), enabling precise restoration of both position and heading after branches, loops, and function calls. We give a formal operational semantics, present a representative instruction set covering arithmetic, control flow, and linked-list manipulation, and work through several detailed examples, including an absolute-value function, a factorial computation, a linear-search algorithm, a string-reversal program, and a while-loop summation. We establish that Grid Programs are Turing-complete by simulating an arbitrary register machine, and we discuss their relationship to prior two-dimensional languages such as Befunge and Funge-98, to stack-based languages such as Forth and PostScript, and to dataflow and spatial computation models. Grid Programs offer a fresh vantage point for exploring the design space of computation, with potential applications in visual programming environments, cellular-automaton-inspired hardware, and obfuscation-resistant code.2026-05-30T12:02:07ZEzequiel López-Rubiohttp://arxiv.org/abs/2606.00601v1ScanWeaver: Compiler-Driven Parallelization of Affine Recurrences via Associative Scan Lowering2026-05-30T07:58:02ZSelective state-space models such as Mamba highlight the practical importance of input-dependent scan recurrences, which preserve linear-time sequence modeling while improving language modeling capabilities. However, these recurrences introduce stricter sequential dependencies than classical structured SSMs, limiting parallel execution on modern accelerators.
We present \textbf{ScanWeaver}, a compiler framework that transforms recurrence-based computations into associative scan representations and lowers them end-to-end to executable GPU programs. We use Mamba-style selective scan as a motivating example of a broader class of affine recurrences that arise in modern ML workloads. Rather than targeting a single model family, ScanWeaver elevates this recurrence structure to a first-class compiler abstraction, enabling systematic MLIR-based lowering to compiler-generated Blelloch scan execution on GPUs.
Across forward selective-scan workloads with matched local recurrence semantics, we validate affine recurrence decomposition, Blelloch lowering, MLIR GPU lowering, executable artifact generation, and actual GPU execution from generated MLIR. We benchmark the resulting ScanWeaver GPU execution against PyTorch and CUDA sequential baselines, and include the Mamba kernel as a fused production baseline for systems context.2026-05-30T07:58:02Z11 pages, 1 figureQiying WuPavel Zolnikovhttp://arxiv.org/abs/2606.00220v1SEMBridge: Tagless-Final Program Semantics with Weakest-Precondition and Bounded-Checking Interpretations2026-05-29T18:00:06ZFormal methods provide rigorous accounts of program behavior, but practical software engineering often works through executable libraries, tests, and incremental design. This paper presents SEMBridge, a small tagless-final framework for generating weakest-precondition and bounded-checking interpretations from the same executable object programs. Instead of committing a program semantics to one abstract syntax tree and then writing separate traversals, object programs are written once against a semantic interface and interpreted into multiple meanings: readable code, concrete execution, predicate transformers, bounded counterexample search, and future proof-assistant or SMT back ends. The Python prototype implements a loop-free imperative core with assignments, conditionals, assumptions, and assertions. Across five example programs, the same tagless-final definitions generated executable state transformers and verification conditions that passed bounded checking over domains up to 729 states. The contribution is not a Scala code-generation system or a new verifier, but a compact architecture for keeping executable semantics, weakest-precondition artifacts, and bounded validation synchronized.2026-05-29T18:00:06ZEric Lianghttp://arxiv.org/abs/2510.03415v3LLMs Lean on Priors, Not Programming Language Semantics2026-05-29T17:06:15ZRecent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics define behavior through symbolic transition rules that can be systematically altered under distribution shift. We investigate whether LLMs can condition their reasoning on formal semantics through program execution and introduce PLSemanticsBench, pairing featherweight C programs with two semantic systems -- small-step operational semantics and K semantics -- and probing four capabilities: composing rules for final states, selecting rules when state is unmutated, sustaining such conditioning over long traces, and following supplied rules under novel semantics. To decouple semantic reasoning from syntactic familiarity, we redefine familiar operators to induce symbol-meaning conflict and introduce novel symbols defined only through the supplied rules, and stress-test models on Human-Written, LLM-Translated, and Fuzzer-Generated splits with increasing structural complexity. Across 11 frontier LLMs, strong final-state accuracy under standard semantics (up to 90%) drops sharply -- by as much as 40--60% points -- under semantic mutations and increasing structural complexity. Only a handful of models achieve non-zero long-horizon conditioning accuracy, and even the best systems reach just 35%. Together, these results suggest that contemporary LLMs often rely on pretrained lexical associations rather than systematically conditioning on supplied formal rules. PLSemanticsBench is publicly available at https://EngineeringSoftware.github.io/PLSemanticsBench.2025-10-03T18:23:26ZAccepted at ICML 2026Aditya ThimmaiahJiyang ZhangJayanth SrinivasaJunyi Jessy LiMilos Gligorichttp://arxiv.org/abs/2605.31517v1Practical Algebraic Stepping with Scoped Filters2026-05-29T16:34:37ZAlgebraic steppers help students learn functional programming by displaying evaluation as a sequence of small-step reductions, but even simple programs produce long traces in which key ideas are buried under mundane reductions. This paper presents the filtered stepper calculus, a formal framework that gives users scoped, pattern-based control over which reduction steps are shown or hidden. Users annotate programs with lightweight filter expressions that match on the structure of redexes. Filters compose via lexical scoping so that inner filters override outer ones. We prove preservation, progress, and a simulation theorem establishing that the filtered stepper agrees with the underlying unfiltered semantics, and mechanize all proofs in Agda. We implement the calculus in the Hazel live programming environment, including its support for stepping programs with holes and type errors. To do so, we reconcile Hazel's internal environment-based evaluator with the substitution-based presentation expected in the classroom. We deploy the system in a university programming languages course. Our evaluation shows that students adopt the stepper organically, though more advanced uses of filters may require further instruction, and that instructors can use filters to craft focused traces for use in lectures.2026-05-29T16:34:37Z30 pages, 10 figures, 2 tablesHaoxiang FeiMatthew KeenanCyrus Omarhttp://arxiv.org/abs/2606.12445v1SAT, MaxSAT, and SMT for QLDPC Distance Computation: A Large-Scale Empirical Study2026-05-29T15:34:33ZExact distance computation for quantum LDPC (QLDPC) codes plays a central role in validating candidate fault-tolerant quantum-code constructions, yet the computational structure of this problem remains poorly understood. Despite substantial recent progress in QLDPC design, it remains unclear which algorithmic principles govern the practical scalability of exact distance computation and which classes of exact solvers are best suited to this task. To address these questions, we conduct a systematic study of SAT- and MaxSAT-based formulations for exact QLDPC distance computation across representative codes. We further compare these formulations against several established exact-distance approaches in order to better understand the algorithmic landscape of exact QLDPC distance computation. Our study challenges and refines several prevailing intuitions about exact QLDPC distance computation. First, despite the XOR-rich structure of QLDPC parity checks, practical scalability appears to be governed more by the handling of cardinality constraints and optimization bounds than by parity reasoning alone. Accordingly, XOR-aware reasoning does not provide a systematic advantage across our benchmark suite. Second, Brouwer-Zimmermann-style search, long regarded as the benchmark paradigm for exact distance computation in sparse classical codes, no longer maintains its traditional scalability advantage in the QLDPC setting. This finding challenges the expectation that techniques successful for sparse classical codes remain dominant for QLDPC codes. Third, substantial qualitative differences arise even among MaxSAT solvers themselves. Branch-and-bound MaxSAT significantly outperforms unsat-core-based MaxSAT on challenging benchmarks, demonstrating that solver architecture and optimization strategy play a decisive role in practical scalability.2026-05-29T15:34:33Z15 pages of main text and 28 pages of appendix. 3 figuresYu-Fang ChenSeyed Mohammad Reza JafariChing-Yi Laihttp://arxiv.org/abs/2605.31389v1Neuroforger: certified violation witnesses for smart contracts verification via LLMs2026-05-29T14:54:51ZRecent large language models (LLMs) incorporate reasoning capabilities that allow them to perform well in predicting whether a smart contract respects a certain property, suggesting a complementary approach to traditional formal-methods-based techniques for smart contract verification. However, the application of LLMs in such context has two major issues: 1) properties expressed in natural language are intrinsically ambiguous, and 2) answers returned by LLMs have no guarantee of correctness. In this paper, we address both issues simultaneously by: 1) introducing a new formal specification language that extends Solidity with abstract types, and 2) designing a workflow that combines LLMs with type checking and concrete execution to generate and validate violation witnesses (i.e., counterexamples). The key idea is to represent a specification as a Solidity test with (existentially quantified) variables of abstract type; finding an instantiation of these variables to concrete values (of the correct type) concretizes the test into an executable counterexample (PoC) for the target property. We implemented our procedure in the tool Neuroforger, experimentally evaluating it on a smart-contract verification dataset drawn from literature, obtaining promising results that demonstrate its potential applicability in the wild.2026-05-29T14:54:51ZMassimo BartolettiEnrico Lipparinihttp://arxiv.org/abs/2405.16708v5Higher-order bialgebraic semantics2026-05-29T14:51:23ZCompositionality proofs in higher-order languages are notoriously involved, and general semantic frameworks guaranteeing compositionality are hard to come by. In particular, Turi and Plotkin's bialgebraic abstract GSOS framework, which provides off-the-shelf compositionality results for first-order languages, so far does not apply to higher-order languages. In the present work, we develop a theory of abstract GSOS specifications for higher-order languages, in effect transferring the core principles of Turi and Plotkin's framework to a higher-order setting. In our theory, the operational semantics of higher-order languages is represented by certain dinatural transformations that we term (pointed) higher-order GSOS laws. We give a general compositionality result that applies to all systems specified in this way and discuss how compositionality of combinatory logics and the lambda-calculus w.r.t. a strong variant of Abramsky's applicative bisimilarity are obtained as instances.2024-05-26T21:59:43ZExtended and updated version of arXiv:2210.13387Journal of Functional Programming, Volume 36 (June 1, 2026) jfp:17738Sergey GoncharovStefan MiliusLutz SchröderStelios TsampasHenning Urbat10.46298/jfp.17738http://arxiv.org/abs/2604.18587v2Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs2026-05-29T11:05:30ZLarge language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.2026-03-13T01:33:20ZGuchan LiRui TianHongning Wang