https://arxiv.org/api/vMt7tBhoWvkgsDmLrAAAfxkmqpA 2026-06-14T18:29:11Z 9885 450 15 http://arxiv.org/abs/2509.04253v3 When Lifetimes Liberate: A Type System for Arenas with Higher-Order Reachability Tracking 2026-03-28T06:57:26Z Static resource management in languages remains challenging due to tensions among control, expressiveness, and flexibility. Region-based systems [Grossman et al . 2002; Tofte et al. 2001] offer bulk deallocation via lexically scoped regions, where all allocations follow a stack discipline. However, both regions and their resources are second-class, and neither can escape its scope nor be freely returned. Ownership and linear type systems, exemplified by Rust [Clarke et al. 2013], offer non-lexical lifetimes and robust static guarantees, but rely on invariants that limit higher-order patterns and expressive sharing. In this work, we propose a new type system that unifies these strengths. Our system treats all heap-allocated resources as first-class values, while allowing programmers to control lifetime and granularity through three allocation modes: (1) fresh allocation for individual, non-lexical references; (2) subsequent coallocation grouping resources collectively within shadow arenas; and (3) scoped allocation with lexically bounded lifetimes following stack discipline. Regardless of mode, all resources share a uniform type and have no distinction for generic abstractions, preserving the higher-order parametric nature of the language. Obtaining static safety in higher-order languages with flexible sharing is nontrivial. We address this by extending reachability types [Wei et al. 2024] to collectively track first-class resources, and by adopting flow-insensitive deallocation reasoning for selective stack discipline. These mechanisms yield Aq<: and {A}q<: atop, both formalized and proven type safe and memory safe in Rocq. 2025-09-04T14:28:41Z Siyuan He Songlin Jia Yuyan Bao Tiark Rompf http://arxiv.org/abs/2603.27107v1 Bit-Vector CHC Solving for Binary Analysis and Binary Analysis for Bit-Vector CHC Solving 2026-03-28T03:25:50Z For high-assurance software, source-level reasoning is insufficient: we need binary-level guarantees. Despite constrained Horn clause (CHC) solving being one of the most popular forms of automated verification, prior work has not evaluated the viability of CHC solving for binary analysis. To fill this gap, we assemble a pipeline that encodes binary analysis problems as CHCs in the SMT logic of quantifier-free bit vectors, and show that off-the-shelf CHC solvers achieve reasonable success on binaries compiled from 983 C invariant inference benchmarks: a portfolio solves 59.5% and 66.1% of the problems derived from the unoptimized and optimized binaries, respectively -- roughly equal to the success rate of a leading C verifier on the source code (60.1%). Moreover, we show that binary analysis provides a valuable source of bit-vector CHC benchmarks (which are in short supply): binary-derived problems differ from existing benchmarks both structurally and in solver success rates and rankings. Augmenting CHC solving competitions with binary-derived benchmarks will encourage solver developers to improve bit-vector reasoning, in turn making CHC solving a more effective tool for binary analysis. 2026-03-28T03:25:50Z 23 pages, 6 figures, to appear in the 18th NASA Formal Methods Symposium (NFM 2026) Aaron Bembenek The University of Melbourne Toby Murray The University of Melbourne http://arxiv.org/abs/2603.15855v2 Mixing Visual and Textual Code 2026-03-27T23:15:52Z The dominant programming languages support nothing but linear text to express domain-specific geometric ideas. What is needed are hybrid languages that allow developers to create visual syntactic constructs so that they can express their ideas with a mix of textual and visual syntax tailored to an application domain. This mix must put the two kinds of syntax on equal footing and, just as importantly, the extended language must not disrupt a programmer's typical workflow. This means that any new visual syntax should be a proper language extension that is composable with other language features. Furthermore, the extensions should also preserve static reasoning about the program. This paper presents Hybrid ClojureScript the first such hybrid programming language. Hybrid ClojureScript allows programmers to add visual interactive syntax and to embed instances of this syntax within a program's text. An enhanced hybrid IDE can then display these embedded instances as mini-GUIs that programmers interact with, while other IDEs will show a textual representation of the syntax. The paper argues the necessity of such an extensibility mechanism, demonstrates the adoptability of the design, and discusses what might be needed to use the design in other languages. 2026-03-16T19:38:07Z to be published in JFP Leif Andersen Michael Ballantyne Cameron Moy Matthias Felleisen Stephen Chang http://arxiv.org/abs/2603.27015v1 Sheaf-Cohomological Program Analysis: Unifying Bug Finding, Equivalence, and Verification via Čech Cohomology 2026-03-27T22:10:58Z We present a framework in which program analysis -- type checking, bug finding, and equivalence verification -- is organized as computing the Čech cohomology of a semantic presheaf over a program's site category. The presheaf assigns refinement-type information to observation sites and restricts it along data-flow morphisms. The cohomology group $H^{0}$ is the space of globally consistent typings. The first cohomology group $H^{1}$ classifies gluing obstructions -- bugs, type errors, and equivalence failures -- each localized to a specific pair of disagreeing sites. This formulation yields three concrete results unavailable in prior work: (1) the rank of $H^1$ over $F_{2}$ counts the minimum independent fixes; (2) $H_{1}(U, Iso) = 0$ is sound and complete for behavioral equivalence; (3) Mayer-Vietoris enables compositional, incremental obstruction counting. We implement the framework in Deppy, a Python analysis tool, and evaluate it on a suite of 375~benchmarks: 133~bug-detection programs, 134~equivalence pairs, and 108~specification-satisfaction checks. Deppy achieves {100% bug-detection recall} (69% precision, F1 = 81%), 99% equivalence accuracy with zero false equivalences, and 98% spec accuracy with zero false satisfactions -- outperforming mypy and pyright, which report zero findings on unannotated code. The analysis models Python semantics as algebraic geometry: variables live on the generic fiber (non-None) unless on the closed nullable subscheme, integers form Spec($\mathbb{Z}$) with no bounded section (no overflow), and short-circuit evaluation defines an open-set topology on the presheaf. 2026-03-27T22:10:58Z Halley Young http://arxiv.org/abs/2604.18593v1 HELIX: Verified compilation of cyber-physical control systems to LLVM IR 2026-03-27T21:36:49Z This paper presents the design of HELIX, an end-to-end verified code generation system with a focus on the intersection of high-performance and high-assurance numerical computing. The code generation can be fine-tuned to generate efficient code for a broad set of computer architectures while providing formal guarantees of the correctness of such generated code. Using a real-life example of a cyber-physical robot system, this paper demonstrates how, by using HELIX, one can start from a high-level mathematical formulation of the problem, apply a series of algebraic transformations that target intermediate languages, and generate an efficient imperative implementation. This is done while formally verifying semantic preservation from the original formulation down to LLVM IR. The method we used for high-performance code compilation is the algebraic transformation of vector and matrix computations into a dataflow optimised for parallel or vectorised processing on target hardware. The abstraction used to formalise and verify this technique is an operator language and accompanying semantics-preserving term rewriting. We use sparse vector abstraction to represent partial computations, enabling us to use algebraic reasoning to prove parallel decomposition properties. HELIX's verification infrastructure comprises multiple intermediate languages and verification approaches, all implemented in the Coq proof assistant. In particular, it uses verified term rewriting, translation validation, metaprogramming, verified compilation, layered monadic interpreters; it also supports application-specific uses of (verified) numerical analysis as we demonstrate via the running example. 2026-03-27T21:36:49Z Vadim Zaliva Yannick Zakowski Ilia Zaichuk Valerii Huhnin Calvin Beck Irene Yoon Steve Zdancewic http://arxiv.org/abs/2603.26996v1 FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified? 2026-03-27T21:14:53Z We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models. 2026-03-27T21:14:53Z Accepted at ICLR 2026 Workshop: VerifAI-2: The Second Workshop on AI Verification in the Wild. Live leaderboard hosted here: https://www.vals.ai/benchmarks/proof_bench Nikil Ravi Kexing Ying Vasilii Nesterov Rayan Krishnan Elif Uskuplu Bingyu Xia Janitha Aswedige Langston Nashold http://arxiv.org/abs/2603.26139v1 On the computational complexity of JavaScript regex matching 2026-03-27T07:49:36Z Despite widespread use, the complexity class of modern regular expression matching was not well-understood. Previous work proved that regular expression matching with backreferences and lookarounds was PSPACE-complete, but the proof was not mechanized and applied to an abstract regex language. This paper clarifies the question for JavaScript regular expressions. In this paper, we prove the following new results, with most core proofs mechanized in the Rocq proof assistant. We prove that JavaScript regex matching is indeed PSPACE-hard, even without negative lookarounds, and OptP-hard as well; that JavaScript regex matching without lower-bounded quantifiers (i.e. quantifiers with a non-zero minimum number of repetitions) is PSPACE-complete; and that JavaScript regex matching without lower-bounded quantifiers and without lookarounds is OptP-complete. 2026-03-27T07:49:36Z Victor Deng Aurèle Barrière Clément Pit-Claudel http://arxiv.org/abs/2603.25337v2 On Representability of Multiple-Valued Functions by Linear Lambda Terms Typed with Second-order Polymorphic Type System 2026-03-27T02:16:22Z We show that any multiple-valued function can be represented by a linear lambda term typed in a second-order polymorphic type system, using two distinct styles. The first is a circuit style, which mimics combinational circuits in switching theory. The second is an inductive style, which follows a more traditional mathematical approach. We also discuss several optimizations for these representations. Furthermore, we present a case study that demonstrates the potential applications of our approach across various domains. 2026-03-26T11:29:10Z Satoshi Matsuoka http://arxiv.org/abs/2603.24940v1 Evaluating adaptive and generative AI-based feedback and recommendations in a knowledge-graph-integrated programming learning system 2026-03-26T02:04:36Z This paper introduces the design and development of a framework that integrates a large language model (LLM) with a retrieval-augmented generation (RAG) approach leveraging both a knowledge graph and user interaction history. The framework is incorporated into a previously developed adaptive learning support system to assess learners' code, generate formative feedback, and recommend exercises. Moerover, this study examines learner preferences across three instructional modes; adaptive, Generative AI (GenAI), and hybrid GenAI-adaptive. An experimental study was conducted to compare the learning performance and perception of the learners, and the effectiveness of these three modes using four key log features derived from 4956 code submissions across all experimental groups. The analysis results show that learners receiving feedback from GenAI modes had significantly more correct code and fewer code submissions missing essential programming logic than those receiving feedback from adaptive mode. In particular, the hybrid GenAI-adaptive mode achieved the highest number of correct submissions and the fewest incorrect or incomplete attempts, outperforming both the adaptive-only and GenAI-only modes. Questionnaire responses further indicated that GenAI-generated feedback was widely perceived as helpful, while all modes were rated positively for ease of use and usefulness. These results suggest that the hybrid GenAI-adaptive mode outperforms the other two modes across all measured log features. 2026-03-26T02:04:36Z Computers and Education: Artificial Intelligence, Volume 10, June 2026, 100526 Lalita Na Nongkhai Jingyun Wang Adam Wynn Takahiko Mendori 10.1016/j.caeai.2025.100526 http://arxiv.org/abs/2406.07737v2 The Future of AI-Driven Software Engineering 2026-03-26T00:24:30Z A paradigm shift is underway in Software Engineering, with AI systems such as LLMs playing an increasingly important role in boosting software development productivity. This trend is anticipated to persist. In the next years, we expect a growing symbiotic partnership between human software developers and AI. The Software Engineering research community cannot afford to overlook this trend; we must address the key research challenges posed by the integration of AI into the software development process. In this paper, we present our vision of the future of software development in an AI-driven world and explore the key challenges that our research community should address to realize this vision. 2024-06-11T21:46:19Z **Note** Published in ACM Transactions on Software Engineering and Methodology (TOSEM) ACM Transactions on Software Engineering and Methodology, Volume 34, Issue 5, Article 120 (May 2025) Valerio Terragni Annie Vella Partha Roop Kelly Blincoe 10.1145/3715003 http://arxiv.org/abs/2603.24812v1 Numerical Superoptimization for Library Learning 2026-03-25T20:47:15Z Numerical software depends on fast, accurate implementations of mathematical primitives like sin, exp, and log. Modern superoptimizers can optimize floating-point kernels against a given set of such primitives, but a more fundamental question remains open: which new primitives are worth implementing in the first place? We formulate this as numerical library learning: given a workload of floating-point kernels, identify the mathematical primitives whose expert implementations would most improve speed and accuracy. Our key insight is that numerical superoptimizers already have the machinery well-suited to this problem. Their search procedures happen to enumerate candidate primitives, their equivalence procedures can generalize and deduplicate candidates, and their cost models can estimate counterfactual utility: how much the workload would improve if a given primitive were available. We present GrowLibm, which repurposes the Herbie superoptimizer as a numerical library learner. GrowLibm mines candidate primitives from the superoptimizer's intermediate search results, ranks them by counterfactual utility, and prunes redundant candidates. Across three scientific applications (PROJ, CoolProp, and Basilisk), GrowLibm identifies compact, reusable primitives that can be implemented effectively using standard numerical techniques. When Herbie is extended with these expert implementations, kernel speed improves by up to 2.2x at fixed accuracy, and maximum achievable accuracy also improves, in one case from 56.0% to 93.5%. We also prototype an LLVM matcher that recognizes learned primitives in optimized IR, recovering 26 replacement sites across five PROJ projections and improving end-to-end application performance by up to 5%. 2026-03-25T20:47:15Z Jonas Regehr Mitch Briles Zachary Tatlock Pavel Panchekha http://arxiv.org/abs/2505.17703v3 Gradient-Based Program Repair: Fixing Bugs in Continuous Program Spaces 2026-03-25T20:19:39Z Automatic program repair seeks to generate correct code from buggy programs, with most approaches searching the correct program in a discrete, symbolic space of source code tokens. This symbolic search is fundamentally limited by its inability to directly reason about program behavior. We introduce Gradient-Based Program Repair (GBPR), a new approach that recasts program repair as continuous optimization in a differentiable numerical program space. Our core insight is to compile symbolic programs into differentiable numerical representations, enabling search in the numerical program space directly guided by program behavior. To evaluate GBPR, we present RaspBugs, a new benchmark of 1,466 buggy symbolic RASP programs and their respective numerical representations. Our experiments demonstrate that GBPR can effectively repair buggy symbolic programs by gradient-based optimization in the numerical program space, with convincing repair trajectories. To our knowledge, we are the first to state program repair as continuous optimization in a numerical program space. Our work demonstrates the feasibility of this direction for program repair research, bridging continuous optimization and program behavior. 2025-05-23T10:12:09Z André Silva Gustav Thorén Martin Monperrus http://arxiv.org/abs/2604.00039v1 Transformers for Program Termination 2026-03-25T17:40:26Z Determining whether a program terminates is a core challenge in program analysis with direct implications for correctness, verification, and security. We investigate whether transformer architectures can recognise termination patterns directly from source code and how their strengths can be amplified through ensembles. To overcome the extreme scarcity of non-terminating examples, we design an ensemble framework of compact transformer encoders, systematically trained with a suite of imbalance-aware loss functions and class-aware sampling techniques. By combining models trained with distinct loss functions, our ensembles achieve substantially stronger performance than any single transformer, outperforming both powerful off-the-shelf LLMs and graph-based methods. Finally, we introduce an attribution pipeline that produces syntax-aware explanations for the termination estimation. 2026-03-25T17:40:26Z 12 pages Yoav Alon Cristina David http://arxiv.org/abs/2511.08462v4 QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities 2026-03-25T15:48:20Z Static analysis tools provide a powerful means to detect security vulnerabilities by specifying queries that encode vulnerable code patterns. However, writing such queries is challenging and requires diverse expertise in security and program analysis. To address this challenge, we present QLCoder - an agentic framework that automatically synthesizes queries in CodeQL, a powerful static analysis engine, directly from a given CVE metadata. QLCode embeds an LLM in a synthesis loop with execution feedback, while constraining its reasoning using a custom MCP interface that allows structured interaction with a Language Server Protocol (for syntax guidance) and a RAG database (for semantic retrieval of queries and documentation). This approach allows QLCoder to generate syntactically and semantically valid security queries. We evaluate QLCode on 176 existing CVEs across 111 Java projects. Building upon the Claude Code agent framework, QLCoder synthesizes correct queries that detect the CVE in the vulnerable but not in the patched versions for 53.4% of CVEs. In comparison, using only Claude Code synthesizes 10% correct queries. QLCoder code is available publicly at https://github.com/neuralprogram/QLCoder. 2025-11-11T17:06:04Z Claire Wang Ziyang Li Saikat Dutta Mayur Naik http://arxiv.org/abs/2603.24199v1 Adapting the MVVM pattern to C++ frontends and Agda-based backends 2026-03-25T11:21:11Z Using agda2hs and ad-hoc Haskell FFI bindings, writing Qt applications in C++ with Agda- or Haskell-based backends (possibly including correctness proofs) is already possible. However, there was no repeatable methodology to do so, nor to use \emph{arbitrary} Haskell built-in libraries in Agda code. We present a well-documented, general methodology to address this, applying the ideas of the Model-View-ViewModel architecture to models implemented in functional languages. This is augmented by a software development kit providing easy installation and automated compilation. For obstacles arising, we provide solutions and ideas that are novel contributions by themselves. We describe and compare solutions for using arbitrary Haskell built-ins in Agda code, highlighting their advantages and disadvantages. Also, for user interruption, we present a new Haskell future design that, to the best of our knowledge, is the first to provide for arbitrary interruption and the first to provide for interruption via direct FFI calls from C and C++. Finally, we prove with benchmarks that the agda2hs compiler at the base of our methodology is viable when compared to other solutions, specifically to the OCaml extraction feature of Rocq and the default MAlonzo backend of Agda. 2026-03-25T11:21:11Z Viktor Csimma