https://arxiv.org/api/d1ec3b+eKjpDNabTvcmhGbyCvCU2026-06-26T08:37:19Z9951121515http://arxiv.org/abs/2403.14064v3Lean4Lean: Verifying a Typechecker for Lean, in Lean2025-09-14T11:58:14ZIn this paper we present a new "external checker" for the Lean theorem prover, written in Lean itself. This is the first complete typechecker for Lean 4 other than the reference implementation in C++ used by Lean itself, and our new checker is competitive with the original, running between 20% and 50% slower and usable to verify all of Lean's mathlib library, forming an additional step in Lean's aim to self-host the full elaborator and compiler. Moreover, because the checker is written in a language which admits formal verification, it is possible to state and prove properties about the kernel itself, and we report on progress to formalize the Lean type theory abstractly and prove some theorems about it. Finally, we combine these to get a proof of correctness of parts of the kernel. We plan to use this project to help justify any future changes to the kernel and type theory and ensure unsoundness does not sneak in through either the abstract theory or implementation bugs. The verification is already paying off, as one soundness bug has been spotted and fixed as a result of this work.2024-03-21T01:22:52Z14 pages, submitted to CPP 2026Mario Carneirohttp://arxiv.org/abs/2509.11065v1ViScratch: Using Large Language Models and Gameplay Videos for Automated Feedback in Scratch2025-09-14T03:12:44ZBlock-based programming environments such as Scratch are increasingly popular in programming education, in particular for young learners. While the use of blocks helps prevent syntax errors, semantic bugs remain common and difficult to debug. Existing tools for Scratch debugging rely heavily on predefined rules or user manual inputs, and crucially, they ignore the platform's inherently visual nature.
We introduce ViScratch, the first multimodal feedback generation system for Scratch that leverages both the project's block code and its generated gameplay video to diagnose and repair bugs. ViScratch uses a two-stage pipeline: a vision-language model first aligns visual symptoms with code structure to identify a single critical issue, then proposes minimal, abstract syntax tree level repairs that are verified via execution in the Scratch virtual machine.
We evaluate ViScratch on a set of real-world Scratch projects against state-of-the-art LLM-based tools and human testers. Results show that gameplay video is a crucial debugging signal: ViScratch substantially outperforms prior tools in both bug identification and repair quality, even without access to project descriptions or goals. This work demonstrates that video can serve as a first-class specification in visual programming environments, opening new directions for LLM-based debugging beyond symbolic code alone.2025-09-14T03:12:44ZYuan SiDaming LiHanyuan ShiJialu Zhanghttp://arxiv.org/abs/2502.07958v2Actor Capabilities for Message Ordering (Extended Version)2025-09-13T19:51:47ZActor systems are a flexible model of concurrent and distributed programming, which are efficiently implementable, and avoid many classic concurrency bugs by construction. However actor systems must still deal with the challenge of messages arriving in unexpected orderings.
We describe an approach to restricting the orders in which actors send messages to each other, by equipping actor references -- the handle used to address another actor -- with a protocol restricting which message types can be sent to another actor and in which order using that particular actor reference. This endows the actor references with the properties of static (flow-sensitive) capabilities, which we call actor capabilities.
By sending other actors only restricted actor references, they may control which messages are sent in which orders by other actors. Rules for duplicating (splitting) actor references ensure that these restrictions apply even in the presence of delegation. The capabilities themselves restrict message ordering, which may form the foundation for stronger forms of reasoning. We demonstrate this by layering an effect system over the base type system, where the relationships enforced between the actor capabilities and the effects of an actor's behaviour ensure that an actor's behaviour is always prepared to handle any message that may arrive.2025-02-11T21:11:58ZExtended version (with proof sketches) of a chapter for Gul Agha's festschriftColin S. Gordonhttp://arxiv.org/abs/2509.10819v1Arguzz: Testing zkVMs for Soundness and Completeness Bugs2025-09-13T14:36:49ZZero-knowledge virtual machines (zkVMs) are increasingly deployed in decentralized applications and blockchain rollups since they enable verifiable off-chain computation. These VMs execute general-purpose programs, frequently written in Rust, and produce succinct cryptographic proofs. However, zkVMs are complex, and bugs in their constraint systems or execution logic can cause critical soundness (accepting invalid executions) or completeness (rejecting valid ones) issues.
We present Arguzz, the first automated tool for testing zkVMs for soundness and completeness bugs. To detect such bugs, Arguzz combines a novel variant of metamorphic testing with fault injection. In particular, it generates semantically equivalent program pairs, merges them into a single Rust program with a known output, and runs it inside a zkVM. By injecting faults into the VM, Arguzz mimics malicious or buggy provers to uncover overly weak constraints.
We used Arguzz to test six real-world zkVMs (RISC Zero, Nexus, Jolt, SP1, OpenVM, and Pico) and found eleven bugs in three of them. One RISC Zero bug resulted in a $50,000 bounty, despite prior audits, demonstrating the critical need for systematic testing of zkVMs.2025-09-13T14:36:49ZChristoph HochrainerValentin WüstholzMaria Christakishttp://arxiv.org/abs/2509.10694v1Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks2025-09-12T21:14:29ZModern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet, these very techniques add new layers of complexity, introducing silent errors that severely degrade model performance. Existing solutions are either ad hoc or too costly for production.
We present Scalify, a lightweight framework that exposes silent errors by verifying semantic equivalence of computational graphs using equality saturation and Datalog-style reasoning. To scale, Scalify partitions graphs with parallel rewriting and layer memoization, reuses rewrite templates, and augments equality saturation with relational reasoning and symbolic bijection inference. It further localizes discrepancies to precise code sites, turning verification results into actionable debugging guidance. Scalify verifies models as large as Llama-3.1-405B within minutes on a commodity machine and exposed five unknown bugs in Amazon production machine learning frameworks.2025-09-12T21:14:29ZKahfi S. ZulkifliWenbo QianShaowei ZhuYuan ZhouZhen ZhangChang Louhttp://arxiv.org/abs/2504.20291v2Quantum Gate Decomposition: A Study of Compilation Time vs. Execution Time Trade-offs2025-09-11T23:31:43ZSimilar to classical programming, high-level quantum programming languages generate code that cannot be executed directly by quantum hardware and must be compiled. However, unlike classical code, quantum programs must be compiled before each execution, making the trade-off between compilation time and execution time particularly significant. In this paper, we address the first step of quantum compilation: multi-qubit gate decomposition. We analyze the trade-offs of state-of-the-art decomposition algorithms by implementing them in the Ket quantum programming platform and collecting numerical performance data. This is the first study to both implement and analyze the current state-of-the-art decomposition methods within a single platform. Based on our findings, we propose two compilation profiles: one optimized for minimizing compilation time and another for minimizing quantum execution time. Our results provide valuable insights for both quantum compiler developers and quantum programmers, helping them make informed decisions about gate decomposition strategies and their impact on overall performance.2025-04-28T22:24:55ZEvandro C. R. RosaJerusa MarchiEduardo I. DuzzioniRafael de Santiagohttp://arxiv.org/abs/2508.16848v2Syntactic Completions with Material Obligations2025-09-11T22:40:03ZCode editors provide essential services that help developers understand, navigate, and modify programs. However, these services often fail in the presence of syntax errors. Existing syntax error recovery techniques, like panic mode and multi-option repairs, are either too coarse, e.g. in deleting large swathes of code, or lead to a proliferation of possible completions. This paper introduces $\texttt{tall}~\texttt{tylr}$, an error-handling parser and editor generator that completes malformed code with $\textit{syntactic obligations}$ that abstract over many possible completions. These obligations generalize the familiar notion of holes in structure editors to cover missing operands, operators, delimiters, and sort transitions.
$\texttt{tall}~\texttt{tylr}$ is backed by a novel theory of tile-based parsing, conceptually organized around a $\textit{molder}$ that turns tokens into tiles and a $\textit{melder}$ that completes and parses tiles into terms using an error-handling generalization of operator-precedence parsing. We formalize melding as a parsing calculus, $\textsf{meldr}$, that completes input tiles with additional obligations such that it can be parsed into a well-formed term, with success guaranteed over all inputs. We further describe how $\texttt{tall}~\texttt{tylr}$ implements molding and completion-ranking using the principle of $\textit{minimizing obligations}$.
Obligations offer a useful way to scaffold internal program representations, but in $\texttt{tall}~\texttt{tylr}$ we go further to investigate the potential of $\textit{materializing}$ these obligations visually to the programmer. We conduct a user study to evaluate the extent to which an editor like $\texttt{tall}~\texttt{tylr}$ that materializes syntactic obligations might be usable and useful, finding both points of positivity and interesting new avenues for future work.2025-08-23T00:22:31ZDavid MoonAndrew BlinnThomas J. PorterCyrus Omar10.1145/3763182http://arxiv.org/abs/2508.15264v2Exploring the Theory and Practice of Concurrency in the Entity-Component-System Pattern2025-09-11T02:55:06Z The Entity-Component-System (ECS) software design pattern, long used in game development, encourages a clean separation of identity (entities), data properties (components), and computational behaviors (systems). Programs written using the ECS pattern are naturally concurrent, and the pattern offers modularity, flexibility, and performance benefits that have led to a proliferation of ECS frameworks. Nevertheless, the ECS pattern is little-known and not well understood outside of a few domains. Existing explanations of the ECS pattern tend to be mired in the concrete details of particular ECS frameworks, or they explain the pattern in terms of imperfect metaphors or in terms of what it is not. We seek a rigorous understanding of the ECS pattern via the design of a formal model, Core ECS, that abstracts away the details of specific implementations to reveal the essence of software using the ECS pattern. We identify a class of Core ECS programs that behave deterministically regardless of scheduling, enabling use of the ECS pattern as a deterministic-by-construction concurrent programming model. With Core ECS as a point of comparison, we then survey several real-world ECS frameworks and find that they all leave opportunities for deterministic concurrency unexploited. Our findings point out a space for new ECS implementation techniques that better leverage such opportunities.2025-08-21T05:55:07ZThis is an extended version (with appendices) of the OOPSLA 2025 paperPatrick RedmondJonathan CastelloJosé Manuel Calderón TrillaLindsey Kuper10.1145/3763050http://arxiv.org/abs/2509.09059v1Dependent-Type-Preserving Memory Allocation2025-09-10T23:45:19ZDependently typed programming languages such as Coq, Agda, Idris, and F*, allow programmers to write detailed specifications of their programs and prove their programs meet these specifications. However, these specifications can be violated during compilation since they are erased after type checking. External programs linked with the compiled program can violate the specifications of the original program and change the behavior of the compiled program -- even when compiled with a verified compiler. For example, since Coq does not allow explicitly allocating memory, a programmer might link their Coq program with a C program that can allocate memory. Even if the Coq program is compiled with a verified compiler, the external C program can still violate the memory-safe specification of the Coq program by providing an uninitialized pointer to memory. This error could be ruled out by type checking in a language expressive enough to indicate whether memory is initialized versus uninitialized. Linking with a program with an uninitialized pointer could be considered ill-typed, and our linking process could prevent linking with ill-typed programs. To facilitate type checking during linking, we can use type-preserving compilation, which preserves the types through the compilation process. In this ongoing work, we develop a typed intermediate language that supports dependent memory allocation, as well as a dependent-type-preserving compiler pass for memory allocation.2025-09-10T23:45:19ZSubmitted and received second place at the Student Research Competition at Principles of Programming Languages 2022Paulette KoronkevichWilliam J. Bowmanhttp://arxiv.org/abs/2509.09019v1Towards Verified Compilation of Floating-point Optimization in Scientific Computing Programs2025-09-10T21:27:35ZScientific computing programs often undergo aggressive compiler optimization to achieve high performance and efficient resource utilization. While performance is critical, we also need to ensure that these optimizations are correct. In this paper, we focus on a specific class of optimizations, floating-point optimizations, notably due to fast math, at the LLVM IR level. We present a preliminary work, which leverages the Verified LLVM framework in the Rocq theorem prover, to prove the correctness of Fused-Multiply-Add (FMA) optimization for a basic block implementing the arithmetic expression $a * b + c$ . We then propose ways to extend this preliminary results by adding more program features and fast math floating-point optimizations.2025-09-10T21:27:35ZMohit TekriwalJohn Sarracinohttp://arxiv.org/abs/2509.08804v1Approximate Algorithms for Verifying Differential Privacy with Gaussian Distributions2025-09-10T17:37:56ZThe verification of differential privacy algorithms that employ Gaussian distributions is little understood. This paper tackles the challenge of verifying such programs by introducing a novel approach to approximating probability distributions of loop-free programs that sample from both discrete and continuous distributions with computable probability density functions, including Gaussian and Laplace. We establish that verifying $(ε,δ)$-differential privacy for these programs is \emph{almost decidable}, meaning the problem is decidable for all values of $δ$ except those in a finite set. Our verification algorithm is based on computing probabilities to any desired precision by combining integral approximations, and tail probability bounds. The proposed methods are implemented in the tool, DipApprox, using the FLINT library for high-precision integral computations, and incorporate optimizations to enhance scalability. We validate {\ourtool} on fundamental privacy-preserving algorithms, such as Gaussian variants of the Sparse Vector Technique and Noisy Max, demonstrating its effectiveness in both confirming privacy guarantees and detecting violations.2025-09-10T17:37:56ZAn extended abstract appears in CCS 2025Bishnu BhusalRohit ChadhaA. Prasad SistlaMahesh Viswanathan10.1145/3719027.3765043http://arxiv.org/abs/2509.08727v1Securing Cryptographic Software via Typed Assembly Language (Extended Version)2025-09-10T16:17:31ZAuthors of cryptographic software are well aware that their code should not leak secrets through its timing behavior, and, until 2018, they believed that following industry-standard constant-time coding guidelines was sufficient. However, the revelation of the Spectre family of speculative execution attacks injected new complexities.
To block speculative attacks, prior work has proposed annotating the program's source code to mark secret data, with hardware using this information to decide when to speculate (i.e., when only public values are involved) or not (when secrets are in play). While these solutions are able to track secret information stored on the heap, they suffer from limitations that prevent them from correctly tracking secrets on the stack, at a cost in performance.
This paper introduces SecSep, a transformation framework that rewrites assembly programs so that they partition secret and public data on the stack. By moving from the source-code level to assembly rewriting, SecSep is able to address limitations of prior work. The key challenge in performing this assembly rewriting stems from the loss of semantic information through the lengthy compilation process. The key innovation of our methodology is a new variant of typed assembly language (TAL), Octal, which allows us to address this challenge. Assembly rewriting is driven by compile-time inference within Octal. We apply our technique to cryptographic programs and demonstrate that it enables secure speculation efficiently, incurring a low average overhead of $1.2\%$.2025-09-10T16:17:31ZShixin SongTingzhen DongKosi NwabuezeJulian ZandersAndres ErbsenAdam ChlipalaMengjia Yanhttp://arxiv.org/abs/2507.17233v2Hiord#: An Approach to the Specification and Verification of Higher-Order (C)LP Programs2025-09-10T15:58:41ZHigher-order constructs enable more expressive and concise code by allowing procedures to be parameterized by other procedures. Assertions allow expressing partial program specifications, which can be verified either at compile time (statically) or run time (dynamically). In higher-order programs, assertions can also describe higher-order arguments. While in the context of (constraint) logic programming ((C)LP), run-time verification of higher-order assertions has received some attention, compile-time verification remains relatively unexplored. We propose a novel approach for statically verifying higher-order (C)LP programs with higher-order assertions. Although we use the Ciao assertion language for illustration, our approach is quite general and we believe is applicable to similar contexts. Higher-order arguments are described using predicate properties -- a special kind of property which exploits the (Ciao) assertion language. We refine the syntax and semantics of these properties and introduce an abstract criterion to determine conformance to a predicate property at compile time, based on a semantic order relation comparing the predicate property with the predicate assertions. We then show how to handle these properties using an abstract interpretation-based static analyzer for programs with first-order assertions by reducing predicate properties to first-order properties. Finally, we report on a prototype implementation and evaluate it through various examples within the Ciao system.2025-07-23T05:57:15ZAccepted for publication in Theory and Practice of Logic Programming (TPLP)Marco CiccalèDaniel Jurjo-RivasJose F. MoralesPedro López-GarcíaManuel V. Hermenegildo10.1017/S147106842510015Xhttp://arxiv.org/abs/2509.08182v1XML Prompting as Grammar-Constrained Interaction: Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols2025-09-09T23:03:53ZStructured prompting with XML tags has emerged as an effective way to steer large language models (LLMs) toward parseable, schema-adherent outputs in real-world systems. We develop a logic-first treatment of XML prompting that unifies (i) grammar-constrained decoding, (ii) fixed-point semantics over lattices of hierarchical prompts, and (iii) convergent human-AI interaction loops. We formalize a complete lattice of XML trees under a refinement order and prove that monotone prompt-to-prompt operators admit least fixed points (Knaster-Tarski) that characterize steady-state protocols; under a task-aware contraction metric on trees, we further prove Banach-style convergence of iterative guidance. We instantiate these results with context-free grammars (CFGs) for XML schemas and show how constrained decoding guarantees well-formedness while preserving task performance. A set of multi-layer human-AI interaction recipes demonstrates practical deployment patterns, including multi-pass "plan $\to$ verify $\to$ revise" routines and agentic tool use. We provide mathematically complete proofs and tie our framework to recent advances in grammar-aligned decoding, chain-of-verification, and programmatic prompting.2025-09-09T23:03:53Z7 pages, multiple XML promptsFaruk AlpayTaylan Alpayhttp://arxiv.org/abs/2509.07763v1What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source Projects2025-09-09T13:58:46ZContext. Code refactoring improves software quality without changing external behavior. Despite its advantages, its benefits are hindered by the considerable cost of time, resources, and continuous effort it demands. Aim. Understanding why developers refactor, and which metrics capture these motivations, may support wider and more effective use of refactoring in practice. Method. We performed a large-scale empirical study to analyze developers refactoring activity, leveraging Large Language Models (LLMs) to identify underlying motivations from version control data, comparing our findings with previous motivations reported in the literature. Results. LLMs matched human judgment in 80% of cases, but aligned with literature-based motivations in only 47%. They enriched 22% of motivations with more detailed rationale, often highlighting readability, clarity, and structural improvements. Most motivations were pragmatic, focused on simplification and maintainability. While metrics related to developer experience and code readability ranked highest, their correlation with motivation categories was weak. Conclusions. We conclude that LLMs effectively capture surface-level motivations but struggle with architectural reasoning. Their value lies in providing localized explanations, which, when combined with software metrics, can form hybrid approaches. Such integration offers a promising path toward prioritizing refactoring more systematically and balancing short-term improvements with long-term architectural goals.2025-09-09T13:58:46ZMikel RobredoMatteo EspositoFabio PalombaRafael PeñalozaValentina Lenarduzzi