https://arxiv.org/api/2UmHkYgRiT37jKCIZ9bVBO0GkEU 2026-06-26T09:47:36Z 9951 1230 15 http://arxiv.org/abs/2509.07609v1 What's in the Box: Ergonomic and Expressive Capture Tracking over Generic Data Structures (Extended Version) 2025-09-09T11:34:40Z

Capturing types in Scala unify static effect and resource tracking with object capabilities, enabling lightweight effect polymorphism with minimal notational overhead. However, their expressiveness has been insufficient for tracking capabilities embedded in generic data structures, preventing them from scaling to the standard collections library -- an essential prerequisite for broader adoption. This limitation stems from the inability to name capabilities within the system's notion of box types. This paper develops System Capless, a new foundation for capturing types that provides the theoretical basis for reach capabilities (rcaps), a novel mechanism for naming "what's in the box." The calculus refines the universal capability notion into a new scheme with existential and universal capture set quantification. Intuitively, rcaps witness existentially quantified capture sets inside the boxes of generic types in a way that does not require exposing existential capture types in the surface language. We have fully mechanized the formal metatheory of System Capless in Lean, including proofs of type soundness and scope safety. System Capless supports the same lightweight notation of capturing types plus rcaps, as certified by a type-preserving translation, and also enables fully optional explicit capture-set quantification to increase expressiveness. Finally, we present a full reimplementation of capture checking in Scala 3 based on System Capless and migrate the entire Scala collections library and an asynchronous programming library to evaluate its practicality and ergonomics. Our results demonstrate that reach capabilities enable the adoption of capture checking in production code with minimal changes and minimal-to-zero notational overhead in a vast majority of cases.

2025-09-09T11:34:40Z Yichen Xu Oliver Bračevac Cao Nguyen Pham Martin Odersky 10.1145/3763112 http://arxiv.org/abs/2509.07551v1 Fast and Extensible Hybrid Embeddings with Micros 2025-09-09T09:41:20Z

Macro embedding is a popular approach to defining extensible shallow embeddings of object languages in Scheme like host languages. While macro embedding has even been shown to enable implementing extensible typed languages in systems like Racket, it comes at a cost: compile-time performance. In this paper, we revisit micros - syntax to intermediate representation (IR) transformers, rather than source syntax to source syntax transformers (macros). Micro embedding enables stopping at an IR, producing a deep embedding and enabling high performance compile-time functions over an efficient IR, before shallowly embedding the IR back into source syntax. Combining micros with several design patterns to enable the IR and functions over it to be extensible, we achieve extensible hybrid embedding of statically typed languages with significantly improved compile-time compared to macro-embedding approaches. We describe our design patterns and propose new abstractions packaging these patterns.

2025-09-09T09:41:20Z 13 pages Sean Bocirnea William J. Bowman 10.1145/3759537.3762696 http://arxiv.org/abs/2508.15137v3 Software Model Checking via Summary-Guided Search (Extended Version) 2025-09-08T23:15:20Z

In this work, we describe a new software model-checking algorithm called GPS. GPS treats the task of model checking a program as a directed search of the program states, guided by a compositional, summary-based static analysis. The summaries produced by static analysis are used both to prune away infeasible paths and to drive test generation to reach new, unexplored program states. GPS can find both proofs of safety and counter-examples to safety (i.e., inputs that trigger bugs), and features a novel two-layered search strategy that renders it particularly efficient at finding bugs in programs featuring long, input-dependent error paths. To make GPS refutationally complete (in the sense that it will find an error if one exists, if it is allotted enough time), we introduce an instrumentation technique and show that it helps GPS achieve refutation-completeness without sacrificing overall performance. We benchmarked GPS on a diverse suite of benchmarks including programs from the Software Verification Competition (SV-COMP), from prior literature, as well as synthetic programs based on examples in this paper. We found that our implementation of GPS outperforms state-of-the-art software model checkers (including the top performers in SV-COMP ReachSafety-Loops category), both in terms of the number of benchmarks solved and in terms of running time.

2025-08-21T00:19:43Z Extended version of paper in OOPSLA 2025 (with typo and stylistic fixes compared to v2 manuscript). 37 pages Proc. ACM Program. Lang. 9, OOPSLA2, Article 364 (October 2025) Ruijie Fang Zachary Kincaid Thomas Reps 10.1145/3763142 http://arxiv.org/abs/2507.03659v3 Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs 2025-09-08T19:11:06Z

Debugging and repairing faults when programs fail to formally verify can be complex and time-consuming. Automated Program Repair (APR) can ease this burden by automatically identifying and fixing faults. However, traditional APR techniques often rely on test suites for validation, but these may not capture all possible scenarios. In contrast, formal specifications provide strong correctness criteria, enabling more effective automated repair. In this paper, we present an APR tool for Dafny, a verification-aware programming language that uses formal specifications - including pre-conditions, post-conditions, and invariants - as oracles for fault localization and repair. Assuming the correctness of the specifications and focusing on arithmetic bugs, we localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program, and applying Large Language Models (LLMs) to synthesize candidate fixes. The models considered are GPT-4o mini, Llama 3, Mistral 7B, and Llemma 7B. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs. Our tool achieves 89.6% fault localization coverage and GPT-4o mini yields the highest repair success rate of 74.18%. These results highlight the potential of combining formal reasoning with LLM-based program synthesis for automated program repair.

2025-07-04T15:36:12Z Valentina Wu Alexandra Mendes Alexandre Abreu http://arxiv.org/abs/2509.06872v1 Mechanized Metatheory of Forward Reasoning for End-to-End Linearizability Proofs 2025-09-08T16:42:02Z

In the past decade, many techniques have been developed to prove linearizability, the gold standard of correctness for concurrent data structures. Intuitively, linearizability requires that every operation on a concurrent data structure appears to take place instantaneously, even when interleaved with other operations. Most recently, Jayanti et al. presented the first sound and complete "forward reasoning" technique for proving linearizability that relates the behavior of a concurrent data structure to a reference atomic data structure as time moves forward. This technique can be used to produce machine-checked proofs of linearizability in TLA+. However, while Jayanti et al.'s approach is shown to be sound and complete, a mechanization of this important metatheoretic result is still outstanding. As a result, it is not possible to produce verified end-to-end proofs of linearizability. To reduce the size of this trusted computing base, we formalize this forward reasoning technique and mechanize proofs of its soundness and completeness in Rocq. As a case study, we use the approach to produce a verified end-to-end proof of linearizability for a simple concurrent register.

2025-09-08T16:42:02Z Zachary Kent Ugur Y. Yavuz Siddhartha Jayanti Stephanie Balzer Guy Blelloch http://arxiv.org/abs/2509.06845v1 MIO: Multiverse Debugging in the Face of Input/Output -- Extended Version with Additional Appendices 2025-09-08T16:15:18Z

Debugging non-deterministic programs on microcontrollers is notoriously challenging, especially when bugs manifest in unpredictable, input-dependent execution paths. A recent approach, called multiverse debugging, makes it easier to debug non-deterministic programs by allowing programmers to explore all potential execution paths. Current multiverse debuggers enable both forward and backward traversal of program paths, and some facilitate jumping to any previously visited states, potentially branching into alternative execution paths within the state space. Unfortunately, debugging programs that involve input/output operations using existing multiverse debuggers can reveal inaccessible program states, i.e. states which are not encountered during regular execution. This can significantly hinder the debugging process, as the programmer may spend substantial time exploring and examining inaccessible program states, or worse, may mistakenly assume a bug is present in the code, when in fact, the issue is caused by the debugger. This paper presents a novel approach to multiverse debugging, which can accommodate a broad spectrum of input/output operations. We provide the semantics of our approach and prove the correctness of our debugger, ensuring that despite having support for a wide range of input/output operations the debugger will only explore those program states which can be reached during regular execution. We have developed a prototype, called MIO, leveraging the WARDuino WebAssembly virtual machine to demonstrate the feasibility and efficiency of our techniques. As a demonstration of the approach we highlight a color dial built with a Lego Mindstorms motor, and color sensor, providing a tangible example of how our approach enables multiverse debugging for programs running on an STM32 microcontroller.

2025-09-08T16:15:18Z This extended version provides auxiliary material to the article of the same title that will appear in the ACM Digital Library as part of the PACMPL issue for OOPSLA 2025 Tom Lauwaerts Maarten Steevens Christophe Scholliers 10.1145/3763136 http://arxiv.org/abs/2509.06794v1 Dato: A Task-Based Programming Model for Dataflow Accelerators 2025-09-08T15:22:51Z

Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator's spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.

2025-09-08T15:22:51Z Shihan Fang Hongzheng Chen Niansong Zhang Jiajie Li Han Meng Adrian Liu Zhiru Zhang http://arxiv.org/abs/2509.06724v1 Pacing Types: Safe Monitoring of Asynchronous Streams 2025-09-08T14:22:00Z

Stream-based monitoring is a real-time safety assurance mechanism for complex cyber-physical systems such as unmanned aerial vehicles. In this context, a monitor aggregates streams of input data from sensors and other sources to give real-time statistics and assessments of the system's health. Since monitors are safety-critical components, it is crucial to ensure that they are free of potential runtime errors. One of the central challenges in designing reliable stream-based monitors is to deal with the asynchronous nature of data streams: in concrete applications, the different sensors being monitored produce values at different speeds, and it is the monitor's responsibility to correctly react to the asynchronous arrival of different streams of values. To ease this process, modern frameworks for stream-based monitoring such as RTLola feature an expressive specification language that allows to finely specify data synchronization policies. While this feature dramatically simplifies the design of monitors, it can also lead to subtle runtime errors. To mitigate this issue, this paper presents pacing types, a novel type system implemented in RTLola to ensure that monitors for asynchronous streams are well-behaved at runtime. We formalize the essence of pacing types for a core fragment of RTLola, and present a soundness proof of the pacing type system using a new logical relation.

2025-09-08T14:22:00Z Florian Kohn Arthur Correnson Jan Baumeister Bernd Finkbeiner http://arxiv.org/abs/2506.22370v4 Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny 2025-09-07T11:26:53Z

Students in computing education increasingly use large language models (LLMs) such as ChatGPT. Yet, the role of LLMs in supporting cognitively demanding tasks, like deductive program verification, remains poorly understood. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny, a language that supports functional correctness, by allowing programmers to write formal specifications and automatically verifying that the implementation satisfies the specification. We conducted a mixed-methods study with master's students enrolled in a formal methods course. Each participant completed two verification problems, one with access to a custom ChatGPT interface that logged all interactions, and the other without. We identified strategies used by successful students and assessed the level of trust students place in LLMs. Our findings show that students perform significantly better when using ChatGPT; however, performance gains are tied to prompt quality. We conclude with practical recommendations for integrating LLMs into formal methods courses more effectively, including designing LLM-aware challenges that promote learning rather than substitution.

2025-06-27T16:34:13Z Carolina Carreira Álvaro Silva Alexandre Abreu Alexandra Mendes http://arxiv.org/abs/2506.23696v2 What Challenges Do Developers Face When Using Verification-Aware Programming Languages? 2025-09-07T11:10:45Z

Software reliability is critical in ensuring that the digital systems we depend on function correctly. In software development, increasing software reliability often involves testing. However, for complex and critical systems, developers can use Design by Contract (DbC) methods to define precise specifications that software components must satisfy. Verification-Aware (VA) programming languages support DbC and formal verification at compile-time or run-time, offering stronger correctness guarantees than traditional testing. However, despite the strong guarantees provided by VA languages, their adoption remains limited. In this study, we investigate the barriers to adopting VA languages by analyzing developer discussions on public forums using topic modeling techniques. We complement this analysis with a developer survey to better understand the practical challenges associated with VA languages. Our findings reveal key obstacles to adoption, including steep learning curves and usability issues. Based on these insights, we identify actionable recommendations to improve the usability and accessibility of VA languages. Our findings suggest that simplifying tool interfaces, providing better educational materials, and improving integration with everyday development environments could improve the usability and adoption of these languages. Our work provides actionable insights for improving the usability of VA languages and making verification tools more accessible.

2025-06-30T10:17:39Z Francisco Oliveira Alexandra Mendes Carolina Carreira http://arxiv.org/abs/2509.00948v2 Decision Procedure for A Theory of String Sequences 2025-09-07T04:40:03Z

The theory of sequences, supported by many SMT solvers, can model program data types including bounded arrays and lists. Sequences are parameterized by the element data type and provide operations such as accessing elements, concatenation, forming sub-sequences and updating elements. Strings and sequences are intimately related; many operations, e.g., matching a string according to a regular expression, splitting strings, or joining strings in a sequence, are frequently used in string-manipulating programs. Nevertheless, these operations are typically not directly supported by existing SMT solvers, which instead only consider the generic theory of sequences. In this paper, we propose a theory of string sequences and study its satisfiability. We show that, while it is undecidable in general, the decidability can be recovered by restricting to the straight-line fragment. This is shown by encoding each string sequence as a string, and each string sequence operation as a corresponding string operation. We provide pre-image computation for the resulting string operations with respect to automata, effectively casting it into the generic OSTRICH string constraint solving framework. We implement the new decision procedure as a tool $\ostrichseq$, and carry out experiments on benchmark constraints generated from real-world JavaScript programs, hand-crafted templates and unit tests. The experiments confirm the efficacy of our approach.

2025-08-31T17:57:01Z 21 pages, 2 tables, APLAS 2025 Denghang Hu Taolue Chen Philipp Rümmer Fu Song Zhilin Wu http://arxiv.org/abs/2509.05462v1 The compact double category $\mathbf{Int}(\mathbf{Poly}_*)$ models control flow and data transformations 2025-09-05T19:35:08Z

Hasegawa showed that control flow in programming languages -- while loops and if-then-else statements -- can be modeled using traced cocartesian categories, such as the category $\mathbf{Set}_*$ of pointed sets. In this paper we define an operad $\mathscr{W}$ of wiring diagrams that provides syntax for categories whose control flow moreover includes data transformations, including deleting, duplicating, permuting, and applying pre-specified functions to variables. In the most basic version, the operad underlies $\mathbf{Int}(\mathbf{Poly}_*)$, where $\mathbf{Int}(\mathscr{T})$ denotes the free compact category on a traced category $\mathscr{T}$, as defined by Joyal, Street, and Verity; to do so, we show that $\mathbf{Poly}_*$, as well as any multivariate version of it, is traced. We show moreover that whenever $\mathscr{T}$ is uniform -- a condition also defined by Hasegawa and satisfied by $\mathbf{Int}(\mathscr{T})$ -- the resulting $\mathbf{Int}$-construction extends to a double category $\mathbb{I}\mathbf{nt}(\mathscr{T})$, which is compact in the sense of Patterson. Finally, we define a universal property of the double category $\mathbb{I}\mathbf{nt}(\mathbf{Poly}_*)$ and $\mathbb{I}\mathbf{nt}(\mathbf{Set}_*)$ by which one can track trajectories as they move through the control flow associated to a wiring diagram.

2025-09-05T19:35:08Z 28 pages including many diagrams Grigory Kondyrev David I. Spivak http://arxiv.org/abs/2509.07003v1 veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD 2025-09-05T19:29:00Z

Large Language Models (LLMs) have scaled rapidly in size and complexity, requiring increasingly intricate parallelism for distributed training, such as 3D parallelism. This sophistication motivates a shift toward simpler, more debuggable programming paradigm like Single Program Multiple Data (SPMD). However, SPMD in eager execution introduces two key challenges: ensuring consistency with single-device execution and achieving high performance at scale. In this paper, we introduce veScale, an eager-mode training system that fully embraces SPMD paradigm to democratize distributed tensor programming. veScale addresses the prevalent issue of inconsistent results in systems like PyTorch by introducing a novel algorithm of distributed Random Number Generation (RNG) compatible with arbitrary sharded operators. veScale also significantly boosts training performance by reducing PyTorch primitive's overhead and improving communication efficiency. Evaluations show that veScale delivers up to 2.2x speedup over the state-of-the-art training systems, like TorchTitan, and cuts code complexity by 78.4%, while preserving single-device-equivalent results.

2025-09-05T19:29:00Z 21 pages, 16 figures, 5 tables Youjie Li Cheng Wan Zhiqi Lin Hongyu Zhu Jiacheng Yang Ziang Song Xinyi Di Jiawei Wu Huiyao Shu Wenlei Bao Yanghua Peng Haibin Lin Li-Wen Chang http://arxiv.org/abs/2509.05293v1 Non-Termination Proving: 100 Million LoC and Beyond 2025-09-05T17:58:45Z

We report on our tool, Pulse Infinite, that uses proof techniques to show non-termination (divergence) in large programs. Pulse Infinite works compositionally and under-approximately: the former supports scale, and the latter ensures soundness for proving divergence. Prior work focused on small benchmarks in the tens or hundreds of lines of code (LoC), and scale limits their practicality: a single company may have tens of millions, or even hundreds of millions of LoC or more. We report on applying Pulse Infinite to over a hundred million lines of open-source and proprietary software written in C, C++, and Hack, identifying over 30 previously unknown issues, establishing a new state of the art for detecting divergence in real-world codebases.

2025-09-05T17:58:45Z 14 pages, 4 figures Julien Vanegue Jules Villard Peter O'Hearn Azalea Raad http://arxiv.org/abs/2509.05160v1 AI-Assisted Modeling: DSL-Driven AI Interactions 2025-09-05T14:56:18Z

AI-assisted programming greatly increases software development performance. We enhance this potential by integrating transparency through domain-specific modeling techniques and providing instantaneous, graphical visualizations that accurately represent the semantics of AI-generated code. This approach facilitates visual inspection and formal verification, such as model checking. Formal models can be developed using programming, natural language prompts, voice commands, and stage-wise refinement, with immediate feedback after each transformation step. This support can be tailored to specific domains or intended purposes, improving both code generation and subsequent validation processes. To demonstrate the effectiveness of this approach, we have developed a prototype as a Visual Studio Code extension for the Lingua Franca language. This prototype showcases the potential for novel domain-specific modeling practices, offering an advancement in how models are created, visualized, and verified.

2025-09-05T14:56:18Z 7 pages, 4 figures Steven Smyth Daniel Busch Moez Ben Haj Hmida Edward A. Lee Bernhard Steffen