Designing Walrus: Relational Programming with Rich Types, On-Demand Laziness, and Structured Traces

2025-10-28T20:37:01Z

We present Walrus, a functional relational programming language embedded in Haskell that extends the miniKanren model with type-polymorphic unification, on-demand laziness, and a range of usability features aimed at practical development. These include use of Haskell Generics for boilerplate reduction, structured debugging traces, and ergonomic support for product types. We describe the design and implementation of Walrus through the lens of our experience developing bidirectional compilers, and reflect on key design decisions and recurring usability challenges encountered in practice.

The CoCompiler: DSL Lifting via Relational Compilation

2025-10-28T20:26:30Z

Lifting low-level or legacy code into a domain-specific language (DSL) improves our ability to understand it, enables deeper formal reasoning, and facilitates safe modification. We present the CoCompiler, a bidirectional compiler and lifter between C and Lustre, a synchronous dataflow language used for reactive systems. The key insight behind the CoCompiler is that writing a compiler as a relation, rather than as a traditional function, yields a DSL lifter "for free". We implement this idea by rewriting the verified Lustre-to-C compiler Vélus in the Walrus relational programming language. This solves what we call the vertical lifting problem, translating canonical C into Lustre. To address the complementary horizontal problem-handling real-world C outside the compiler's image-we apply semantic-preserving canonicalization passes in Haskell. The resulting tool, the CoCompiler, supports lifting real reactive C code into Lustre and onward into graphical behavioral models. Our approach is modular, language-agnostic, and fast to implement, demonstrating that relational programming offers a practical foundation for building DSL lifters by repurposing existing compilers.

A Roadmap for Tamed Interactions with Large Language Models

2025-10-28T13:46:07Z

We are witnessing a bloom of AI-powered software driven by Large Language Models (LLMs). Although the applications of these LLMs are impressive and seemingly countless, their unreliability hinders adoption. In fact, the tendency of LLMs to produce faulty or hallucinated content makes them unsuitable for automating workflows and pipelines. In this regard, Software Engineering (SE) provides valuable support, offering a wide range of formal tools to specify, verify, and validate software behaviour. Such SE tools can be applied to define constraints over LLM outputs and, consequently, offer stronger guarantees on the generated content. In this paper, we argue that the development of a Domain Specific Language (DSL) for scripting interactions with LLMs using an LLM Scripting Language (LSL) may be key to improve AI-based applications. Currently, LLMs and LLM-based software still lack reliability, robustness, and trustworthiness, and the tools or frameworks to cope with these issues suffer from fragmentation. In this paper, we present our vision of LSL. With LSL, we aim to address the limitations above by exploring ways to control LLM outputs, enforce structure in interactions, and integrate these aspects with verification, validation, and explainability. Our goal is to make LLM interaction programmable and decoupled from training or implementation.

JAX Autodiff from a Linear Logic Perspective (Extended Version)

2025-10-28T08:37:31Z

Autodiff refers to the core of the automatic differentiation systems developed in projects like JAX and Dex. Autodiff has recently been formalised in a linear typed calculus by Radul et al in arXiv:2204.10923. Although this formalisation suffices to express the main program transformations of Autodiff, the calculus is very specific to this task, and it is not clear whether the type system yields a substructural logic that has interest on its own. We propose an encoding of Autodiff into a linear $λ$-calculus that enjoys a Curry-Howard correspondence with Girard's linear logic. We prove that the encoding is sound both qualitatively (the encoded terms are extensionally equivalent to the original ones) and quantitatively (the encoding preserves the original work cost as described in arXiv:2204.10923). As a byproduct, we show that unzipping, one of the transformations used to implement backpropagation in Autodiff, is, in fact, optional.

Formal Verification of a Token Sale Launchpad: A Compositional Approach in Dafny

2025-10-27T20:45:15Z

The proliferation of decentralized financial (DeFi) systems and smart contracts has underscored the critical need for software correctness. Bugs in such systems can lead to catastrophic financial losses. Formal verification offers a path to achieving mathematical certainty about software behavior. This paper presents the formal verification of the core logic for a token sale launchpad, implemented and proven correct using the Dafny programming language and verification system. We detail a compositional, bottom-up verification strategy, beginning with the proof of fundamental non-linear integer arithmetic properties, and building upon them to verify complex business logic, including asset conversion, time-based discounts, and capped-sale refund mechanics. The principal contributions are the formal proofs of critical safety and lifecycle properties. Most notably, we prove that refunds in a capped sale can never exceed the user's original deposit amount, and that the precision loss in round-trip financial calculations is strictly bounded. Furthermore, we verify the complete lifecycle logic, including user withdrawals under various sale mechanics and the correctness of post-sale token allocation, vesting, and claiming. This work serves as a comprehensive case study in applying rigorous verification techniques to build high-assurance financial software.

AI-Assisted Authoring for Transparent, Data-Driven Documents

2025-10-27T13:39:05Z

We introduce _transparent documents_, interactive web-based scholarly articles which allow readers to explore the relationship to the underlying data by hovering over fragments of text, and present an LLM-based tool for authoring transparent documents, building on recent developments in data provenance for general-purpose programming languages. As a target platform, our implementation uses Fluid, an open source programming language with a provenance-tracking runtime. Our agent-based tool supports a human author during the creation of transparent documents, identifying fragments of text which can be computed from data, such as numerical values selected from records or computed by aggregations like sum and mean, comparatives and superlatives like _better than_ and _largest_, trend-adjectives like _growing_, and similar quantitative or semi-quantitative phrases, and then attempts to synthesise a suitable Fluid query over the data which generates the target string. The resulting expression is inserted into the article's web page, turning the static text fragment into an interactable data-driven element able to reveal the data that underwrites the natural language claim. We evaluate our approach on a subset of SciGen, an open source dataset consisting of tables from scientific articles and their corresponding descriptions, which we extend with hand-generated counterfactual test cases to evaluate how well machine-generated expressions generalise. Our results show that gpt4o is often able to synthesise compound expressions extensionally compatible with our gold solutions.

Monoidal closure of Grothendieck constructions via $Σ$-tractable monoidal structures and Dialectica formulas

2025-10-27T12:25:25Z

We examine the categorical structure of the Grothendieck construction $Σ_{\mathsf{C}}\mathsf{L}$ of an indexed category $\mathsf{L} \colon \mathsf{C}^{op} \to \mathsf{CAT}$. Our analysis begins with characterisations of fibred limits, colimits, and monoidal (closed) structures. The study of fibred colimits leads naturally to a generalisation of the notion of extensive indexed category introduced in CHAD for Expressive Total Languages, and gives rise to the concept of left Kan extensivity, which provides a uniform framework for computing colimits in Grothendieck constructions. We then establish sufficient conditions for the (non-fibred) monoidal closure of the total category $Σ_{\mathsf{C}}\mathsf{L}$. This extends Gödel's Dialectica interpretation and rests upon a new notion of $Σ$-tractable monoidal structure. Under this notion, $Σ$-tractable coproducts unify and extend cocartesian coclosed structures, biproducts, and extensive coproducts. Finally, we consider when the induced closed structure is fibred, showing that this need not hold in general, even in the presence of a fibred monoidal structure.

Anvil: A General-Purpose Timing-Safe Hardware Description Language

2025-10-27T04:57:34Z

Expressing hardware designs using hardware description languages (HDLs) routinely involves using stateless signals whose values change according to their underlying registers. Unintended behaviours can arise when the stored values in these underlying registers are mutated while their dependent signals are expected to remain constant across multiple cycles. Such timing hazards are common because, with a few exceptions, existing HDLs lack abstractions for values that remain unchanged over multiple clock cycles, delegating this responsibility to hardware designers. Designers must then carefully decide whether a value should remain unchanged, sometimes even across hardware modules. This paper proposes Anvil, an HDL which statically prevents timing hazards with a novel type system. Anvil is the only HDL we know of that guarantees timing safety, i.e., absence of timing hazards, without sacrificing expressiveness for cycle-level timing control or dynamic timing behaviours. Unlike many HLS languages that abstract away the differences between registers and signals, Anvil's type system exposes them fully while capturing the timing relationships between register value mutations and signal usages to enforce timing safety. This, in turn, enables safe composition of communicating hardware modules by static enforcement of timing contracts that encode timing constraints on shared signals. Such timing contracts can be specified parametric on abstract time points that can vary during run-time, allowing the type system to statically express dynamic timing behaviour. We have implemented Anvil and successfully used it to implement key timing-sensitive modules, comparing them against open-source SystemVerilog counterparts to demonstrate the practicality and expressiveness of the generated hardware.

LatticeHashForest: An Efficient Data Structure for Repetitive Data and Operations

2025-10-26T05:05:00Z

Analysis of entire programs as a single unit, or whole-program analysis, involves propagation of large amounts of information through the control flow of the program. This is especially true for pointer analysis, where, unless significant compromises are made in the precision of the analysis, there is a combinatorial blowup of information. One of the key problems we observed in our own efforts to this end is that a lot of duplicate data was being propagated, and many low-level data structure operations were repeated a large number of times. We present what we consider to be a novel and generic data structure, LatticeHashForest (LHF), to store and operate on such data in a manner that eliminates a majority of redundant computations and duplicate data in scenarios similar to those encountered in compilers and program optimization. LHF differs from similar work in this vein, such as hash-consing, ZDDs, and BDDs, by not only providing a way to efficiently operate on large, aggregate structures, but also modifying the elements of such structures in a manner that they can be deduplicated immediately. LHF also provides a way to perform a nested construction of elements such that they can be deduplicated at multiple levels, cutting down the need for additional, nested computations. We provide a detailed structural description, along with an abstract model of this data structure. An entire C++ implementation of LHF is provided as an artifact along with evaluations of LHF using examples and benchmark programs. We also supply API documentation and a user manual for users to make independent applications of LHF. Our main use case in the realm of pointer analysis shows memory usage reduction to an almost negligible fraction, and speedups beyond 4x for input sizes approaching 10 million when compared to other implementations.

An Empirical Study of Bitwise Operators Intuitiveness through Performance Metrics

2025-10-26T04:37:45Z

Objectives: This study aims to investigate the readability and understandability of bitwise operators in programming, with the main hypothesis that there will be a difference in the performance metrics (response time and error rate) between participants exposed to various bitwise operators related questions and those who are not. Participants: Participants in this human research study include people without programming background, novice programmers, and university students with varying programming experience (from freshmen to PhD level). There were 23 participants in this study. Study Methods: This study uses a within-subjects experimental design to assess how people with diverse programming backgrounds understand and use bitwise operators. Participants complete tasks in a JavaScript program, and their task completion times and task accuracy are recorded for analysis. Findings: The results indicate that operators can be one of the factors predicting response time, showing a small but significant effect (R-squared = 0.032, F(1, 494) = 16.5, p < .001). Additionally, operators such as OR, NOT, and Left Shift showed statistical significance in task completion times compared to other operators. Conclusions: While the complexity of bitwise operators did not generally result in longer task completion times, certain operators were found to be less intuitive, suggesting the need for further investigation and potential redesign for improved understandability.

Lessons Learned So Far From a Community Effort to Verify the Rust Standard Library (work-in-progress)

2025-10-26T00:53:29Z

Although Rust primarily intends to be a safe programming language that excludes undefined behaviour, it provides its users with the escape hatch of unsafe Rust, allowing them to circumvent some of its strong compile-time checks. This additional freedom has some advantages, including potentially more efficient code, which is one of the main reasons why unsafe code is used extensively throughout Rust's standard library. However, because unsafe code also re-opens the door to undefined behaviour, Amazon has convened a community to verify the safety of the standard library, and in particular the unsafe code contained therein. Given that this effort is done in public, is open-sourced, and has seen significant participation (from at least 50 different contributors), we have access to a wealth of information on how people are verifying the standard library, as well as what is currently possible and what still appears to be beyond the state of the art for verified software. In this paper, we discuss the lessons learned thus far from this verification effort, from both our work on it, as well as that of the broader community. In particular, we start by reviewing what has been accomplished thus far, as well as the main tools used (specifically, their advantages and their limitations). We then focus on some of the remaining fundamental obstacles to verifying the standard library, and propose potential solutions to overcome them. We hope that these observations can guide future verification of not only the standard library, but also unsafe Rust code in general.

Bean: A Language for Backward Error Analysis

2025-10-24T15:16:22Z

Backward error analysis offers a method for assessing the quality of numerical programs in the presence of floating-point rounding errors. However, techniques from the numerical analysis literature for quantifying backward error require substantial human effort, and there are currently no tools or automated methods for statically deriving sound backward error bounds. To address this gap, we propose Bean, a typed first-order programming language designed to express quantitative bounds on backward error. Bean's type system combines a graded coeffect system with strict linearity to soundly track the flow of backward error through programs. We prove the soundness of our system using a novel categorical semantics, where every Bean program denotes a triple of related transformations that together satisfy a backward error guarantee. To illustrate Bean's potential as a practical tool for automated backward error analysis, we implement a variety of standard algorithms from numerical linear algebra in Bean, establishing fine-grained backward error bounds via typing in a compositional style. We also develop a prototype implementation of Bean that infers backward error bounds automatically. Our evaluation shows that these inferred bounds match worst-case theoretical relative backward error bounds from the literature, underscoring Bean's utility in validating a key property of numerical programs: numerical stability.

Semantically Separating Nominal Wyvern for Usability and Decidability

2025-10-24T03:31:22Z

The Dependent Object Types (DOT) calculus incorporates concepts from functional languages (e.g. modules) with traditional object-oriented features (e.g. objects, subtyping) to achieve greater expressivity (e.g. F-bounded polymorphism). However, this merger of paradigms comes at the cost of subtype decidability. Recent work on bringing decidability to DOT has either sacrificed expressiveness or ease of use. The unrestricted construction of recursive types and type bounds has made subtype decidability a much harder problem than in traditional object-oriented programming. Recognizing this, our paper introduces Nominal Wyvern, a DOT-like dependent type system that takes an alternative approach: instead of having a uniform structural syntax like DOT, Nominal Wyvern is designed around a "semantic separation" between the nominal declaration of recursive types on the one hand, and the structural refinement of those types when they are used on the other. This design naturally guides the user to avoid writing undecidably recursive structural types. From a technical standpoint, this separation also makes guaranteeing decidability possible by allowing for an intuitive adaptation of material/shape separation, a technique for achieving subtype decidability by separating types responsible for subtyping constraints from types that represent concrete data. The result is a type system with syntax and structure familiar to OOP users that achieves decidability without compromising the expressiveness of F-bounded polymorphism and module systems as they are used in practice.

Chain of Execution Supervision Promotes General Reasoning in Large Language Models

2025-10-24T02:21:11Z

Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal.To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continue-pretraining, instruction tuning after pretraining, and two-stage finetuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1\% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.

Dependent Session Types for Verified Concurrent Programming

2025-10-23T18:18:15Z

We present TLLC which extends the Two-Level Linear dependent type theory (TLL) with session-based concurrency. Equipped with Martin-Löf style dependency, the session types of TLLC allow protocols to specify properties of communicated messages. When used in conjunction with the dependent type machinery already present in TLL, dependent session types facilitate a form of relational verification by relating concurrent programs with their idealized sequential counterparts. Correctness properties proven for sequential programs can be easily lifted to their corresponding concurrent implementations. TLLC makes session types a powerful tool for intrinsically verifying the correctness of data structures such as queues and concurrent algorithms such as map-reduce. To extend TLL with session types, we develop a novel formulation of intuitionistic session type which we believe to be widely applicable for integrating session types into other type systems beyond the context of TLLC. We study the meta-theory of our language, proving its soundness as both a term calculus and a process calculus. To demonstrate the practicality of TLLC, we have implemented a prototype compiler that translates TLLC programs into concurrent C code, which has been extensively evaluated.