https://arxiv.org/api/E00kiwOphC27MrzGzKEDYZg9+B4 2026-06-14T01:06:05Z 9885 195 15 http://arxiv.org/abs/2605.09610v1 SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications 2026-05-10T15:47:46Z

We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs' literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.

2026-05-10T15:47:46Z Abhinav Goel Agostino Capponi Alfio Gliozzo Chaitya Shah http://arxiv.org/abs/2604.26961v2 Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding 2026-05-10T14:14:59Z

Static program slicing is a fundamental software engineering technique for isolating code relevant to specific variables. While recent learning-based approaches using language models (LMs) show promise in automating slice prediction, they suffer from inaccurate dependency modeling and unconstrained generation, where LMs fail to capture precise data flow relations and produce slices containing hallucinated tokens and statements. To address these challenges, we propose Sliceformer, a novel approach that reformulates static program slicing as a sequence-to-sequence task using small language models such as CodeT5+. Sliceformer introduces two key innovations that directly target the identified limitations. First, to improve dependency modeling, we design dataflow-aware pretraining objectives that leverage data flow graphs (DFG) to teach models data dependencies through dataflow-preserving statement permutation and dataflow-aware span corruption. Second, to eliminate hallucination, we develop a constrained decoding mechanism that enforces both lexical and syntactic constraints. We evaluate Sliceformer on Java and Python program slicing benchmarks, demonstrating consistent improvements over state-of-the-art baselines with up to 22% gain in ExactMatch.

2026-04-09T19:46:06Z Accepted at ACL 2026 Pengfei He Shaowei Wang Tse-Hsun Chen Muhammad Asaduzzaman http://arxiv.org/abs/2605.09491v1 Categorical Message Passing Language (CaMPL) for programmers 2026-05-10T11:55:30Z

Categorical Message Passing Language (CaMPL) is a functional-style concurrent programming language whose semantics is in category theory, more specifically, linear actegories. Its core programming feature is message passing along typed communication channels between concurrent processes. CaMPL also supports controlled non-determinism via 'races' which allow processes to adapt dynamically while they are running, higher-order processes which pass other processes as messages, and custom channel datatypes called protocols and coprotocols which allow one to define infinite channel types or implement session types. The type system of CaMPL arises from a Curry-Howard-Lambek-like correspondence for concurrent programming, established by Cockett and Pastro in their paper titled "The logic of message passing". This type system ensures that a formal CaMPL program, i.e., one which does not allow general recursion, will never become deadlocked or livelocked. In this article, we explore the type system of CaMPL, custom channel types, and controlled non-determinism using code examples after briefly introducing its mathematical underpinnings.

2026-05-10T11:55:30Z 14 pages Daniel Kiyoshi Hashimoto Alexanna Little Berg Priyaa Varshinee Srinivasan http://arxiv.org/abs/2605.09411v1 Persistent Amortised Analysis, Operationally 2026-05-10T08:24:43Z

Amortised analysis is a technique for proving a combined time bound for a batch of operations on a data structure, even if some of those operations are expensive. But the traditional method of amortised analysis yields incorrect time bounds when the data structure is used persistently. Persistence allows operations to be performed on previous versions of the data structure, which prevents us from amortising expensive restructuring work. In his seminal book, Chris Okasaki showed how to extend amortised analysis to persistent usage. His method works by extending the data structure with thunks and performing the analysis with debits rather than credits. His argument, that credits are unsound for analysing persistent usage, has become folklore. In this paper, we provide a new perspective on the role of debits in Okasaki's work. First, we set up an operational semantics of call-by-value lambda calculus with thunks, and show formally that traditional amortised analysis does not work in a persistent setting. Then we show that, contrary to the folklore, credit-based amortised analysis can be sound in a persistent setting as long as credits are only stored on thunks. Finally, we provide a formal semantics for Okasaki's debit-based approach. Our paper clarifies the formal foundation of Okasaki's work and makes it accessible to a wider audience.

2026-05-10T08:24:43Z To be published in ICALP 2026 Anton Lorenzen http://arxiv.org/abs/2505.02184v3 Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes 2026-05-09T19:15:29Z

Large language models (LLMs) are increasingly used for generating parallel scientific codes, with a primary focus on generating functionally correct code. Recent work has focused on generating performant code, with an emphasis on its execution time. However, energy efficiency is now recognized as a critical objective, given the significant power demands of large-scale compute systems. This paper addresses the research question of whether LLMs can generate energy-efficient parallel scientific codes when guided by empirical execution feedback. To answer this question, we propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel codes through a multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent for screening generated code. We evaluate LASSI-EE using twenty-two representative scientific benchmarks and applications on NVIDIA A100 and AMD MI100 GPUs. The results indicate an average energy reduction of 36% for MI100 and 34% for A100, across trials that produced passing energy-reducing refactorings.

2025-05-04T17:05:34Z 12 pages, 5 figures, version under review at a peer-reviewed conference Matthew T. Dearing Yiheng Tao Xingfu Wu Zhiling Lan Valerie Taylor http://arxiv.org/abs/2605.08927v1 Quantitative Comparison of Credible Compilation and Verification In Coding Agent Compiler Development 2026-05-09T12:55:54Z

Formal program verification is a longstanding goal in the field. We present the first quantitative comparison of the two primary compiler verification approaches, credible compilation/translation validation and full verification. Working with the first verified compiler developed by a coding agent (operating under human supervision), we present quantitative results from a coding agent implementing several optimizations using these two approaches. The results indicate that 1) verification requires roughly an order of magnitude more development effort than credible compilation, 2) to enhance provability, the coding agent chooses less efficient algorithms and data structures for verified optimizations, and 3) in an attempt to minimize proof effort the coding agent repeatedly implemented optimization scope reductions for verified optimizations, and 4) certificate checking time dominates optimization and certificate generation time for the considered optimizations. Because of the increased proof overhead, verified optimizations required substantially more supervision and coding sessions than credible compilation optimizations. Given the capabilities of a modern coding agent working in this context, implementation efforts for both credible compilation and verified versions remained feasible for the considered optimizations (unreachable code elimination, dead assignment elimination, and constant propagation/folding).

2026-05-09T12:55:54Z Martin Rinard http://arxiv.org/abs/2605.01660v2 Testing, Credible Compilation, and Verification in the Axon Verified Compiler in Lean and Claude Code 2026-05-09T07:54:04Z

This paper presents the use of testing, credible compilation/translation validation, verification, and audits in the Axon compiler. Axon comes with fully machine checked proofs that guarantee the correctness of the generated code. All code and proofs were written in Lean by Claude Code, with the correctness proofs eliminating any need to audit or examine any verified code. I present a development process for using these validation techniques, evaluate the use of this process during the development of the compiler, and discuss implications for other development efforts.

2026-05-03T00:55:29Z Martin Rinard http://arxiv.org/abs/2605.08369v1 First-Class Refinement Types for Scala 2026-05-08T18:23:48Z

Refinement types -- types qualified with logical predicates -- have proven effective for lightweight verification in languages like Liquid Haskell, F*, and Dafny. However, in these systems refinements are either written in a separate specification language or treated as second-class annotations, disconnected from the host language's type system. This disconnect creates usability barriers: programmers must maintain two mental models, and refinements cannot interact with features like type inference, subtyping, or overloading. We present the design of first-class refinement types for Scala 3, where refinements are ordinary types that participate in subtyping, inference, and pattern matching alongside existing language features. We prove type soundness of a core calculus mechanized in Rocq, combining dependent function types, bounded polymorphism, positive equi-recursive types, union and intersection types, and refinement types under a partial-correctness semantics using a fuel-bounded definitional interpreter and semantic typing. Finally, we implement our design as a prototype extension of the Scala 3 compiler with a lightweight e-graph-based solver for predicate entailment.

2026-05-08T18:23:48Z 27 pages, 12 figures, including appendix Matt Bovel Viktor Kunčak Martin Odersky http://arxiv.org/abs/2510.13725v3 A Complementary Approach to Incorrectness Typing 2026-05-08T13:58:21Z

We introduce a new two-sided type system for verifying the correctness and incorrectness of functional programs with atoms and pattern matching. A key idea in the work is that types should range over sets of normal forms, rather than sets of values, and this allows us to define a complement operator on types that acts as a negation on typing formulas. We show that the complement allows us to derive a wide range of refutation principles within the system, including the type-theoretic analogue of co-implication, and we use them to certify that a number of Erlang-like programs go wrong. An expressive axiomatisation of the complement operator via subtyping is shown decidable, and the type system as a whole is shown to be not only sound, but also complete for normal forms.

2025-10-15T16:29:26Z The main text of this paper was published at POPL'26. This version of the appendices includes an essential change to the statement of Inversion (and its proof) after a flaw was identified by Tomos Sherlock: the new Inversion theorem only applies when the main subject is a closed term PACMPL, Volume 10, Issue POPL. 2026. Article No.: 82, Pages 2380 - 2408 Celia Mengyue Li Sophie Pull Steven Ramsay 10.1145/3776724 http://arxiv.org/abs/2512.00164v2 Faster Verified Explanations for Neural Networks 2026-05-08T11:21:01Z

Verified explanations are a principled way to explain the decisions taken by neural networks, which are otherwise black-box in nature. However, these techniques face significant scalability challenges, as they require multiple calls to neural network verifiers, each of them with an exponential worst-case complexity. We present FaVeX, a novel algorithm to compute verified explanations. FaVeX accelerates the computation by dynamically combining batch and sequential processing of input features, and by reusing information from previous queries, both when proving invariances with respect to certain input features, and when searching for feature assignments altering the prediction. Furthermore, we present a novel and hierarchical definition of verified explanations, termed verifieroptimal robust explanations, that explicitly factors the incompleteness of network verifiers within the explanation. Our comprehensive experimental evaluation demonstrates the superior scalability of both FaVeX, and of verifier-optimal robust explanations, which together can produce meaningful formal explanation on networks with hundreds of thousands of non-linear activations.

2025-11-28T19:05:39Z ECOOP 2026 Alessandro De Palma Greta Dolcetti Caterina Urban http://arxiv.org/abs/2605.06972v1 A New Interaction Concept for Interactive and Autoactive Program Verification 2026-05-07T21:42:42Z

Fully functional program verification is an undecidable$\unicode{x2014}$and, hence, inherently difficult$\unicode{x2014}$task, that is not automatically solvable but typically requires user interaction and guidance. Existing verifiers either work autoactively, requiring the user to write annotations in source code, without the possibility to inspect the proof state or intervene in case of an unsuccessful attempt, or allow interactions on a logical encoding that is on a lower level than the user-provided specifications. We present a novel interaction concept which allows the user to inspect and interact with the proof state on source code and specification level. This minimizes the mental gap between the representations. We provide an implementation of the concept as a plugin for the Java verification engine KeY, and show with a user study that this prototype can be beneficial for users to understand the proof state and find defects in source code or specifications.

2026-05-07T21:42:42Z 13 pages, 10 figures; Manuscript accepted at FTfJP'26 Wolfram Pfeifer Mattias Ulbrich Daniel Drodt http://arxiv.org/abs/2604.15533v2 Verification Modulo Tested Library Contracts 2026-05-07T19:38:31Z

We consider the problem of verification modulo tested library contracts as a step towards automating the verification of client programs that use complex libraries. We formulate this problem as the synthesis of modular contracts for the library methods used by the client that are adequate to prove the client correct, and that also pass the scrutiny of a testing engine that tests the library against these contracts. We also consider a new form of method contracts called contextual contracts that arise in this setting that hold in the context of the client program, and can often be simpler and easier to infer than classical modular contracts. We provide a counterexample-guided learning framework to solve this problem, in which the synthesizer interacts with a constraint solver as well as the testing engine in order to infer adequate modular/contextual method contracts and inductive invariants for the client. The main synthesis engines we use are generalizing CHC solvers that are realized using ICE learning algorithms. We realize this framework in a tool called DUALIS and show its efficacy on benchmarks where clients call large libraries.

2026-04-16T21:27:34Z Removed LaTeX formatting from abstract text Abhishek Uppar Omar Muhammad Sumanth Prabhu Deepak D'Souza Madhusudan P Adithya Murali http://arxiv.org/abs/2605.06803v1 Bounding Fixed Points of Non-Monotone Processes: Theory to Practice 2026-05-07T18:06:02Z

Many modern solvers and program analyzers rely on non-monotone reasoning (e.g. negation-as-failure, speculative updates, backtracking) for which classical monotone fixed-point methods do not apply. The general problem of finding the fixed points of these processes is a difficult one. For this reason, there have been theoretical efforts in existing Approximation Fixpoint Theory (AFT) from the domain of logic programming to approximate fixed points of non-monotone operators. Tight approximations of these fixed points are highly useful for accelerating non-monotonic computations by restricting the search space. In practice, however, even the best approximations obtained through AFT can be coarse and computationally expensive. We aim to address both issues to make AFT approximation methods practical for use in programming languages (PL) settings. To mitigate inefficiency, we prove the soundness of an abstract interpretation for approximating operators. To improve upon coarse approximations, we carefully introduce controlled unsoundness to design an effective yet practical algorithm for partitioning and tightening AFT's best approximations. This algorithm is sound, anytime, and guarantees termination on finite-height lattices. We further present a modification that ensures polynomial-time complexity. We instantiate these methods in two settings: (1) answer set programming, where it serves as a convergence-accelerating pre-processor, and (2) speculative program analysis, where it reduces rollback while preserving soundness. In both settings, we focus on implementation-level details to demonstrate the practical applicability of our methods.

2026-05-07T18:06:02Z Abdullah H. Rasheed Vijay K. Garg http://arxiv.org/abs/2603.00991v2 Tracking Capabilities for Safer Agents 2026-05-07T14:09:04Z

AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.

2026-03-01T08:39:37Z Martin Odersky Yaoyu Zhao Yichen Xu Oliver Bračevac Cao Nguyen Pham 10.1145/3786335.3813127 http://arxiv.org/abs/2605.08247v1 LLM Translation of Compiler Intermediate Representation 2026-05-07T13:22:23Z

GCC and LLVM underpin much of modern software infrastructure, relying on distinct Intermediate Representations (IRs) to drive optimizations and code generation. However, the semantic and structural differences between these IRs create significant barriers for cross-toolchain interaction, limiting the reuse of compiler frontends, backends, and optimization pipelines across programming languages and compilation ecosystems. Traditional rule-based translators have attempted to bridge this gap, but their complexity and maintenance cost have hindered practical adoption. In this context, Large Language Models (LLMs) appear to be an emerging technology that offers a data-driven alternative, capable of learning complex mappings between heterogeneous compiler IRs directly from sufficiently representative examples. To explore this approach, this paper presents IRIS-14B, a 14-billion-parameter transformer model fine-tuned to translate GIMPLE (as emitted by GCC) to LLVM IR (as emitted by LLVM). The model is trained on paired IRs extracted from C sources and evaluated on the GIMPLE-to-LLVM IR transformation applied to IRs derived from real-world C code and competitive programming problems. To the best of our knowledge, IRIS-14B is the first model trained explicitly for IR-to-IR translation. It outperforms the accuracy of widely used models, including the largest state-of-the-art open models available today, ranging from 13 to 1,000 billion parameters, by up to 44 percentage points. The proposed transformation supports the integration of LLMs as complementary components within hybrid neuro-symbolic compiler architectures, where models such as IRIS-14B act as interoperability layers enabling cross-toolchain workflows without modifying existing compiler passes, while traditional compiler infrastructure continues to perform deterministic compilation and optimization.

2026-05-07T13:22:23Z Andrea Valenzuela Ramirez Cristian Gutierrez-Gomez Marta Barroso Dario Garcia-Gasulla Sara Royuela