https://arxiv.org/api//JsUFhm72pxUQ2sZCN8IlIvD/rc 2026-06-26T13:40:04Z 9951 1275 15 http://arxiv.org/abs/2508.15576v2 Compositional Symbolic Execution for the Next 700 Memory Models (Extended Version) 2025-08-27T09:41:14Z

Multiple successful compositional symbolic execution (CSE) tools and platforms exploit separation logic (SL) for compositional verification and/or incorrectness separation logic (ISL) for compositional bug-finding, including VeriFast, Viper, Gillian, CN, and Infer-Pulse. Previous work on the Gillian platform, the only CSE platform that is parametric on the memory model, meaning that it can be instantiated to different memory models, suggests that the ability to use custom memory models allows for more flexibility in supporting analysis of a wide range of programming languages, for implementing custom automation, and for improving performance. However, the literature lacks a satisfactory formal foundation for memory-model-parametric CSE platforms. In this paper, inspired by Gillian, we provide a new formal foundation for memory-model-parametric CSE platforms. Our foundation advances the state of the art in four ways. First, we mechanise our foundation (in the interactive theorem prover Rocq). Second, we validate our foundation by instantiating it to a broad range of memory models, including models for C and CHERI. Third, whereas previous memory-model-parametric work has only covered SL analyses, we cover both SL and ISL analyses. Fourth, our foundation is based on standard definitions of SL and ISL (including definitions of function specification validity, to ensure sound interoperation with other tools and platforms also based on standard definitions).

2025-08-21T13:49:58Z OOPSLA 2025 Andreas Lööw Seung Hoon Park Daniele Nantes-Sobrinho Sacha-Élie Ayoun Opale Sjöstedt Philippa Gardner 10.1145/3763151 http://arxiv.org/abs/2508.19558v1 Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking 2025-08-27T04:17:02Z

Embedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that text embeddings from large language models (LLMs) capture rich semantic information, but their ability to reflect code-level functional semantics remains unclear. Existing studies largely focus on code clone detection, which emphasizes syntactic similarity and overlooks functional understanding. In this paper, we focus on the functional consistency of LLM code embeddings, which determines if two code snippets perform the same function regardless of syntactic differences. We propose a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse and challenging benchmarks. Specifically, we define code examples across four semantic and syntactic categories and find that existing datasets predominantly capture syntactic properties. Our framework generates four unique variations from a single code instance, providing a broader spectrum of code examples that better reflect functional differences. Extensive experiments on three downstream tasks-code clone detection, code functional consistency identification, and code retrieval-demonstrate that embedding models significantly improve their performance when trained on our evolved datasets. These results highlight the effectiveness and generalization of our data synthesis framework, advancing the functional understanding of code.

2025-08-27T04:17:02Z Zhuohao Li Wenqing Chen Jianxing Yu Zhichao Lu http://arxiv.org/abs/2508.19074v1 An LLM-powered Natural-to-Robotic Language Translation Framework with Correctness Guarantees 2025-08-26T14:32:49Z

The Large Language Models (LLM) are increasingly being deployed in robotics to generate robot control programs for specific user tasks, enabling embodied intelligence. Existing methods primarily focus on LLM training and prompt design that utilize LLMs to generate executable programs directly from user tasks in natural language. However, due to the inconsistency of the LLMs and the high complexity of the tasks, such best-effort approaches often lead to tremendous programming errors in the generated code, which significantly undermines the effectiveness especially when the light-weight LLMs are applied. This paper introduces a natural-robotic language translation framework that (i) provides correctness verification for generated control programs and (ii) enhances the performance of LLMs in program generation via feedback-based fine-tuning for the programs. To achieve this, a Robot Skill Language (RSL) is proposed to abstract away from the intricate details of the control programs, bridging the natural language tasks with the underlying robot skills. Then, the RSL compiler and debugger are constructed to verify RSL programs generated by the LLM and provide error feedback to the LLM for refining the outputs until being verified by the compiler. This provides correctness guarantees for the LLM-generated programs before being offloaded to the robots for execution, significantly enhancing the effectiveness of LLM-powered robotic applications. Experiments demonstrate NRTrans outperforms the existing method under a range of LLMs and tasks, and achieves a high success rate for light-weight LLMs.

2025-08-26T14:32:49Z ZhenDong Chen ZhanShang Nie ShiXing Wan JunYi Li YongTian Cheng Shuai Zhao http://arxiv.org/abs/2508.14851v3 Correct Black-Box Monitors for Distributed Deadlock Detection: Formalisation and Implementation (Technical Report) 2025-08-26T10:45:47Z

Many software applications rely on concurrent and distributed (micro)services that interact via message-passing and various forms of remote procedure calls (RPC). As these systems organically evolve and grow in scale and complexity, the risk of introducing deadlocks increases and their impact may worsen: even if only a few services deadlock, many other services may block while awaiting responses from the deadlocked ones. As a result, the "core" of the deadlock can be obfuscated by its consequences on the rest of the system, and diagnosing and fixing the problem can be challenging. In this work we tackle the challenge by proposing distributed black-box monitors that are deployed alongside each service and detect deadlocks by only observing the incoming and outgoing messages, and exchanging probes with other monitors. We present a formal model that captures popular RPC-based application styles (e.g., gen_servers in Erlang/OTP), and a distributed black-box monitoring algorithm that we prove sound and complete (i.e., identifies deadlocked services with neither false positives nor false negatives). We implement our results in a tool called DDMon for the monitoring of Erlang/OTP applications, and we evaluate its performance. This is the first work that formalises, proves the correctness, and implements distributed black-box monitors for deadlock detection. Our results are mechanised in Coq. DDMon is the companion artifact of this paper.

2025-08-20T17:03:18Z OOPSLA 2025 Radosław Jan Rowicki Adrian Francalanza Alceste Scalas 10.1145/3763069 http://arxiv.org/abs/2508.18587v1 A Case Study on the Effectiveness of LLMs in Verification with Proof Assistants 2025-08-26T01:39:50Z

Large language models (LLMs) can potentially help with verification using proof assistants by automating proofs. However, it is unclear how effective LLMs are in this task. In this paper, we perform a case study based on two mature Rocq projects: the hs-to-coq tool and Verdi. We evaluate the effectiveness of LLMs in generating proofs by both quantitative and qualitative analysis. Our study finds that: (1) external dependencies and context in the same source file can significantly help proof generation; (2) LLMs perform great on small proofs but can also generate large proofs; (3) LLMs perform differently on different verification projects; and (4) LLMs can generate concise and smart proofs, apply classical techniques to new definitions, but can also make odd mistakes.

2025-08-26T01:39:50Z Accepted by LMPL 2025 Barış Bayazıt Yao Li Xujie Si 10.1145/3759425.3763391 http://arxiv.org/abs/2508.16063v2 Synthesizing DSLs for Few-Shot Learning 2025-08-25T21:41:25Z

We study the problem of synthesizing domain-specific languages (DSLs) for few-shot learning in symbolic domains. Given a base language and instances of few-shot learning problems, where each instance is split into training and testing samples, the DSL synthesis problem asks for a grammar over the base language that guarantees that small expressions solving training samples also solve corresponding testing samples. We prove that the problem is decidable for a class of languages whose semantics over fixed structures can be evaluated by tree automata and when expression size corresponds to parse tree depth in the grammar, and, furthermore, the grammars solving the problem correspond to a regular set of trees. We also prove decidability results for variants of the problem where DSLs are only required to express solutions for input learning problems and where DSLs are defined using macro grammars.

2025-08-22T03:30:42Z Paul Krogmeier P. Madhusudan 10.1145/3763073 http://arxiv.org/abs/2503.21691v5 Place Capability Graphs: A General-Purpose Model of Rust's Ownership and Borrowing Guarantees 2025-08-25T20:21:19Z

Rust's novel type system has proved an attractive target for verification and program analysis tools, due to the rich guarantees it provides for controlling aliasing and mutability. However, fully understanding, extracting and exploiting these guarantees is subtle and challenging: existing models for Rust's type checking either support a smaller idealised language disconnected from real-world Rust code, or come with severe limitations in terms of precise modelling of Rust borrows, composite types storing them, function signatures and loops. In this paper, we present a novel model of Rust's type-checking called Place Capability Graphs, which lifts these limitations, and which can be directly calculated from the Rust compiler's own programmatic representations and analyses. We demonstrate that our model supports over 97% of Rust functions in the most popular public crates, and show its suitability as a general-purpose basis for verification and program analysis tools by developing promising new prototype versions of the existing Flowistry and Prusti tools.

2025-03-27T16:55:41Z Zachary Grannan Aurel Bílý Jonáš Fiala Jasper Geer Markus de Medeiros Peter Müller Alexander J. Summers http://arxiv.org/abs/2506.14485v3 Optimized Execution of FreeCHR 2025-08-25T13:19:57Z

Constraint Handling Rules (CHR) is a rule-based programming language that rewrites collections of constraints. It is typically embedded into a general-purpose language. There exists a plethora of implementation for numerous host languages. However, the existing implementations often re-invent the method of embedding, which impedes maintenance and weakens assertions of correctness. To formalize and thereby standardize the embedding of a ground subset of CHR into arbitrary host languages, we introduced the framework FreeCHR and proved it to be a valid representation of classical CHR. For the sake of simplicity, abstract implementations of our framework did not yet include a concrete matching algorithm nor optimizations. In this paper, we introduce an improved execution and matching algorithm for FreeCHR. We also provide empirical evaluation of the algorithm.

2025-06-17T13:07:30Z This is a preprint of a paper submitted to the 39th Workshop on (Constraint and Functional) Logic Programming (WLP 2025); minor changes throughout the paper, additional explanations in sections 3.3 to Def 1; re-arranged section 3, added appendix Sascha Rechenberger Thom Frühwirth http://arxiv.org/abs/2410.04581v5 Efficient Decrease-and-Conquer Linearizability Monitoring 2025-08-25T10:38:06Z

Linearizability has become the de facto correctness specification for implementations of concurrent data structures. While formally verifying such implementations remains challenging, linearizability monitoring has emerged as a promising first step to rule out early problems in the development of custom implementations, and serves as a key component in approaches that stress test such implementations. In this work, we investigate linearizability monitoring -- check if an execution history of an implementation is linearizable. While this problem is intractable in general, a systematic understanding of when it becomes tractable has remained elusive. We revisit this problem and first present a unified `decrease-and-conquer' algorithmic framework for linearizability monitoring. At its heart, this framework asks to identify special linearizability-preserving values in a given history -- values whose presence yields an equilinearizable sub-history when removed, and whose absence indicates non-linearizability. We prove that a polynomial time algorithm for the problem of identifying linearizability-preserving values, yields a polynomial time algorithm for linearizability monitoring, while conversely, intractability of this problem implies intractability of the monitoring problem. We demonstrate our framework's effectiveness by instantiating it for several popular data types -- sets, stacks, queues and priority queues -- deriving polynomial time algorithms for each, with the unambiguity restriction, where each insertion to the underlying data structure adds a distinct value. We optimize these algorithms to achieve the optimal log-linear time complexity by amortizing the cost of solving sub-problems through efficient data structures. Our implementation and evaluation on publicly available implementations show that our approach scales to large histories and outperforms existing tools.

2024-10-06T18:35:51Z Lee Zheng Han Umang Mathur http://arxiv.org/abs/2508.17568v1 MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation 2025-08-25T00:36:07Z

Metamaterials are micro-architected structures whose geometry imparts highly tunable-often counter-intuitive-bulk properties. Yet their design is difficult because of geometric complexity and a non-trivial mapping from architecture to behaviour. We address these challenges with three complementary contributions. (i) MetaDSL: a compact, semantically rich domain-specific language that captures diverse metamaterial designs in a form that is both human-readable and machine-parsable. (ii) MetaDB: a curated repository of more than 150,000 parameterized MetaDSL programs together with their derivatives-three-dimensional geometry, multi-view renderings, and simulated elastic properties. (iii) MetaBench: benchmark suites that test three core capabilities of vision-language metamaterial assistants-structure reconstruction, property-driven inverse design, and performance prediction. We establish baselines by fine-tuning state-of-the-art vision-language models and deploy an omni-model within an interactive, CAD-like interface. Case studies show that our framework provides a strong first step toward integrated design and understanding of structure-representation-property relationships.

2025-08-25T00:36:07Z Liane Makatura Benjamin Jones Siyuan Bian Wojciech Matusik http://arxiv.org/abs/2508.17344v1 Who Wins the Race? (R Vs Python) - An Exploratory Study on Energy Consumption of Machine Learning Algorithms 2025-08-24T12:59:47Z

The utilization of Machine Learning (ML) in contemporary software systems is extensive and continually expanding. However, its usage is energy-intensive, contributing to increased carbon emissions and demanding significant resources. While numerous studies examine the performance and accuracy of ML, only a limited few focus on its environmental aspects, particularly energy consumption. In addition, despite emerging efforts to compare energy consumption across various programming languages for specific algorithms and tasks, there remains a gap specifically in comparing these languages for ML-based tasks. This paper aims to raise awareness of the energy costs associated with employing different programming languages for ML model training and inference. Through this empirical study, we measure and compare the energy consumption along with run-time performance of five regression and five classification tasks implemented in Python and R, the two most popular programming languages in this context. Our study results reveal a statistically significant difference in costs between the two languages in 95% of the cases examined. Furthermore, our analysis demonstrates that the choice of programming language can influence energy efficiency significantly, up to 99.16% during model training and up to 99.8% during inferences, for a given ML task.

2025-08-24T12:59:47Z 18 pages including references, 5 figures Rajrupa Chattaraj Sridhar Chimalakonda Vibhu Saujanya Sharma Vikrant Kaulgud http://arxiv.org/abs/2508.17190v1 Borrowing Dirty Qubits in Quantum Programs 2025-08-24T02:35:45Z

Dirty qubits are ancillary qubits that can be borrowed from idle parts of a computation, enabling qubit reuse and reducing the demand for fresh, clean qubits-a resource that is typically scarce in practice. For such reuse to be valid, the initial states of the dirty qubits must not affect the functionality of the quantum circuits in which they are employed. Moreover, their original states, including any entanglement they possess, must be fully restored after use-a requirement commonly known as safe uncomputation. In this paper, we formally define the semantics of dirty-qubit borrowing as a feature in quantum programming languages, and introduce a notion of safe uncomputation for dirty qubits in quantum programs. We also present an efficient algorithm, along with experimental results, for verifying safe uncomputation of dirty qubits in certain quantum circuits.

2025-08-24T02:35:45Z Bonan Su Li Zhou Yuan Feng Mingsheng Ying http://arxiv.org/abs/2504.10800v2 Products of Recursive Programs for Hypersafety Verification (Extended Version) 2025-08-23T09:05:51Z

We study the problem of automated hypersafety verification of infinite-state recursive programs. We propose an infinite class of product programs, specifically designed with recursion in mind, that reduce the hypersafety verification of a recursive program to standard safety verification. For this, we combine insights from language theory and concurrency theory to propose an algorithmic solution for constructing an infinite class of recursive product programs. One key insight is that, using the simple theory of visibly pushdown languages, one can maintain the recursive structure of syntactic program alignments which is vital to constructing a new product program that can be viewed as a classic recursive program -- that is, one that can be executed on a single stack. Another key insight is that techniques from concurrency theory can be generalized to help define product programs based on the view that the parallel composition of individual recursive programs includes all possible alignments from which a sound set of alignments that faithfully preserve the satisfaction of the hypersafety property can be selected. On the practical side, we formulate a family of parametric canonical product constructions that are intuitive to programmers and can be used as building blocks to specify recursive product programs for the purpose of relational and hypersafety verification, with the idea that the right product program can be verified automatically using existing techniques. We demonstrate the effectiveness of these techniques through an implementation and highly promising experimental results.

2025-04-15T01:52:50Z 26 pages of main text (7 figures) and 45 pages in total; extended version of the paper "Products of Recursive Programs for Hypersafety Verification" accepted at OOPSLA 2025 Ruotong Cheng Azadeh Farzan http://arxiv.org/abs/2503.24036v2 Automated Discovery of Tactic Libraries for Interactive Theorem Proving 2025-08-22T19:10:25Z

Enabling more concise and modular proofs is essential for advancing formal reasoning using interactive theorem provers (ITPs). Since many ITPs, such as Rocq and Lean, use tactic-style proofs, learning higher-level custom tactics is crucial for proof modularity and automation. This paper presents a novel approach to tactic discovery, which leverages Tactic Dependence Graphs (TDGs) to identify reusable proof strategies across multiple proofs. TDGs capture logical dependencies between tactic applications while abstracting away irrelevant syntactic details, allowing for both the discovery of new tactics and the refactoring of existing proofs into more modular forms. We have implemented this technique in a tool called TacMiner and compare it against an anti-unification-based approach Peano to tactic discovery. Our evaluation demonstrates that TacMiner can learn 3x as many tactics as Peano and reduces the size of proofs by 26% across all benchmarks. Furthermore, our evaluation demonstrates the benefits of learning custom tactics for proof automation, allowing a state-of-the-art proof automation tool to achieve a relative increase of 172% in terms of success rate.

2025-03-31T13:00:29Z Yutong Xin Jimmy Xin Gabriel Poesia Noah Goodman Qiaochu Chen Isil Dillig http://arxiv.org/abs/2508.16746v1 SafeTree: Expressive Tree Policies for Microservices 2025-08-22T18:57:16Z

A microservice-based application is composed of multiple self-contained components called microservices, and controlling inter-service communication is important for enforcing safety properties. Presently, inter-service communication is configured using microservice deployment tools. However, such tools only support a limited class of single-hop policies, which can be overly permissive because they ignore the rich service tree structure of microservice calls. Policies that can express the service tree structure can offer development and security teams more fine-grained control over communication patterns. To this end, we design an expressive policy language to specify service tree structures, and we develop a visibly pushdown automata-based dynamic enforcement mechanism to enforce service tree policies. Our technique is non-invasive: it does not require any changes to service implementations, and does not require access to microservice code. To realize our method, we build a runtime monitor on top of a service mesh, an emerging network infrastructure layer that can control inter-service communication during deployment. In particular, we employ the programmable network traffic filtering capabilities of Istio, a popular service mesh implementation, to implement an online and distributed monitor. Our experiments show that our monitor can enforce rich safety properties while adding minimal latency overhead on the order of milliseconds.

2025-08-22T18:57:16Z Karuna Grewal P. Brighten Godfrey Justin Hsu