https://arxiv.org/api//JsUFhm72pxUQ2sZCN8IlIvD/rc2026-06-26T13:40:04Z9951127515http://arxiv.org/abs/2508.15576v2Compositional Symbolic Execution for the Next 700 Memory Models (Extended Version)2025-08-27T09:41:14ZMultiple successful compositional symbolic execution (CSE) tools and platforms exploit separation logic (SL) for compositional verification and/or incorrectness separation logic (ISL) for compositional bug-finding, including VeriFast, Viper, Gillian, CN, and Infer-Pulse. Previous work on the Gillian platform, the only CSE platform that is parametric on the memory model, meaning that it can be instantiated to different memory models, suggests that the ability to use custom memory models allows for more flexibility in supporting analysis of a wide range of programming languages, for implementing custom automation, and for improving performance. However, the literature lacks a satisfactory formal foundation for memory-model-parametric CSE platforms.
In this paper, inspired by Gillian, we provide a new formal foundation for memory-model-parametric CSE platforms. Our foundation advances the state of the art in four ways. First, we mechanise our foundation (in the interactive theorem prover Rocq). Second, we validate our foundation by instantiating it to a broad range of memory models, including models for C and CHERI. Third, whereas previous memory-model-parametric work has only covered SL analyses, we cover both SL and ISL analyses. Fourth, our foundation is based on standard definitions of SL and ISL (including definitions of function specification validity, to ensure sound interoperation with other tools and platforms also based on standard definitions).2025-08-21T13:49:58ZOOPSLA 2025Andreas LööwSeung Hoon ParkDaniele Nantes-SobrinhoSacha-Élie AyounOpale SjöstedtPhilippa Gardner10.1145/3763151http://arxiv.org/abs/2508.19558v1Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking2025-08-27T04:17:02ZEmbedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that text embeddings from large language models (LLMs) capture rich semantic information, but their ability to reflect code-level functional semantics remains unclear. Existing studies largely focus on code clone detection, which emphasizes syntactic similarity and overlooks functional understanding. In this paper, we focus on the functional consistency of LLM code embeddings, which determines if two code snippets perform the same function regardless of syntactic differences. We propose a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse and challenging benchmarks. Specifically, we define code examples across four semantic and syntactic categories and find that existing datasets predominantly capture syntactic properties. Our framework generates four unique variations from a single code instance, providing a broader spectrum of code examples that better reflect functional differences. Extensive experiments on three downstream tasks-code clone detection, code functional consistency identification, and code retrieval-demonstrate that embedding models significantly improve their performance when trained on our evolved datasets. These results highlight the effectiveness and generalization of our data synthesis framework, advancing the functional understanding of code.2025-08-27T04:17:02ZZhuohao LiWenqing ChenJianxing YuZhichao Luhttp://arxiv.org/abs/2508.19074v1An LLM-powered Natural-to-Robotic Language Translation Framework with Correctness Guarantees2025-08-26T14:32:49ZThe Large Language Models (LLM) are increasingly being deployed in robotics to generate robot control programs for specific user tasks, enabling embodied intelligence. Existing methods primarily focus on LLM training and prompt design that utilize LLMs to generate executable programs directly from user tasks in natural language. However, due to the inconsistency of the LLMs and the high complexity of the tasks, such best-effort approaches often lead to tremendous programming errors in the generated code, which significantly undermines the effectiveness especially when the light-weight LLMs are applied. This paper introduces a natural-robotic language translation framework that (i) provides correctness verification for generated control programs and (ii) enhances the performance of LLMs in program generation via feedback-based fine-tuning for the programs. To achieve this, a Robot Skill Language (RSL) is proposed to abstract away from the intricate details of the control programs, bridging the natural language tasks with the underlying robot skills. Then, the RSL compiler and debugger are constructed to verify RSL programs generated by the LLM and provide error feedback to the LLM for refining the outputs until being verified by the compiler. This provides correctness guarantees for the LLM-generated programs before being offloaded to the robots for execution, significantly enhancing the effectiveness of LLM-powered robotic applications. Experiments demonstrate NRTrans outperforms the existing method under a range of LLMs and tasks, and achieves a high success rate for light-weight LLMs.2025-08-26T14:32:49ZZhenDong ChenZhanShang NieShiXing WanJunYi LiYongTian ChengShuai Zhaohttp://arxiv.org/abs/2508.14851v3Correct Black-Box Monitors for Distributed Deadlock Detection: Formalisation and Implementation (Technical Report)2025-08-26T10:45:47ZMany software applications rely on concurrent and distributed (micro)services that interact via message-passing and various forms of remote procedure calls (RPC). As these systems organically evolve and grow in scale and complexity, the risk of introducing deadlocks increases and their impact may worsen: even if only a few services deadlock, many other services may block while awaiting responses from the deadlocked ones. As a result, the "core" of the deadlock can be obfuscated by its consequences on the rest of the system, and diagnosing and fixing the problem can be challenging.
In this work we tackle the challenge by proposing distributed black-box monitors that are deployed alongside each service and detect deadlocks by only observing the incoming and outgoing messages, and exchanging probes with other monitors. We present a formal model that captures popular RPC-based application styles (e.g., gen_servers in Erlang/OTP), and a distributed black-box monitoring algorithm that we prove sound and complete (i.e., identifies deadlocked services with neither false positives nor false negatives). We implement our results in a tool called DDMon for the monitoring of Erlang/OTP applications, and we evaluate its performance.
This is the first work that formalises, proves the correctness, and implements distributed black-box monitors for deadlock detection. Our results are mechanised in Coq. DDMon is the companion artifact of this paper.2025-08-20T17:03:18ZOOPSLA 2025Radosław Jan RowickiAdrian FrancalanzaAlceste Scalas10.1145/3763069http://arxiv.org/abs/2508.18587v1A Case Study on the Effectiveness of LLMs in Verification with Proof Assistants2025-08-26T01:39:50ZLarge language models (LLMs) can potentially help with verification using proof assistants by automating proofs. However, it is unclear how effective LLMs are in this task. In this paper, we perform a case study based on two mature Rocq projects: the hs-to-coq tool and Verdi. We evaluate the effectiveness of LLMs in generating proofs by both quantitative and qualitative analysis. Our study finds that: (1) external dependencies and context in the same source file can significantly help proof generation; (2) LLMs perform great on small proofs but can also generate large proofs; (3) LLMs perform differently on different verification projects; and (4) LLMs can generate concise and smart proofs, apply classical techniques to new definitions, but can also make odd mistakes.2025-08-26T01:39:50ZAccepted by LMPL 2025Barış BayazıtYao LiXujie Si10.1145/3759425.3763391http://arxiv.org/abs/2508.16063v2Synthesizing DSLs for Few-Shot Learning2025-08-25T21:41:25ZWe study the problem of synthesizing domain-specific languages (DSLs) for few-shot learning in symbolic domains. Given a base language and instances of few-shot learning problems, where each instance is split into training and testing samples, the DSL synthesis problem asks for a grammar over the base language that guarantees that small expressions solving training samples also solve corresponding testing samples. We prove that the problem is decidable for a class of languages whose semantics over fixed structures can be evaluated by tree automata and when expression size corresponds to parse tree depth in the grammar, and, furthermore, the grammars solving the problem correspond to a regular set of trees. We also prove decidability results for variants of the problem where DSLs are only required to express solutions for input learning problems and where DSLs are defined using macro grammars.2025-08-22T03:30:42ZPaul KrogmeierP. Madhusudan10.1145/3763073http://arxiv.org/abs/2503.21691v5Place Capability Graphs: A General-Purpose Model of Rust's Ownership and Borrowing Guarantees2025-08-25T20:21:19ZRust's novel type system has proved an attractive target for verification and program analysis tools, due to the rich guarantees it provides for controlling aliasing and mutability. However, fully understanding, extracting and exploiting these guarantees is subtle and challenging: existing models for Rust's type checking either support a smaller idealised language disconnected from real-world Rust code, or come with severe limitations in terms of precise modelling of Rust borrows, composite types storing them, function signatures and loops.
In this paper, we present a novel model of Rust's type-checking called Place Capability Graphs, which lifts these limitations, and which can be directly calculated from the Rust compiler's own programmatic representations and analyses. We demonstrate that our model supports over 97% of Rust functions in the most popular public crates, and show its suitability as a general-purpose basis for verification and program analysis tools by developing promising new prototype versions of the existing Flowistry and Prusti tools.2025-03-27T16:55:41ZZachary GrannanAurel BílýJonáš FialaJasper GeerMarkus de MedeirosPeter MüllerAlexander J. Summershttp://arxiv.org/abs/2506.14485v3Optimized Execution of FreeCHR2025-08-25T13:19:57ZConstraint Handling Rules (CHR) is a rule-based programming language that rewrites collections of constraints. It is typically embedded into a general-purpose language. There exists a plethora of implementation for numerous host languages. However, the existing implementations often re-invent the method of embedding, which impedes maintenance and weakens assertions of correctness. To formalize and thereby standardize the embedding of a ground subset of CHR into arbitrary host languages, we introduced the framework FreeCHR and proved it to be a valid representation of classical CHR. For the sake of simplicity, abstract implementations of our framework did not yet include a concrete matching algorithm nor optimizations. In this paper, we introduce an improved execution and matching algorithm for FreeCHR. We also provide empirical evaluation of the algorithm.2025-06-17T13:07:30ZThis is a preprint of a paper submitted to the 39th Workshop on (Constraint and Functional) Logic Programming (WLP 2025); minor changes throughout the paper, additional explanations in sections 3.3 to Def 1; re-arranged section 3, added appendixSascha RechenbergerThom Frühwirthhttp://arxiv.org/abs/2410.04581v5Efficient Decrease-and-Conquer Linearizability Monitoring2025-08-25T10:38:06ZLinearizability has become the de facto correctness specification for implementations of concurrent data structures. While formally verifying such implementations remains challenging, linearizability monitoring has emerged as a promising first step to rule out early problems in the development of custom implementations, and serves as a key component in approaches that stress test such implementations. In this work, we investigate linearizability monitoring -- check if an execution history of an implementation is linearizable. While this problem is intractable in general, a systematic understanding of when it becomes tractable has remained elusive. We revisit this problem and first present a unified `decrease-and-conquer' algorithmic framework for linearizability monitoring. At its heart, this framework asks to identify special linearizability-preserving values in a given history -- values whose presence yields an equilinearizable sub-history when removed, and whose absence indicates non-linearizability. We prove that a polynomial time algorithm for the problem of identifying linearizability-preserving values, yields a polynomial time algorithm for linearizability monitoring, while conversely, intractability of this problem implies intractability of the monitoring problem. We demonstrate our framework's effectiveness by instantiating it for several popular data types -- sets, stacks, queues and priority queues -- deriving polynomial time algorithms for each, with the unambiguity restriction, where each insertion to the underlying data structure adds a distinct value. We optimize these algorithms to achieve the optimal log-linear time complexity by amortizing the cost of solving sub-problems through efficient data structures. Our implementation and evaluation on publicly available implementations show that our approach scales to large histories and outperforms existing tools.2024-10-06T18:35:51ZLee Zheng HanUmang Mathurhttp://arxiv.org/abs/2508.17568v1MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation2025-08-25T00:36:07ZMetamaterials are micro-architected structures whose geometry imparts highly tunable-often counter-intuitive-bulk properties. Yet their design is difficult because of geometric complexity and a non-trivial mapping from architecture to behaviour. We address these challenges with three complementary contributions. (i) MetaDSL: a compact, semantically rich domain-specific language that captures diverse metamaterial designs in a form that is both human-readable and machine-parsable. (ii) MetaDB: a curated repository of more than 150,000 parameterized MetaDSL programs together with their derivatives-three-dimensional geometry, multi-view renderings, and simulated elastic properties. (iii) MetaBench: benchmark suites that test three core capabilities of vision-language metamaterial assistants-structure reconstruction, property-driven inverse design, and performance prediction. We establish baselines by fine-tuning state-of-the-art vision-language models and deploy an omni-model within an interactive, CAD-like interface. Case studies show that our framework provides a strong first step toward integrated design and understanding of structure-representation-property relationships.2025-08-25T00:36:07ZLiane MakaturaBenjamin JonesSiyuan BianWojciech Matusikhttp://arxiv.org/abs/2508.17344v1Who Wins the Race? (R Vs Python) - An Exploratory Study on Energy Consumption of Machine Learning Algorithms2025-08-24T12:59:47ZThe utilization of Machine Learning (ML) in contemporary software systems is extensive and continually expanding. However, its usage is energy-intensive, contributing to increased carbon emissions and demanding significant resources. While numerous studies examine the performance and accuracy of ML, only a limited few focus on its environmental aspects, particularly energy consumption. In addition, despite emerging efforts to compare energy consumption across various programming languages for specific algorithms and tasks, there remains a gap specifically in comparing these languages for ML-based tasks. This paper aims to raise awareness of the energy costs associated with employing different programming languages for ML model training and inference. Through this empirical study, we measure and compare the energy consumption along with run-time performance of five regression and five classification tasks implemented in Python and R, the two most popular programming languages in this context. Our study results reveal a statistically significant difference in costs between the two languages in 95% of the cases examined. Furthermore, our analysis demonstrates that the choice of programming language can influence energy efficiency significantly, up to 99.16% during model training and up to 99.8% during inferences, for a given ML task.2025-08-24T12:59:47Z18 pages including references, 5 figuresRajrupa ChattarajSridhar ChimalakondaVibhu Saujanya SharmaVikrant Kaulgudhttp://arxiv.org/abs/2508.17190v1Borrowing Dirty Qubits in Quantum Programs2025-08-24T02:35:45ZDirty qubits are ancillary qubits that can be borrowed from idle parts of a computation, enabling qubit reuse and reducing the demand for fresh, clean qubits-a resource that is typically scarce in practice. For such reuse to be valid, the initial states of the dirty qubits must not affect the functionality of the quantum circuits in which they are employed. Moreover, their original states, including any entanglement they possess, must be fully restored after use-a requirement commonly known as safe uncomputation. In this paper, we formally define the semantics of dirty-qubit borrowing as a feature in quantum programming languages, and introduce a notion of safe uncomputation for dirty qubits in quantum programs. We also present an efficient algorithm, along with experimental results, for verifying safe uncomputation of dirty qubits in certain quantum circuits.2025-08-24T02:35:45ZBonan SuLi ZhouYuan FengMingsheng Yinghttp://arxiv.org/abs/2504.10800v2Products of Recursive Programs for Hypersafety Verification (Extended Version)2025-08-23T09:05:51ZWe study the problem of automated hypersafety verification of infinite-state recursive programs. We propose an infinite class of product programs, specifically designed with recursion in mind, that reduce the hypersafety verification of a recursive program to standard safety verification. For this, we combine insights from language theory and concurrency theory to propose an algorithmic solution for constructing an infinite class of recursive product programs. One key insight is that, using the simple theory of visibly pushdown languages, one can maintain the recursive structure of syntactic program alignments which is vital to constructing a new product program that can be viewed as a classic recursive program -- that is, one that can be executed on a single stack. Another key insight is that techniques from concurrency theory can be generalized to help define product programs based on the view that the parallel composition of individual recursive programs includes all possible alignments from which a sound set of alignments that faithfully preserve the satisfaction of the hypersafety property can be selected. On the practical side, we formulate a family of parametric canonical product constructions that are intuitive to programmers and can be used as building blocks to specify recursive product programs for the purpose of relational and hypersafety verification, with the idea that the right product program can be verified automatically using existing techniques. We demonstrate the effectiveness of these techniques through an implementation and highly promising experimental results.2025-04-15T01:52:50Z26 pages of main text (7 figures) and 45 pages in total; extended version of the paper "Products of Recursive Programs for Hypersafety Verification" accepted at OOPSLA 2025Ruotong ChengAzadeh Farzanhttp://arxiv.org/abs/2503.24036v2Automated Discovery of Tactic Libraries for Interactive Theorem Proving2025-08-22T19:10:25ZEnabling more concise and modular proofs is essential for advancing formal reasoning using interactive theorem provers (ITPs). Since many ITPs, such as Rocq and Lean, use tactic-style proofs, learning higher-level custom tactics is crucial for proof modularity and automation. This paper presents a novel approach to tactic discovery, which leverages Tactic Dependence Graphs (TDGs) to identify reusable proof strategies across multiple proofs. TDGs capture logical dependencies between tactic applications while abstracting away irrelevant syntactic details, allowing for both the discovery of new tactics and the refactoring of existing proofs into more modular forms. We have implemented this technique in a tool called TacMiner and compare it against an anti-unification-based approach Peano to tactic discovery. Our evaluation demonstrates that TacMiner can learn 3x as many tactics as Peano and reduces the size of proofs by 26% across all benchmarks. Furthermore, our evaluation demonstrates the benefits of learning custom tactics for proof automation, allowing a state-of-the-art proof automation tool to achieve a relative increase of 172% in terms of success rate.2025-03-31T13:00:29ZYutong XinJimmy XinGabriel PoesiaNoah GoodmanQiaochu ChenIsil Dillighttp://arxiv.org/abs/2508.16746v1SafeTree: Expressive Tree Policies for Microservices2025-08-22T18:57:16ZA microservice-based application is composed of multiple self-contained components called microservices, and controlling inter-service communication is important for enforcing safety properties. Presently, inter-service communication is configured using microservice deployment tools. However, such tools only support a limited class of single-hop policies, which can be overly permissive because they ignore the rich service tree structure of microservice calls. Policies that can express the service tree structure can offer development and security teams more fine-grained control over communication patterns.
To this end, we design an expressive policy language to specify service tree structures, and we develop a visibly pushdown automata-based dynamic enforcement mechanism to enforce service tree policies. Our technique is non-invasive: it does not require any changes to service implementations, and does not require access to microservice code. To realize our method, we build a runtime monitor on top of a service mesh, an emerging network infrastructure layer that can control inter-service communication during deployment. In particular, we employ the programmable network traffic filtering capabilities of Istio, a popular service mesh implementation, to implement an online and distributed monitor. Our experiments show that our monitor can enforce rich safety properties while adding minimal latency overhead on the order of milliseconds.2025-08-22T18:57:16ZKaruna GrewalP. Brighten GodfreyJustin Hsu