https://arxiv.org/api/7ZWtDu3xK2CsQZwpf3zs5w2x4DY 2026-06-13T17:50:59Z 9885 105 15 http://arxiv.org/abs/2605.25180v1 DateSAT: A Framework for Solving Date and Period Constraints 2026-05-24T17:20:45Z Dates and calendar periods (i.e., days, months, years) appear frequently in tasks involving analysis of software, data, and documents. Prior research has shown that computer logic involving dates and calendrical calculations is error-prone due to tricky rules (e.g., irregularly sized months), ambiguities (e.g., scheduling one month from "Jan 31st"), and edge cases (e.g., leap years). However, existing program analysis and verification tools do not provide native support for dates, making it hard to reason about operations involving calendrical arithmetic symbolically. This paper presents DateSAT, the first framework for expressing and solving satisfiability constraints involving dates and calendar periods. The paper first formalizes an input language and the semantics of date and period arithmetic. The paper then presents five separate strategies for solving DateSAT constraints based on reductions to SMT formulas involving integers, which we have implemented using Z3 as a backend. We curate a dataset of 450 DateSAT constraints synthesized using LLM prompting, grammar-based sampling, and mining legal documents, and then present an empirical evaluation of DateSAT solver performance. 2026-05-24T17:20:45Z Leyi Cui Shrey Tiwari Rohan Padhye http://arxiv.org/abs/2510.08609v3 Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating? 2026-05-24T16:38:01Z Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. Pinning dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, floating can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate pinning dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since pinning is the tightest version constraint, pinning is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using pinning as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is floating-minor, with pinning being the next most common. We also find that floating-major is the least likely to result in outdated and floating-minor is the least likely to result in vulnerable dependencies. 2025-10-07T14:37:03Z Accepted to ASE 2025 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) Imranur Rahman Jill Marley William Enck Laurie Williams 10.1109/ASE63991.2025.00229 http://arxiv.org/abs/2602.22631v2 TorchLean: Formalizing Neural Networks in Lean 2026-05-24T04:59:07Z Neural networks are increasingly deployed in scientific, safety critical, and mission critical pipelines, yet verification and analysis are often performed outside the programming environment that defines and runs the model. This creates a semantic gap between the executed network and the analyzed artifact: guarantees can depend on implicit conventions about operator semantics, tensor layouts, preprocessing, floating-point behavior, graph transformations, accelerated kernels, and external certificates. We present TorchLean, a unified framework for formalizing, executing, and verifying neural networks in Lean 4. TorchLean treats learned models as executable programs and mathematical objects with a shared semantics for computation, verification, and theorem proving. The framework provides a PyTorch style API for typed tensors, layers, objectives, optimizers, automatic differentiation, and graph programs, with eager and compiled execution paths that lower to a common computation-graph representation. TorchLean supports exact and finite-precision tensor semantics, verified reverse-mode differentiation, interval and affine bound propagation, CROWN/LiRPA style certificate checking, import/export workflows, and CUDA-backed execution through explicit FFI boundaries. It also includes semantic layers for attention and FlashAttention, state-space sequence models, diffusion and sampling processes, probability kernels, reinforcement-learning objectives and Markov decision processes, and self-supervised objectives such as masked autoencoding, JEPA-style predictive views, and variance/correlation-based anti-collapse losses. Together, these components provide a semantic foundation for verified machine learning, where executable neural network artifacts, verification procedures, runtime boundaries, and mathematical claims can be stated and related inside one theorem-proving environment. 2026-02-26T05:11:44Z 55 pages Robert Joseph George Jennifer Cruden Will Adkisson Xiangru Zhong Huan Zhang Anima Anandkumar http://arxiv.org/abs/2509.05586v4 Fixed Parameter Tractable Linearizability Monitoring 2026-05-23T15:55:12Z We study the linearizability monitoring problem, which asks whether a given concurrent history of a data structure is equivalent to some sequential execution of the same data structure. In general, this problem is $\textsf{NP}$-hard, even for simple objects such as registers. Recent work has identified tractable cases for restricted classes of histories, notably unambiguous and differentiated histories. We revisit the tractability boundary from a fine-grained, parameterized perspective. We show that for a broad class of data structures -- including stacks, queues, priority queues, and maps -- linearizability monitoring is fixed-parameter tractable when parameterized by the number of processes. Concretely, we give an algorithm running in time $O(c^{k} \cdot \textsf{poly}(n))$, where $n$ is the history size, $k$ is the number of processes, and $c$ is a constant, yielding efficient performance when $k$ is small. Our approach reduces linearizability monitoring to a language reachability problem on graphs, which asks whether a labeled graph admits a path whose label sequence belongs to a fixed language $L$. We identify classes of languages that capture the sequential specifications of the above data structures and show that language reachability is efficiently solvable on the graph structures induced by concurrent histories. Our results complement prior hardness results and existing tractable subclasses, and provide a unified algorithmic framework. We implement our approach and demonstrate significant runtime improvements over existing algorithms, which exhibit exponential worst-case behavior. 2025-09-06T04:18:43Z To appear in proceedings of PLDI 2026 Lee Zheng Han Umang Mathur 10.1145/3808315 http://arxiv.org/abs/2604.11811v2 M$^\star$: Every Task Deserves Its Own Memory Harness 2026-05-23T08:43:43Z Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms. 2026-04-10T03:22:26Z Preprint. Code: https://github.com/wbopan/mstar ; Live demo: https://mstar.wenbo.io Wenbo Pan Shujie Liu Xiangyang Zhou Shiwei Zhang Wanlu Shi Mirror Xu Xiaohua Jia http://arxiv.org/abs/2605.24263v1 Program Synthesis for Non-Linear Real Arithmetic: Going Beyond Realizability 2026-05-22T22:33:06Z We study the problem of synthesizing programs from nonlinear real arithmetic (NRA) specifications. Existing techniques, such as syntax-guided synthesis (SyGuS), fail to synthesize programs when the specification is unrealizable. We argue this is unsatisfactory in many situations, and aim to synthesize programs from arbitrary NRA specifications, such that for any input, the synthesized program either produces outputs satisfying the specification or reports non-existence of any such output. To avoid rounding errors inherent in floating-point arithmetic, we restrict our programs to work on rational inputs and outputs. We first show that our variant of the synthesis problem is as hard as a long-standing open problem in number theory, and that synthesizing loop-free programs from arbitrary NRA specifications with rational inputs and outputs is impossible in general. Second, we present a sound and complete synthesis algorithm for the case where the specification involves a single output variable. We also show that for realizable specifications, a program generated by SyGuS for NRA (real inputs and outputs) serves as a solution to our problem, where inputs and outputs are rationals. Third, we provide a sound (but necessarily incomplete) synthesis algorithm for the general case of specifications. We have implemented our approach in a prototype tool called NQSynth that solves many benchmarks beyond the reach of state-of-the-art SyGuS tools, even when we render the specifications realizable. 2026-05-22T22:33:06Z S. Akshay Supratik Chakraborty R. Govind Aniruddha R. Joshi http://arxiv.org/abs/2604.17592v2 TensorRocq: Enabling diagrammatic reasoning in Rocq 2026-05-22T21:45:54Z Symmetric monoidal categories (SMCs) are a common framework for reasoning about computation, focusing on the parallel and sequential compositionality of operations. String diagrams are a ubiquitous and powerful tool for reasoning about equations in SMCs, leveraging eliding the fine details of compositionality to focus on connectivity. However, when working with SMCs in a proof assistant, the rigid equational structure of composition occludes the essential connective information, longer proofs filled with uninformative syntactic manipulation. To address the gap between proof assistants and paper proof, we have developed verified tools for diagrammatic reasoning in Rocq, including inferring term equivalence and rewriting modulo the deformation of string diagrams. This is achieved by converting between syntactic representations of SMC terms and hypergraphs with interfaces, while preserving a common tensor semantics. We provide tools to develop simple SMC theories from generators and relations, and perform equational reasoning these systems. We also enable our tactics to be used in existing verification projects about SMCs which can be given semantics as tensor expressions. 2026-04-19T19:52:46Z 23 pages, 4 figures Benjamin Caldwell William Spencer Aleks Kissinger Robert Rand http://arxiv.org/abs/2605.23772v1 Agentic Proving for Program Verification 2026-05-22T15:41:27Z Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification. 2026-05-22T15:41:27Z Alessandro Sosso Akhil Arora Bas Spitters http://arxiv.org/abs/2402.16741v6 Less is More Revisited: Association with Global Protocols and Multiparty Sessions 2026-05-22T13:38:37Z Ensuring correctness of communication in distributed systems remains challenging. To address this, Multiparty session types (MPST), initially introduced by Honda et al. [52, 53], offer a type discipline in which a programmer or architect specifies an overall view of communication as a global protocol (global type), and each distributed program is locally type-checked against its end-point projection. In practice, the MPST framework has been integrated into over 25 programming languages or tools. Ten years after the emergence of MPST, Scalas and Yoshida [84] discovered that existing proofs of type safety using end-point projection with mergeability are flawed, where the mergeability operator enlarges the typability of MPST end-point programs, admits easy implementation, and is more efficient than alternative approaches, including model checking. Nevertheless, following the result in [84], the soundness of end-point projection (with mergeability) has been interpreted in the literature as problematic. We clarify this concern by proposing a new general proof technique for type soundness (subject reduction) of multiparty session $π$-calculus, which relies on an association relation between the behavioural semantics of a global type and its end-point projection. With this approach, behavioural properties, namely session fidelity, deadlock freedom, and liveness, are also guaranteed based on global types. Additionally, we provide detailed comparisons with existing MPST typing systems and discuss their respective proof methods for type soundness. 2024-02-26T17:05:11Z Ping Hou Nobuko Yoshida Iona Kuhn http://arxiv.org/abs/2605.23570v1 Misleading Microbenchmarks on the Java Virtual Machines 2026-05-22T12:38:12Z Developers often use microbenchmarks to choose the most performant implementation of a method or a class. On the Java Virtual Machine (JVM), this is commonly done using the Java Microbenchmark Harness (JMH) which addresses common pitfalls of measuring code performance on the JVM. However, even using JMH guidelines cannot overcome the fundamental issue of context. Microbenchmarks inherently execute code in isolation, without interference from other application code competing for CPU resources, such as cache or branch-predictor capacity. On managed runtimes with tiered dynamic compilation, such as the JVM, the speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution. Because profiles usually include also branch probabilities and receiver types (besides code hotness metrics), a badly designed microbenchmark may cause the JVM to collect an unrealistic profile, resulting in aggressive, yet misleading, optimizations, that would not occur in a real application. In this paper, we demonstrate how using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. We also extend these guidelines by suggesting actions to make the microbenchmark results more representative. 2026-05-22T12:38:12Z The 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), March 23--27, 2026, Thessaloniki, Greece Filippo Schiavio Lubomír Bulej Walter Binder 10.1145/3748522.3779882 http://arxiv.org/abs/2605.23543v1 JEDI: Java Evaluation of Declarative and Imperative Queries 2026-05-22T12:07:55Z The Java Stream API aims at increasing developer productivity thanks to an easy-to-read declarative syntax to express computations. It also simplifies parallel computing, providing a high-level abstraction on top of common parallelization aspects. Unfortunately, there is a lack of benchmarks specifically targeting stream-based applications. Such a lack of benchmarks makes it difficult for researchers and developers of the Java class library to optimize the Stream API. Moreover, in the absence of dedicated benchmarks, it is difficult to analyze the performance of streams to suggest developers how to write efficient code using the API. In this work we present JEDI, a benchmark suite that targets the Stream API. JEDI is automatically generated by converting SQL benchmarks into Java benchmarks. Our code generator supports targets different implementations (both stream-based and imperative) for the same query. The ultimate goal of our benchmark suite -- and the main contribution of this work -- is to analyze the performance of the different implementations to spot inefficient code structures and better alternatives, suggesting best practices to Java developers. Among the multiple implementations we generate, we focus on different parallelization strategies and explain the most efficient parallelization strategies based on characteristics of the processed data. Finally, the code generation producing imperative code defines of a baseline that can guide researchers and Java implementers to optimize the Stream API. 2026-05-22T12:07:55Z 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE '26) Filippo Schiavio Walter Binder 10.1145/3744916.3773165 http://arxiv.org/abs/2511.12276v2 Reflections on the design, applications and implementations of the normative specification language eFLINT 2026-05-22T11:37:01Z Checking the compliance of software against laws, regulations and contracts is increasingly important and costly as the embedding of software into societal practices is becoming more pervasive. Moreover, the digitalised services provided by governmental organisations and companies are governed by an increasing amount of laws and regulations, requiring highly adaptable compliance practices. A potential solution is to automate compliance using software. However, automating compliance is difficult for various reasons. Legal practices involve subjective processes such as interpretation and qualification. New laws and regulations come into effect regularly and laws and regulations, as well as their interpretations, are subjected to constant revision. In addition, computational reasoning with laws requires a cross-disciplinary process involving both legal and software expertise. This paper reflects on the domain-specific language eFLINT developed to experiment with novel solutions to these challenges. Specifically, the language has been developed to experiment with the abstract syntax and semantics of a language supporting different types of reasoning for various applications. The language combines declarative and procedural elements, formalises connections between legal concepts and computational concepts, and is designed to automate compliance checks before, during and after a software system runs. The various design goals and applications areas for the language give rise to (conflicting) requirements. This paper presents and reflects on the current design of the language by recalling applications and requirements. As such, this paper reports on results and insights of an investigation that can benefit language developers within the field of automated compliance. 2025-11-15T16:09:31Z 29 pages L. Thomas van Binsbergen Christopher A. Esterhuyse Tim Müller http://arxiv.org/abs/2605.23435v1 MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization 2026-05-22T09:45:56Z Compiler phase ordering has a strong effect on program performance. Finding an effective sequence of passes is still a difficult task because the search space is large and execution time, code size and energy consumption often conflict. Existing methods usually depend on fixed optimization levels or limited heuristics and they rarely handle multiple objectives at the same time. This paper presents MileStone, a modular framework that models compiler phase ordering as a multi-objective optimization problem. MileStone represents programs as graphs, predicts performance metrics with a graph neural network and explores pass sequences with a reinforcement-learning agent that follows user constraints. The framework also builds a self-evolving database that collects compiler transformations and improves prediction quality. Experiments on standard benchmarks show that MileStone finds strong Pareto-optimal solutions, meets energy limits more accurately than LLVM optimization levels and other related techniques. MileStone reduces execution time by up to 45 percent under the same energy budget using a multi-objective approach. The results show that MileStone provides an effective and scalable solution for multi-objective compiler phase ordering. 2026-05-22T09:45:56Z Amirhosein Sadr Mehran Alidoost Nia http://arxiv.org/abs/2605.23358v1 A Compilation Framework for Quantum Simulation of Non-unitary Dynamics 2026-05-22T08:22:31Z Most quantum compilers assume programs are reversible unitary circuits. This fits closed-system algorithms, but not open-system simulation, where the natural program objects are quantum channels describing non-unitary dynamics. We present a channel-first compilation framework that treats channels as first-class compilation objects. Our core IR, ChannelIR, represents channels explicitly in Kraus form, a standard channel representation, with Pauli-sum structure, enabling algebraic rewrites before circuit synthesis. We instantiate the framework with LindFront, a frontend that lowers continuous-time Lindbladian generators to short-time channels, and a backend that compiles these channels to executable circuits with structure-aware optimizations. On Lindbladian and channel-simulation benchmarks, the optimized pipeline reduces gate count by up to 99% over an unoptimized channel-first baseline and scales better than circuit-first Stinespring compilation. 2026-05-22T08:22:31Z 35 pages, 15 figures Qifan Huang Minbo Gao Li Zhou Mingsheng Ying http://arxiv.org/abs/2605.19005v2 Rewrite System Showdown: Stochastic Search vs. EqSat 2026-05-22T01:08:52Z Equality saturation has become a dominant paradigm for equational program optimization. However, it has never been rigorously compared to another approach to the same problem, even though several exist, the most notable being stochastic search. In this paper, we compare equality saturation to stochastic search over five benchmarks to answer the question: are e-graphs actually good? 2026-05-18T18:27:33Z To be presented at EGRAPHS 2026 Qiantan Hong Rupanshu Soi Yihong Zhang Alex Aiken