https://arxiv.org/api/7ZWtDu3xK2CsQZwpf3zs5w2x4DY2026-06-13T17:50:59Z988510515http://arxiv.org/abs/2605.25180v1DateSAT: A Framework for Solving Date and Period Constraints2026-05-24T17:20:45ZDates and calendar periods (i.e., days, months, years) appear frequently in tasks involving analysis of software, data, and documents. Prior research has shown that computer logic involving dates and calendrical calculations is error-prone due to tricky rules (e.g., irregularly sized months), ambiguities (e.g., scheduling one month from "Jan 31st"), and edge cases (e.g., leap years). However, existing program analysis and verification tools do not provide native support for dates, making it hard to reason about operations involving calendrical arithmetic symbolically.
This paper presents DateSAT, the first framework for expressing and solving satisfiability constraints involving dates and calendar periods. The paper first formalizes an input language and the semantics of date and period arithmetic. The paper then presents five separate strategies for solving DateSAT constraints based on reductions to SMT formulas involving integers, which we have implemented using Z3 as a backend. We curate a dataset of 450 DateSAT constraints synthesized using LLM prompting, grammar-based sampling, and mining legal documents, and then present an empirical evaluation of DateSAT solver performance.2026-05-24T17:20:45ZLeyi CuiShrey TiwariRohan Padhyehttp://arxiv.org/abs/2510.08609v3Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?2026-05-24T16:38:01ZDevelopers consistently use version constraints to specify acceptable versions of the dependencies for their project. Pinning dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, floating can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate pinning dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since pinning is the tightest version constraint, pinning is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using pinning as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is floating-minor, with pinning being the next most common. We also find that floating-major is the least likely to result in outdated and floating-minor is the least likely to result in vulnerable dependencies.2025-10-07T14:37:03ZAccepted to ASE 20252025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)Imranur RahmanJill MarleyWilliam EnckLaurie Williams10.1109/ASE63991.2025.00229http://arxiv.org/abs/2602.22631v2TorchLean: Formalizing Neural Networks in Lean2026-05-24T04:59:07ZNeural networks are increasingly deployed in scientific, safety critical, and mission critical pipelines, yet verification and analysis are often performed outside the programming environment that defines and runs the model. This creates a semantic gap between the executed network and the analyzed artifact: guarantees can depend on implicit conventions about operator semantics, tensor layouts, preprocessing, floating-point behavior, graph transformations, accelerated kernels, and external certificates. We present TorchLean, a unified framework for formalizing, executing, and verifying neural networks in Lean 4. TorchLean treats learned models as executable programs and mathematical objects with a shared semantics for computation, verification, and theorem proving. The framework provides a PyTorch style API for typed tensors, layers, objectives, optimizers, automatic differentiation, and graph programs, with eager and compiled execution paths that lower to a common computation-graph representation. TorchLean supports exact and finite-precision tensor semantics, verified reverse-mode differentiation, interval and affine bound propagation, CROWN/LiRPA style certificate checking, import/export workflows, and CUDA-backed execution through explicit FFI boundaries. It also includes semantic layers for attention and FlashAttention, state-space sequence models, diffusion and sampling processes, probability kernels, reinforcement-learning objectives and Markov decision processes, and self-supervised objectives such as masked autoencoding, JEPA-style predictive views, and variance/correlation-based anti-collapse losses. Together, these components provide a semantic foundation for verified machine learning, where executable neural network artifacts, verification procedures, runtime boundaries, and mathematical claims can be stated and related inside one theorem-proving environment.2026-02-26T05:11:44Z55 pagesRobert Joseph GeorgeJennifer CrudenWill AdkissonXiangru ZhongHuan ZhangAnima Anandkumarhttp://arxiv.org/abs/2509.05586v4Fixed Parameter Tractable Linearizability Monitoring2026-05-23T15:55:12ZWe study the linearizability monitoring problem, which asks whether a given concurrent history of a data structure is equivalent to some sequential execution of the same data structure. In general, this problem is $\textsf{NP}$-hard, even for simple objects such as registers. Recent work has identified tractable cases for restricted classes of histories, notably unambiguous and differentiated histories.
We revisit the tractability boundary from a fine-grained, parameterized perspective. We show that for a broad class of data structures -- including stacks, queues, priority queues, and maps -- linearizability monitoring is fixed-parameter tractable when parameterized by the number of processes. Concretely, we give an algorithm running in time $O(c^{k} \cdot \textsf{poly}(n))$, where $n$ is the history size, $k$ is the number of processes, and $c$ is a constant, yielding efficient performance when $k$ is small. Our approach reduces linearizability monitoring to a language reachability problem on graphs, which asks whether a labeled graph admits a path whose label sequence belongs to a fixed language $L$. We identify classes of languages that capture the sequential specifications of the above data structures and show that language reachability is efficiently solvable on the graph structures induced by concurrent histories.
Our results complement prior hardness results and existing tractable subclasses, and provide a unified algorithmic framework. We implement our approach and demonstrate significant runtime improvements over existing algorithms, which exhibit exponential worst-case behavior.2025-09-06T04:18:43ZTo appear in proceedings of PLDI 2026Lee Zheng HanUmang Mathur10.1145/3808315http://arxiv.org/abs/2604.11811v2M$^\star$: Every Task Deserves Its Own Memory Harness2026-05-23T08:43:43ZLarge language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.2026-04-10T03:22:26ZPreprint. Code: https://github.com/wbopan/mstar ; Live demo: https://mstar.wenbo.ioWenbo PanShujie LiuXiangyang ZhouShiwei ZhangWanlu ShiMirror XuXiaohua Jiahttp://arxiv.org/abs/2605.24263v1Program Synthesis for Non-Linear Real Arithmetic: Going Beyond Realizability2026-05-22T22:33:06ZWe study the problem of synthesizing programs from nonlinear real arithmetic (NRA) specifications. Existing techniques, such as syntax-guided synthesis (SyGuS), fail to synthesize programs when the specification is unrealizable. We argue this is unsatisfactory in many situations, and aim to synthesize programs from arbitrary NRA specifications, such that for any input, the synthesized program either produces outputs satisfying the specification or reports non-existence of any such output. To avoid rounding errors inherent in floating-point arithmetic, we restrict our programs to work on rational inputs and outputs.
We first show that our variant of the synthesis problem is as hard as a long-standing open problem in number theory, and that synthesizing loop-free programs from arbitrary NRA specifications with rational inputs and outputs is impossible in general. Second, we present a sound and complete synthesis algorithm for the case where the specification involves a single output variable. We also show that for realizable specifications, a program generated by SyGuS for NRA (real inputs and outputs) serves as a solution to our problem, where inputs and outputs are rationals. Third, we provide a sound (but necessarily incomplete) synthesis algorithm for the general case of specifications. We have implemented our approach in a prototype tool called NQSynth that solves many benchmarks beyond the reach of state-of-the-art SyGuS tools, even when we render the specifications realizable.2026-05-22T22:33:06ZS. AkshaySupratik ChakrabortyR. GovindAniruddha R. Joshihttp://arxiv.org/abs/2604.17592v2TensorRocq: Enabling diagrammatic reasoning in Rocq2026-05-22T21:45:54ZSymmetric monoidal categories (SMCs) are a common framework for reasoning about computation, focusing on the parallel and sequential compositionality of operations. String diagrams are a ubiquitous and powerful tool for reasoning about equations in SMCs, leveraging eliding the fine details of compositionality to focus on connectivity. However, when working with SMCs in a proof assistant, the rigid equational structure of composition occludes the essential connective information, longer proofs filled with uninformative syntactic manipulation. To address the gap between proof assistants and paper proof, we have developed verified tools for diagrammatic reasoning in Rocq, including inferring term equivalence and rewriting modulo the deformation of string diagrams. This is achieved by converting between syntactic representations of SMC terms and hypergraphs with interfaces, while preserving a common tensor semantics. We provide tools to develop simple SMC theories from generators and relations, and perform equational reasoning these systems. We also enable our tactics to be used in existing verification projects about SMCs which can be given semantics as tensor expressions.2026-04-19T19:52:46Z23 pages, 4 figuresBenjamin CaldwellWilliam SpencerAleks KissingerRobert Randhttp://arxiv.org/abs/2605.23772v1Agentic Proving for Program Verification2026-05-22T15:41:27ZAgentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.2026-05-22T15:41:27ZAlessandro SossoAkhil AroraBas Spittershttp://arxiv.org/abs/2402.16741v6Less is More Revisited: Association with Global Protocols and Multiparty Sessions2026-05-22T13:38:37ZEnsuring correctness of communication in distributed systems remains challenging. To address this, Multiparty session types (MPST), initially introduced by Honda et al. [52, 53], offer a type discipline in which a programmer or architect specifies an overall view of communication as a global protocol (global type), and each distributed program is locally type-checked against its end-point projection. In practice, the MPST framework has been integrated into over 25 programming languages or tools. Ten years after the emergence of MPST, Scalas and Yoshida [84] discovered that existing proofs of type safety using end-point projection with mergeability are flawed, where the mergeability operator enlarges the typability of MPST end-point programs, admits easy implementation, and is more efficient than alternative approaches, including model checking. Nevertheless, following the result in [84], the soundness of end-point projection (with mergeability) has been interpreted in the literature as problematic. We clarify this concern by proposing a new general proof technique for type soundness (subject reduction) of multiparty session $π$-calculus, which relies on an association relation between the behavioural semantics of a global type and its end-point projection. With this approach, behavioural properties, namely session fidelity, deadlock freedom, and liveness, are also guaranteed based on global types. Additionally, we provide detailed comparisons with existing MPST typing systems and discuss their respective proof methods for type soundness.2024-02-26T17:05:11ZPing HouNobuko YoshidaIona Kuhnhttp://arxiv.org/abs/2605.23570v1Misleading Microbenchmarks on the Java Virtual Machines2026-05-22T12:38:12ZDevelopers often use microbenchmarks to choose the most performant implementation of a method or a class. On the Java Virtual Machine (JVM), this is commonly done using the Java Microbenchmark Harness (JMH) which addresses common pitfalls of measuring code performance on the JVM. However, even using JMH guidelines cannot overcome the fundamental issue of context.
Microbenchmarks inherently execute code in isolation, without interference from other application code competing for CPU resources, such as cache or branch-predictor capacity. On managed runtimes with tiered dynamic compilation, such as the JVM, the speculative, profile-driven nature of compilation decisions means that code performance is highly dependent on profiles collected during early execution. Because profiles usually include also branch probabilities and receiver types (besides code hotness metrics), a badly designed microbenchmark may cause the JVM to collect an unrealistic profile, resulting in aggressive, yet misleading, optimizations, that would not occur in a real application.
In this paper, we demonstrate how using microbenchmarks under conditions that induce the JVM to collect unrealistic profiles yields misleading results despite following existing guidelines. We also extend these guidelines by suggesting actions to make the microbenchmark results more representative.2026-05-22T12:38:12ZThe 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), March 23--27, 2026, Thessaloniki, GreeceFilippo SchiavioLubomír BulejWalter Binder10.1145/3748522.3779882http://arxiv.org/abs/2605.23543v1JEDI: Java Evaluation of Declarative and Imperative Queries2026-05-22T12:07:55ZThe Java Stream API aims at increasing developer productivity thanks to an easy-to-read declarative syntax to express computations. It also simplifies parallel computing, providing a high-level abstraction on top of common parallelization aspects. Unfortunately, there is a lack of benchmarks specifically targeting stream-based applications. Such a lack of benchmarks makes it difficult for researchers and developers of the Java class library to optimize the Stream API. Moreover, in the absence of dedicated benchmarks, it is difficult to analyze the performance of streams to suggest developers how to write efficient code using the API.
In this work we present JEDI, a benchmark suite that targets the Stream API. JEDI is automatically generated by converting SQL benchmarks into Java benchmarks. Our code generator supports targets different implementations (both stream-based and imperative) for the same query. The ultimate goal of our benchmark suite -- and the main contribution of this work -- is to analyze the performance of the different implementations to spot inefficient code structures and better alternatives, suggesting best practices to Java developers. Among the multiple implementations we generate, we focus on different parallelization strategies and explain the most efficient parallelization strategies based on characteristics of the processed data. Finally, the code generation producing imperative code defines of a baseline that can guide researchers and Java implementers to optimize the Stream API.2026-05-22T12:07:55Z2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE '26)Filippo SchiavioWalter Binder10.1145/3744916.3773165http://arxiv.org/abs/2511.12276v2Reflections on the design, applications and implementations of the normative specification language eFLINT2026-05-22T11:37:01ZChecking the compliance of software against laws, regulations and contracts is increasingly important and costly as the embedding of software into societal practices is becoming more pervasive. Moreover, the digitalised services provided by governmental organisations and companies are governed by an increasing amount of laws and regulations, requiring highly adaptable compliance practices. A potential solution is to automate compliance using software. However, automating compliance is difficult for various reasons. Legal practices involve subjective processes such as interpretation and qualification. New laws and regulations come into effect regularly and laws and regulations, as well as their interpretations, are subjected to constant revision. In addition, computational reasoning with laws requires a cross-disciplinary process involving both legal and software expertise.
This paper reflects on the domain-specific language eFLINT developed to experiment with novel solutions to these challenges. Specifically, the language has been developed to experiment with the abstract syntax and semantics of a language supporting different types of reasoning for various applications. The language combines declarative and procedural elements, formalises connections between legal concepts and computational concepts, and is designed to automate compliance checks before, during and after a software system runs. The various design goals and applications areas for the language give rise to (conflicting) requirements. This paper presents and reflects on the current design of the language by recalling applications and requirements. As such, this paper reports on results and insights of an investigation that can benefit language developers within the field of automated compliance.2025-11-15T16:09:31Z29 pagesL. Thomas van BinsbergenChristopher A. EsterhuyseTim Müllerhttp://arxiv.org/abs/2605.23435v1MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization2026-05-22T09:45:56ZCompiler phase ordering has a strong effect on program performance. Finding an effective sequence of passes is still a difficult task because the search space is large and execution time, code size and energy consumption often conflict. Existing methods usually depend on fixed optimization levels or limited heuristics and they rarely handle multiple objectives at the same time. This paper presents MileStone, a modular framework that models compiler phase ordering as a multi-objective optimization problem. MileStone represents programs as graphs, predicts performance metrics with a graph neural network and explores pass sequences with a reinforcement-learning agent that follows user constraints. The framework also builds a self-evolving database that collects compiler transformations and improves prediction quality. Experiments on standard benchmarks show that MileStone finds strong Pareto-optimal solutions, meets energy limits more accurately than LLVM optimization levels and other related techniques. MileStone reduces execution time by up to 45 percent under the same energy budget using a multi-objective approach. The results show that MileStone provides an effective and scalable solution for multi-objective compiler phase ordering.2026-05-22T09:45:56ZAmirhosein SadrMehran Alidoost Niahttp://arxiv.org/abs/2605.23358v1A Compilation Framework for Quantum Simulation of Non-unitary Dynamics2026-05-22T08:22:31ZMost quantum compilers assume programs are reversible unitary circuits. This fits closed-system algorithms, but not open-system simulation, where the natural program objects are quantum channels describing non-unitary dynamics. We present a channel-first compilation framework that treats channels as first-class compilation objects. Our core IR, ChannelIR, represents channels explicitly in Kraus form, a standard channel representation, with Pauli-sum structure, enabling algebraic rewrites before circuit synthesis. We instantiate the framework with LindFront, a frontend that lowers continuous-time Lindbladian generators to short-time channels, and a backend that compiles these channels to executable circuits with structure-aware optimizations. On Lindbladian and channel-simulation benchmarks, the optimized pipeline reduces gate count by up to 99% over an unoptimized channel-first baseline and scales better than circuit-first Stinespring compilation.2026-05-22T08:22:31Z35 pages, 15 figuresQifan HuangMinbo GaoLi ZhouMingsheng Yinghttp://arxiv.org/abs/2605.19005v2Rewrite System Showdown: Stochastic Search vs. EqSat2026-05-22T01:08:52ZEquality saturation has become a dominant paradigm for equational program optimization. However, it has never been rigorously compared to another approach to the same problem, even though several exist, the most notable being stochastic search. In this paper, we compare equality saturation to stochastic search over five benchmarks to answer the question: are e-graphs actually good?2026-05-18T18:27:33ZTo be presented at EGRAPHS 2026Qiantan HongRupanshu SoiYihong ZhangAlex Aiken