https://arxiv.org/api/naaqyVDysZ+dmrEK+Py1l6UqbDo 2026-06-13T22:29:45Z 9885 165 15 http://arxiv.org/abs/2605.14780v1 Mat2Boundary: Treating User-Defined Boundary Condition as SpMV for Distributed PDE Solvers on Block-Structured Grids 2026-05-14T12:49:09Z

Boundary-condition (BC) handling is a major source of complexity in PDE solvers on structured and block-structured grids, especially for high-order methods and distributed-memory execution. We present Mat2Boundary, a DSL and compiler for boundary computations that models a broad class of boundary-conditions as affine sparse linear operators. This abstraction unifies halo copying, circular and symmetric mappings, zero padding, block-edge synchronization, and user-defined interpolation, while exposing a modular basic sub-matrix interface for declarative composition. To make this representation efficient, Mat2Boundary combines multi-stage programming and polyhedral analysis to generate matrix-free kernels for structured cases, support user-defined sparse matrices for irregular cases, eliminate redundant boundary work, and synthesize reusable communication schedules for distributed execution. Evaluated on two shallow-water equation solvers on cubed-sphere grids and HPCG, Mat2Boundary achieves up to 7.6$\times$ BC-kernel speedup, reduces BC code by over 70%, and scales to 1,344 CPU cores with 72%-88% efficiency.

2026-05-14T12:49:09Z Yanzheng Cai Department of Computer Science and Technology & BRNist, Tsinghua University, Beijing, China Mingzhe Zhang Department of Computer Science and Technology & BRNist, Tsinghua University, Beijing, China Shengqi Chen Department of Computer Science and Technology & BRNist, Tsinghua University, Beijing, China Haoyuan Song Department of Computer Science and Technology & BRNist, Tsinghua University, Beijing, China Wenguang Chen Department of Computer Science and Technology & BRNist, Tsinghua University, Beijing, China http://arxiv.org/abs/2511.21104v4 BRIDGE: Building Representations In Domain Guided Program Synthesis 2026-05-14T05:18:06Z

Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.

2025-11-26T06:39:19Z 41 pages, 10 figures, 3 tables. Preprint Robert Joseph George Carson Eisenach Udaya Ghai Dominique Perrault-Joncas Anima Anandkumar Dean Foster http://arxiv.org/abs/2605.15238v1 Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support 2026-05-14T03:18:16Z

Large language models are increasingly used for code generation, but many generated programs fail to compile, a prerequisite for further correctness checks such as unit tests. Existing solutions for repairing static errors are costly in both latency and token consumption. Post-hoc repair delays error detection until generation completes and commonly regenerates large regions of previously valid code. Constrained semantic decoding checks after each token, incurring per-token overhead while limiting repair to the current token even when the root cause lies earlier. We present Hydra, a system for efficient recovery from static errors during code generation. Hydra allows checking to proceed asynchronously with generation, avoiding checker overhead when the generated code is semantically correct. In addition, it provides checkpoint-and-rollback support for targeted repair, avoiding regeneration and rechecking of valid prefixes. We retrofit the Clang C/C++ compiler to support Hydra with modest modifications. Paired with a token-efficient repair strategy, Hydra reduces latency by up to 71% and token consumption by up to 70% relative to post-hoc repair on C/C++ code generation tasks that encounter static errors.

2026-05-14T03:18:16Z Alexander Du Jianjun Ou Danyang Zhuo Matthew Lentz http://arxiv.org/abs/2605.08419v2 Deterministic Fully-Static Whole-Binary Translation without Heuristics 2026-05-13T21:22:17Z

We present Elevator, the first binary translator that statically translates entire x86-64 executables to AArch64 without debug information, source code, or assumptions about code layout. Unlike existing systems, which rely on heuristics or runtime fallbacks to handle code-versus-data decoding errors, Elevator considers all possible interpretations of every byte and produces a separate translation for each feasible one ahead of time. Any byte may be interpreted as data, an opcode, or an opcode argument; we generate separate control flow paths for all interpretations, pruning only those leading to abnormal termination. Translations are built by composing code "tiles" automatically derived from a high-level description of the source ISA, yielding a nimble translation framework. The approach is deterministic and produces complete, self-contained binaries with no runtime component in the trusted code base. The principal cost is substantial code size expansion. The key benefit is that the output is the actual code that will run, enabling testing, validation, certification, and cryptographic signing prior to deployment, reducing risk compared to emulators or JIT compilers. We evaluate Elevator on a diverse corpus of real-world binaries, including the entire SPECint 2006 suite, demonstrating that static full-program binary translation can be both reliable and practical. Elevator achieves performance on par with or better than QEMU's user-mode JIT emulation.

2026-05-08T19:25:06Z Added acknowledgements Hongyu Chen James McGowan Michael Franz http://arxiv.org/abs/2605.13780v1 On the Complexity of Checking Soundness of Natural Reductions (Extended Version) 2026-05-13T17:01:51Z

The verification of reductions, representative subsets of interleavings, simplifies correctness proofs of parameterized concurrent programs. We introduce an expressive class of syntactic reductions, which we call natural reductions. Natural reductions are specified by introducing atomic blocks and global rendezvous points in the parameterized program's thread template. We study the problem of deciding whether a given natural reduction is sound wrt. a given (semi-)commutativity relation. In the case that there is no synchronization between threads, we present a sound and complete polynomial-time algorithm. In the case where synchronization is considered, we provide a general lower bound for the problem (parametric in the synchronization mechanism), and show that the problem is coNP-hard already for a simple mechanism like locking.

2026-05-13T17:01:51Z 31 pages, extended version (with proofs) of the paper accepted at CAV'26 Constantin Enea Azadeh Farzan Dominik Klumpp http://arxiv.org/abs/2511.01736v2 Cobble: Compiling Block Encodings for Quantum Computational Linear Algebra 2026-05-13T17:01:05Z

Quantum algorithms for computational linear algebra promise up to exponential speedups for applications such as simulation and regression, making them prime candidates for hardware realization. But these algorithms execute in a model that cannot efficiently store matrices in memory like a classical algorithm does, instead requiring developers to implement complex expressions for matrix arithmetic in terms of correct and efficient quantum circuits. Among the challenges for the developer is navigating a cost model in which conventional optimizations for linear algebra, such as subexpression reuse, can be inapplicable or unprofitable. In this work, we present Cobble, a language for programming with quantum computational linear algebra. Cobble enables developers to express and manipulate the quantum representations of matrices, known as block encodings, using high-level notation that automatically compiles to correct quantum circuits. Cobble features analyses that compute the time and space usage of programs, as well as optimizations that reduce overhead and generate efficient circuits using state-of-the-art techniques such as the quantum singular value transformation. We evaluate Cobble on benchmark kernels for simulation, regression, search, and other applications, showing 2.6x-25.4x speedups on these benchmarks compared to the unoptimized baseline.

2025-11-03T16:48:13Z 22 pages, 13 figures. v2: published version of paper Proc. ACM Program. Lang., Vol. 10, No. PLDI, Article 177. Publication date: June 2026 Charles Yuan 10.1145/3808255 http://arxiv.org/abs/2605.13765v1 First Steps Towards Probabilistic Iris: Harmonizing Independence, Conditioning, and Dynamic Heap Allocation 2026-05-13T16:46:19Z

There has recently been exciting progress in the realm of *probabilistic separation logics*. An important subclass of these -- including PSL, Lilac, Bluebell, and pcOL -- are *general-purpose probabilistic logics* (or GPLs, for short), meaning that they provide primitive Hoare-style assertions about probability distributions on the program state, along with powerful modularity principles like *independence* and *conditioning*. However, none of these logics support reasoning about dynamically allocated memory (i.e., pointers into a heap), let alone the more sophisticated resource algebra-based ghost state of modern separation logics like Iris. We argue that this is due to a fundamental obstacle: since the shape of memory (and identity of memory locations) may differ under different random outcomes, it is unclear how pointer ownership can be harmonized with probabilistic independence and conditioning. Furthermore, none of the existing GPLs have been mechanized in a proof assistant. In this paper, we take a substantial first step towards a marriage of GPLs and modern separation logics like Iris, in the form of **Amaryllis**. Amaryllis is the first GPL to support independence and conditional reasoning while also handling dynamic memory allocation. To overcome the aforementioned obstacle, we propose a new *indexed valuation*-style model of probabilistic assertions, whereby ownership of a standard Iris-style resource (e.g., heaps) can be promoted to ownership at the level of distributions in a generic fashion. We then show how to adapt the central Iris notions of *frame-preserving update*, *authoritative resource algebras*, and the *weakest precondition modality* to be sound for probabilistic reasoning and validate dynamic allocation. Finally, we have mechanized all our results in the Rocq proof assistant and developed an Iris-based proof mode for conducting proofs within Amaryllis.

2026-05-13T16:46:19Z Janine Lohse Tim Rohde Jimmy Xin Niklas Mück Iona Kuhn Derek Dreyer Deepak Garg Emanuele D'Osualdo http://arxiv.org/abs/2605.13929v1 Linear-Time T-Gate Optimization via Random Abstraction 2026-05-13T15:54:13Z

Quantum computers promise exponential speedups for problems in cryptography, chemistry, and optimization. Realizing this promise requires fault tolerance: physical qubits are noisy, so logical qubits must be encoded redundantly across many physical ones using quantum error-correcting codes. In most practical fault-tolerance schemes, T gates cannot be implemented transversally and instead require costly magic-state distillation protocols involving a complex set of operations. As a result, T-gate count can dominate the resource budget of large-scale quantum computations, making T-count minimization a central bottleneck on the path to quantum advantage. Existing T-count optimization tools, however, do not scale to the circuits that quantum advantage demands. We present theoretical and practical results on T-gate optimization. On the theoretical side, we give a linear-time randomized algorithm for phase folding, based on a novel randomized static analysis. Our static analysis soundly approximates the set of reachable quantum states with an arbitrarily high probability. Our key insight is a static analysis that does not track symbolic expressions, but propagates constant-width bitstrings down the circuit. On the practical side, our implementation, TZAP, is multiple orders of magnitude faster than state-of-the-art tools -- such as PyZX, VOQC, and Feynman -- closely matches their T-count reductions on standard benchmarks, and within seconds on a laptop computer can optimize circuits with millions of gates.

2026-05-13T15:54:13Z Aws Albarghouthi http://arxiv.org/abs/2605.13456v1 Liquid Tree Automata 2026-05-13T12:49:32Z

Component-based synthesis (CBS) generates loop-free programs from library components to satisfy logical queries. While expressive specifications and precise queries simplify the solution space, they make finding feasible execution paths significantly more difficult for traditional CBS procedures. As constraints become more exact, the search must navigate an increasingly sparse space of valid paths. We address this challenge by reasoning about \emph{logical similarities} between exploration paths. We consider library methods equipped with refinement-type specifications, which enrich base types with logical qualifiers to precisely constrain the value space. To efficiently explore this space, we introduce Liquid Tree Automata (LTA), a novel tree automata variant whose construction is driven by refinement typing rules. LTAs leverage subtyping constraints to identify and eagerly merge semantically similar states during search. This merging avoids redundant exploration of equivalent paths, significantly improving synthesis efficiency. We implement this approach in a tool called Hegel. Our evaluation demonstrates that Hegel synthesizes solutions to complex queries that are beyond the reach of existing state-of-the-art tools.

2026-05-13T12:49:32Z arXiv admin note: text overlap with arXiv:2508.14614 Ashish Mishra Suresh Jagannathan http://arxiv.org/abs/2605.16407v1 Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture 2026-05-13T12:01:41Z

We present a framework for verifying the deterministic structured computations surrounding a large language model rather than the model itself, extending a Lean 4 trust-boundary architecture to the generic interfaces of modern LLM pipelines. Certificate validity is a Lean 4 kernel type-check plus a sorry-free transitive axiom audit against the trusted set {propext, Classical.choice, Quot.sound}; other assumptions are declared and partitioned by tier (mathematical placeholders, cryptographic assumptions, ML/human oracles). The technical contribution comprises three local certificate families and two operators. The families are conflict-aware bilattice grounding (with an emission-gate soundness lemma), embedding sensitivity and paraphrase stability, and Hoare-style agent action. The operators are a Maximal Certifiable Residue, which turns abstention into the maximum-weight certifiable residue with audit-logged dropped claims, and a Compositional Stability theorem, which yields a closed-form pipeline-wide perturbation budget from per-layer gains and margins. The three families plus a Universal Assurance Card consolidator form the per-call deliverable for high-stakes deployments: patent and legal retrieval, regulated finance, clinical decision support, and agentic systems with irreversible side effects. A compiled Lean 4 reference artifact (Lean v4.30.0-rc2, Mathlib) covers all 22 certificate types, with 17 of 46 kernel-audited declarations axiom-free, the rest depending only on the trusted set and declared assumptions, and zero uses of sorryAx or Lean.ofReduceBool. The three families are empirically tested through four registered pilots: bilattice grounding on adversarially perturbed HotpotQA, embedding sensitivity in short- and long-form settings, and Hoare-style agent action on a filesystem sandbox with adversarial prompt injection.

2026-05-13T12:01:41Z 83 pages, 1 figure, 12 tables George Koomullil http://arxiv.org/abs/2605.15222v1 PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization 2026-05-13T08:10:26Z

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation are still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLMs-generated code at https://anonymous.4open.science/r/perfcodebench-7CDE.

2026-05-13T08:10:26Z Huihao Jing Wenbin Hu Haochen Shi Hanyu Yang Sirui Zhang Shaojin Chen Haoran Li Yangqiu Song http://arxiv.org/abs/2605.10005v2 Combining Mechanical and Agentic Specification Inference for Move 2026-05-13T02:35:58Z

In this paper, we describe early work on a specification inference tool for the Move Prover that combines a weakest-precondition (WP) analysis over Move bytecode with an agentic coding CLI such as Claude Code. Specification inference reduces the boilerplate of writing specifications in Move: in order to verify a high-level property such as a global state invariant, pre- and post-conditions for the supporting functions typically have to be written by hand, which is tedious. In our setting, a Model Context Protocol (MCP) service exposes the WP analysis and the prover itself to the coding agent. The WP analysis provides a sound, mechanical baseline for inference; the AI is used precisely where WP is weakest -- for loop invariants and high-level idiomatic specifications such as monotonicity, conservation, and structural invariants. The Move Prover serves as the oracle that decides whether the generated specs are valid, and the agent is equipped to generate proof hints and to refine the inferred specification until verification succeeds. The tool has been applied to a corpus of canonical Move code, including code that uses higher-order functions, dynamic dispatch, global state, references, and various forms of loops.

2026-05-11T05:30:25Z Wolfgang Grieskamp Teng Zhang Vineeth Kashyap http://arxiv.org/abs/2605.10007v2 Formal Verification of Imperative First-Class Functions in Move 2026-05-13T02:34:36Z

The Move Prover (MVP) is a formal verifier for smart contracts written in the Move programming language. Recently, Move on Aptos was extended with higher-order functions: imperative functions as first-class values that can be passed around, stored in data structs, and kept in persistent storage, enabling dynamic dispatch. This paper describes the representation of function values in the Move specification language and their implementation in MVP. We introduce behavioral predicates which characterize Move functions (aborts and pre/post conditions) by single-state or two-state predicates. We also introduce state labels for naming intermediate memory states in which expressions are evaluated and which allow to compose behavioral predicates to describe sequences of state transitions. On SMT level, function values are encoded by discriminating over the possible function values reaching a call site: when the concrete function is known, its effect is accounted for directly; when it is unknown (for example, a function parameter, or a closure loaded from storage), its behavioral predicates describe the effect. Our approach goes beyond, for example, Dafny, by supporting imperative first-class functions which can modify state via Rust-style references and global variables, and leads to more efficient SMT encodings than separation logic because of the static separation of memory enabled by Move. We further extend MVP's specification inference tool to work with function values: given arbitrary higher-order Move code, weakest-precondition analysis semi-automatically derives behavioral-predicate-based specifications, reducing the annotation burden and providing a validation pipeline for the new specification constructs.

2026-05-11T05:31:55Z Wolfgang Grieskamp Teng Zhang Vineeth Kashyap Jake Silverman http://arxiv.org/abs/2605.12863v1 Language-Based Agent Control 2026-05-13T01:21:05Z

This paper introduces language-based agent control (LBAC), a new programming model for agentic applications that brings techniques from programming languages and language-based security to the problem of agent control. In conventional programming, combinations of static typing and runtime enforcement have long been used to guarantee that well-typed programs satisfy user-specified policies, including policies for access control, information flow, data provenance, and more. The key idea behind LBAC is to extend these guarantees to agentic applications by requiring agents to generate programs that are themselves well typed in the context of the surrounding scaffolding code. Unsafe programs are rejected by the type-checker before execution, allowing policies to apply uniformly across the entire application, including both agent-generated behavior and developer-written scaffolding. At the same time, LBAC preserves substantial expressiveness: agents may perform arbitrary side-effect-free computation and recursively invoke subagents, which retain full tool access subject to the same -- or potentially more restrictive -- policies. We demonstrate LBAC with three case studies: I/O sandboxing via filesystem capabilities, data provenance, and information-flow control.

2026-05-13T01:21:05Z Timothy Zhou Loris D'Antoni Nadia Polikarpova http://arxiv.org/abs/2605.18827v1 Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds 2026-05-12T20:20:06Z

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.

2026-05-12T20:20:06Z 28 Pages, 18 Figures Prateek Biswas Dhaval Patel Vedant Khandelwal Shuxin Lin Amit Sheth