https://arxiv.org/api/tY0iounqrFYJ7nPBaI0QDwyZFuc2026-06-13T16:43:49Z98859015http://arxiv.org/abs/2409.03918v2PoTo: A Hybrid Andersen's Points-to Analysis for Python2026-05-26T18:00:03ZAs Python is increasingly being adopted for large and complex programs, the importance of static analysis for Python (such as type inference) grows. Unfortunately, static analysis for Python remains a challenging task due to its dynamic language features and its abundant external libraries. To help fill this gap, this paper presents PoTo, an Andersen-style context-insensitive and flow-insensitive points-to analysis for Python. PoTo addresses Python-specific challenges and works for large programs via a novel hybrid evaluation, integrating traditional static points-to analysis with concrete evaluation in the Python interpreter for external library calls. Next, this paper presents PoTo+, a static type inference for Python built on the points-to analysis. We evaluate PoTo+ and compare it to two state-of-the-art Python type inference techniques: (1) the static rule-based Pytype and (2) the deep-learning based DLInfer. Our results show that PoTo+ outperforms both Pytype and DLInfer on existing Python packages.2024-09-05T21:26:25Z39th European Conference on Object-Oriented Programming (ECOOP 2025)Ingkarat Rak-amnouykitAna MilanovaGuillaume BaudartMartin HirzelJulian Dolby10.4230/LIPIcs.ECOOP.2025.27http://arxiv.org/abs/2605.27124v1ProDebug: An Automated Debugging System for Prolog2026-05-26T14:57:50ZProlog is a well-known declarative programming language commonly used in introductory courses on logic and reasoning. However, many students find Prolog challenging because it lacks the familiar debugging mechanisms found in imperative languages. In large classes, this difficulty is exacerbated by the challenge of providing timely and personalized feedback to students.
In this work, we introduce ProDebug, the first tool to combine Large Language Models (LLMs) with spectrum-based and mutation-based techniques for automated debugging of Prolog assignments. ProDebug automatically identifies faults and proposes bug repairs for student Git submissions. Faults are detected using three approaches--spectrum-based, mutation-based, and LLM reasoning--while repairs are generated using mutation-based techniques and LLMs. Our evaluation on 1499 buggy student submissions from a bachelor's level programming class demonstrates the potential of automated, LLM-augmented feedback systems to scale support for declarative programming education.2026-05-26T14:57:50ZRicardo BrancasVasco ManquinhoRuben Martinshttp://arxiv.org/abs/2605.01032v3Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries2026-05-26T12:41:10ZWe present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three-axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance-preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability-indexed composition bundles programs with machine-checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property-based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.2026-05-01T18:59:01Z26 pages, 1 figure, 1 table. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated licenseAlan L. McCannhttp://arxiv.org/abs/2605.05248v4Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect2026-05-26T12:37:10ZAI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows,self-improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from code representation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that in governed intelligent systems, this transition is an authority amplification: it converts symbolic structure into executable authority and must be mediated like any other effect. We present governed metaprogramming, a language design where program representations (machine forms) are first-class values, form manipulation is pure computation, and materialization (the transition from form to executable machine) is a governed effect subject to structural inspection. The governance system analyzes the proposed program's capability requirements, policy compliance, and resource estimates before permitting execution. We formalize two judgments: pure form evaluation (which emits no directives) and governed materialization (which emits exactly one governed directive). We prove three properties: purity of form manipulation, the no-bypass theorem, and boundary preservation. We implement the design in mashinTalk, a DSL for AI workflows compiling to BEAM byte code, and report on integration with 454 existing machine-checked Rocq theorems. The central contribution is reclassifying eval from a language primitive into a governed effect.2026-05-05T02:56:43Z15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Update: Abstract typo fixes. Updated licenseAlan L. McCannhttp://arxiv.org/abs/2605.01037v3Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation2026-05-26T12:33:51ZWe present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary. A prior three-layer governance architecture proves governance completeness, provenance completeness, and the impossibility of ungoverned effects, conditional on the pure module constraint: that step executors cannot perform effects. That constraint was enforced by module import graph analysis, which is insufficient against adversarial bypass on the BEAM virtual machine. This paper closes the gap through four mechanisms: (1) a restricted WebAssembly compilation target where effect-producing instructions are structurally absent; (2) purity certificates, cryptographically signed proofs binding executor binaries to their import classifications; (3) a runtime verification gate that rejects uncertified executors before they enter the governance pipeline; and (4) portable governance credentials via remote attestation for cross-organizational verification. We prove four theorems: structural purity by construction, bypass elimination for all five BEAM bypass classes, certificate integrity, and gate completeness. The guarantee holds relative to an explicit Trusted Computing Base. Evaluation on four implemented executors shows verification latency of 39--42 us, full plan cycle under 400 us, runtime overhead under 0.4% of a 100 ms HTTP request, and zero determinism divergences across repeated invocations.2026-05-01T19:04:37Z23 pages, 4 figures, 8 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated licenseAlan L. McCannhttp://arxiv.org/abs/2605.01030v3Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries2026-05-26T12:30:21ZWe present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establishseven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non-trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content-level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance-only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.2026-05-01T18:52:47Z15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers. License updatedAlan L. McCannhttp://arxiv.org/abs/2601.18987v5LLMs versus the Halting Problem: Characterizing Program Termination Reasoning2026-05-26T08:18:34ZDetermining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem specific architectures, and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: To what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top ranked verification tools (with test time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly LLM reasoning on other undecidable problems.2026-01-26T21:44:12ZOren SultanJordi Armengol-EstapePascal KesseliJulien VanegueDafna ShahafYossi AdiPeter O'Hearnhttp://arxiv.org/abs/2605.08455v2CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging2026-05-26T07:39:54ZDebugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric by making the fixer M, corpus C, and protocol axes Aexplicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a minor stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.2026-05-08T20:24:32Z25 pages, 5 figuresShiyang LiHaoyang ChenMattia FazziniCaiwen Dinghttp://arxiv.org/abs/2605.26635v1Pacing Types for Asynchronous Stream Equations2026-05-26T07:14:48ZStream-based monitoring is a runtime verification approach where a monitor aggregates streams of input data from sensors and other sources to give real-time statistics and assessments of a system's health. One of the central challenges in designing reliable stream-based monitors is to deal with the asynchronous nature of data streams: in concrete applications, the different sensors being monitored produce values at different speeds, and it is the monitor's responsibility to correctly react to the asynchronous arrival of different streams of values. To ease this process, modern frameworks for stream-based monitoring such as RTLola enable users to finely specify data synchronization policies via a system of pacing annotations. While this feature simplifies the design of monitors, it can also lead users to write inconsistent policies, where synchronization between two streams is explicitly requested via annotations, but cannot always be achieved. To mitigate this issue, this paper presents pacing types, a novel type system implemented in RTLola to ensure that monitors for asynchronous streams are free of timing inconsistencies. We give a formal semantics to pacing annotations for a core fragment of RTLola, and present a soundness proof of the pacing type system. For an additional level of guarantees, we machine-checked the soundness proof using the Rocq proof assistant.2026-05-26T07:14:48ZFlorian KohnArthur CorrensonJan BaumeisterBernd Finkbeinerhttp://arxiv.org/abs/2605.26457v1Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization2026-05-26T02:12:48ZAI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym2026-05-26T02:12:48ZPreprintAnmol AgarwalNatalie NeamtuPranjal AggarwalSeungone KimJannis LimpergCedric FlamantKanna ShimizuBryan ParnoSean Welleckhttp://arxiv.org/abs/2508.20365v3Solvable Tuple Patterns and Their Applications to Program Verification2026-05-26T01:43:35ZDespite the recent progress of automated program verification techniques, fully automated verification of programs manipulating recursive data structures remains a challenge. We introduce solvable tuple patterns (STPs) and conjunctive STPs (CSTPs), novel formalisms for expressing and inferring invariants between list-like recursive data structures. A distinguishing feature of STPs is that they can be efficiently inferred from only a small number of positive samples; no negative samples are required. After presenting properties and inference algorithms of STPs and CSTPs, we show how to incorporate the CSTP inference into a CHC (Constrained Horn Clauses) solver supporting list-like data structures, which serves as a uniform backend for automated program verification tools. A CHC solver incorporating the (C)STP inference has won the ADT-LIN category of CHC-COMP 2025 by a significant margin.2025-08-28T02:30:02ZA short version was accepted for PLDI 2026Naoki KobayashiRyosuke SatoAyumi ShinoharaRyo Yoshinaka10.1145/3808258http://arxiv.org/abs/2605.26291v1Geo: A Query Rewrite Framework for Graph Pattern Mining2026-05-25T19:28:43ZGraph pattern mining is important for analyzing graph data. Graph mining systems typically require answering pattern matching queries, which involve solving the NP-complete subgraph isomorphism problem. To address this, domain experts often develop custom optimization strategies based on exploiting substructural similarities across different patterns. While these optimizers can be effective, their development is challenging, limiting the exploration of interactions between different optimization strategies and restricts experts from continuously improving the optimizers -- such as by incorporating additional custom or general pattern-based equivalences over time.
We present a programmable pattern matching query optimizer called Geo, which automatically manages the interactions between various equivalences, ensures the optimizations maintain correctness of results, and simplifies the management of substructure equivalences. Geo exposes a simple but flexible language for expressing pattern equivalences as rewrite rules. By maintaining canonical representations of generated patterns during equality saturation, Geo avoids issues arising from syntactic differences in isomorphic patterns. Additionally, we develop embedded reconstructablility (EmRec) that tracks provenance across equivalences to ensure various reconstructability needs of desired outputs. Our evaluation demonstrates that Geo can discover novel query equivalences through complex composition of various rewrite rules, enabling our optimized queries to achieve a cost reduction of up to 99% compared to the queries in prior work. We further test Geo's effectiveness at speeding up practical graph mining problems by using it in two representative case studies -- approximate pattern matching and quasi-clique mining, and find it is highly effective at optimizing these tasks, enabling cost reductions of up to 71%.2026-05-25T19:28:43ZOOPSLA 2026Nazanin YousefianKasra JamshidiKeval VoraAnders Miltnerhttp://arxiv.org/abs/2512.01678v5Morphling: Fast, Fused, and Flexible GNN Training at Scale2026-05-25T18:28:29ZGraph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. Morphling improves per-epoch training throughput by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.2025-12-01T13:45:03Z AnubhabRupesh Nasrehttp://arxiv.org/abs/2605.10913v2Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace2026-05-25T16:14:14ZAs LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does: coordinating agents, halting risky actions before execution, or repairing failed runs, requires manipulation of agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, requiring it to build it's own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, where any past state can be forked 5x faster than docker commit and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.2026-05-11T17:50:51Z50 pages, 22 figures, 14 tablesSimon YuDerek ChongAnanjan NandiDilara SoyluJiuding SunChristopher D ManningWeiyan Shihttp://arxiv.org/abs/2506.11027v3From Reasoning to Code: GRPO Optimization for Underrepresented Languages2026-05-25T11:34:07ZGenerating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages, such as Prolog and Lisp, due to the scarcity of public training data compared to high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To address the limitations of sparse datasets, we integrate execution-driven feedback directly into the RL loop, utilizing a reward system that exploits both logical correctness and structural formatting. Experimental results on GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach to benefit a wide range of programming languages lacking extensive training resources by leveraging symbolic reasoning and interpreter-based feedback.2025-05-20T11:28:48ZAccepted ICLP 2026Federico PenninoBianca RaimondiMassimo RondelliAndrea GurioliMaurizio Gabbrielli