https://arxiv.org/api/5/L8Ra3W+xyTsmX8dqI/ZwLbeyM2026-06-13T13:14:10Z98854515http://arxiv.org/abs/2606.04813v1GraphAlg Playground: An Online Platform for Learning and Experimenting with the GraphAlg Language2026-06-03T12:37:43ZThe GraphAlg language for graph algorithms enables native support for user-defined graph analytics workloads in databases. In this demonstration, we present a web-based playground for writing and executing GraphAlg programs in the web browser, including an interactive tutorial explaining its key concepts. The playground runs inside the user's web browser without any installation, and is freely available under a permissive license as a reusable library. We present two demonstration scenarios of the publicly available playground website, showing how new users can learn to program in GraphAlg using the tutorial, while expert users can use the playground to prototype and validate their algorithms.2026-06-03T12:37:43ZAccepted at the VLDB 2026 Demonstration Track; to appear in PVLDB Vol. 19. 4 pages, 8 figures, 1 table. Artifacts: https://wildarch.dev/graphalgDaan de GraafRobert BrijderSoham ChakrabortyGeorge FletcherBram van de WallNikolay Yakovetshttp://arxiv.org/abs/2412.18134v5Learning Randomized Reductions2026-06-03T03:12:59ZRandomized self-reductions (RSRs) express $f(x)$ using $f$ evaluated at random correlated points, enabling self-correcting programs, instance-hiding protocols, and applications in complexity theory and cryptography. Yet discovering RSRs has required manual expert derivation for over 40 years, limiting their practical use. We present Bitween for automated RSR learning. First, we formalize RSR learning with sample complexity analysis under correlated sampling. Second, we develop Vanilla Bitween, which integrates multiple backends (linear regression, genetic programming, symbolic regression, and mixed-integer programming). The linear regression backend outperforms the others, discovering RSRs for 43 of 80 functions (54%) in RSR-Bench, our benchmark suite, including the first known reduction for sigmoid. Third, we introduce Agentic Bitween, a neuro-symbolic approach where LLM agents propose novel query functions beyond the fixed set ($x+r$, $x-r$, $x \cdot r$, $x$, $r$) in prior work. Agentic Bitween discovers RSRs for 64 of 80 functions (80%), outperforming pure neural baselines in both RSR discovery and verification accuracy.2024-12-24T03:42:53ZAccepted at ICML 2026 (Spotlight). 9 pages main text + appendixProceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026Ferhat ErataOrr ParadiseThanos TypaldosTimos AntonopoulosThanhVu NguyenShafi GoldwasserRuzica Piskachttp://arxiv.org/abs/2606.04352v1Negative and Fractional Types in the Fidelity Framework2026-06-03T02:09:11ZOur Native Type Universe (NTU) has been detailed through five previous papers establishing the substrate our framework's compilation pipeline targets across multiple hardware platforms. We have found in the course of that work a deeper reach this foundation makes available: negative and fractional types as native first-class constructs. James and Sabry established these dualities in 2012; Chen and Sabry later developed their categorical interpretation in compact closed categories. These dualities have practical benefit for compute modalities in our Fidelity Framework where extant general purpose compute framings lack the substrate to host them as native constructs. We see practicality with these type forms in preserving decidability and principal types through the abelian-group algebraic pattern Kennedy's dimensional types establish. The resulting isomorphisms would admit new, concise forms of resolution within our novel lowering strategy, and we sketch a notional Clef language syntax that would admit rational dimensional exponents into our algebra. We trace the implications across several problem spaces these type forms would open to our compilation and verification disciplines: Bayesian inference where fractional types would express conditioning obligations, quantum computation (and simulations) where negative types would provide the type-level adjoint, and finally adiabatic computation where the combined discipline would express Hamiltonian deformation as a reversible constraint-propagation process. The inherent structure of our NTU together with the supporting framework appears well-suited to problem spaces that current software ecosystems do not directly address, while keeping approachable development ergonomics and mature tooling aligned with operational guarantees the framework aspires to provide.2026-06-03T02:09:11Z23 pages, 1 figure, sample code appendixHouston Hayneshttp://arxiv.org/abs/2606.04311v1Formal verification of the S-two AIR2026-06-03T00:36:41ZStarkWare's S-two prover provides an efficient means for establishing, on blockchain, that a program written in the Cairo virtual machine language runs to completion. The latter claim is encoded by an algebraic intermediate representation (AIR) that captures the semantics of the Cairo language. The AIR asserts the existence of tables of values from a finite field satisfying certain algebraic constraints. A cryptographic interactive proof system, circle STARK, provides an efficiently-checked certificate that the AIR is satisfied. We describe our verification, using the Lean 4 proof assistant, that the AIR encoding is sound, which is to say, the satisfiability of the AIR implies the computational claim.2026-06-03T00:36:41ZJeremy AvigadAnat GanorLior GoldbergDavid LevitOhad NirYoav SeginerAlon Titelmanhttp://arxiv.org/abs/2606.05228v1Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating2026-06-02T18:48:38ZLarge language models (LLMs) demonstrate increasing proficiency on competitive programming benchmarks, yet technical reports predominantly publish aggregate pass rates, obscuring domain-specific vulnerabilities. We present a systematic empirical study of LLM failure patterns using a balanced taxonomy of 315 Codeforces problems across seven algorithm categories and three difficulty tiers. We evaluate GPT-4o and Claude Sonnet 4.6 under strict execution-based conditions, controlling for temperature (T = 0.2). To isolate the impact of reasoning frameworks on algorithmic correctness, we conduct an ablation study comparing direct zero-shot generation against zero-shot Chain-of-Thought (CoT). Our findings reveal a severe divergence from standard NLP benchmarks: forcing CoT aggressively penalizes GPT-4o, dropping its pass rate from 46.0% to 36.8% and exacerbating a critical weakness in Greedy logic. Conversely, while Claude maintains a higher logical baseline (63.5% under CoT), the expanded text generation severely degrades its markdown instruction adherence, causing its Compile Errors to more than triple (from 9 to 31, a 244% increase). Furthermore, failure-mode analysis indicates that Wrong Answer (WA) is the dominant verdict for both models--accounting for over 90% of GPT-4o's and roughly 70% of Claude's unaccepted solutions. These findings empirically demonstrate that standard prompt engineering techniques fail to bridge the algorithmic reasoning gap in competitive programming environments.2026-06-02T18:48:38Z12 pages, 4 figuresAyush Kumar JhaShalini Jhahttp://arxiv.org/abs/2511.13663v2SAIL: Sound Abstract Interpreters with LLMs2026-06-02T16:19:57ZHow to construct globally sound abstract interpreters to safely approximate program behaviors remains a bottleneck in abstract interpretation. In this paper, we show the potential of using state-of-the-art LLMs to automate this tedious process. Focusing on the neural network verification area, we synthesize non-trivial sound abstract transformers across diverse abstract domains using LLMs to search within infinite space from scratch. We formalize the synthesis task as a constrained optimization problem, for which we design a novel mathematically grounded cost function that measures the degree of unsoundness of each generated candidate transformer, while enforcing hard syntactic and semantic validity constraints. Building on this formulation, we introduce SAIL, a novel unified framework that combines model generation, syntactic and semantic validation, and cost-function-based refinement to synthesize globally sound abstract transformers. Evaluation results show that SAIL not only matches the performance of manually designed transformers, but also is able to synthesize sound and high-precision transformers that do not exist in the literature for complex non-linear operators.2025-11-17T18:16:36Z43 pages, 21 figuresProc. ACM Program. Lang. 10, PLDI, Article 230, 26 pages (2026)Qiuhan GuAvaljot SinghGagandeep Singh10.1145/3808308http://arxiv.org/abs/2606.01794v2Tridirectional Discriminating-Power Formal Verification of Smart Contract Reentrancy Defense Against Production-Deployed Solidity Source2026-06-02T14:01:42ZWe present the first machine-checked correctness proof of the OpenZeppelin reentrancy-guard pattern against a Lean 4 state-machine model of production-deployed Solidity source. All thirteen theorems are machine-checked with zero sorry, zero user-introduced axioms, and an axiom footprint bounded by [propext] (a standard mathlib4 axiom), gated under continuous integration.
Smart contract reentrancy has caused over US$500M in documented losses since 2016, with the DAO 2016 attack draining ~3.6M ETH and forcing the hard fork that split Ethereum. The OpenZeppelin ReentrancyGuard pattern is the de facto defense across production DeFi, yet no prior work has established its discriminating power: that the guard blocks attacks on vulnerable instances, preserves correct execution for non-attacking transactions, and distinguishes adjacent safe and vulnerable variants. Prior efforts formalized either guard correctness on toy contracts or attack feasibility on isolated instances - not both directions plus boundary cases against production source.
We verify three production instantiations - DAO 2016, Compound v2, and Aave V3 flashLoan - plus a minimal-diff mutant of Aave V3's flashLoan (flashLoanVulnerable) isolating one security-critical difference, via mutation testing. The tridirectional structure pairs (a) attack reproduction of the DAO 2016 pattern, (b) a correctness proof for Compound v2, and (c) a boundary-case proof distinguishing Aave V3's CEI-correct flashLoan from the mutant. A capstone meta-theorem composes the three under a no-retrofit discipline, demonstrated at the first cross-protocol stress test (Compound v2 to Aave V3); broader-family portability is future work.
Full Lean 4 source, CI config and reproduction commands are at https://github.com/rayiskander2406/qanary-contracts, reproducible at v1.6-phase7-closure (substrate: v1.3-layer6-closure).2026-06-01T07:13:39ZPreprint v2: added Zenodo concept DOI (10.5281/zenodo.20510920) for the archival artifact snapshot in the reproduction-commands appendix. No changes to the methodology, results, theorems, or references. Community feedback welcome; v3 with feedback planned before venue submissionRay Iskanderhttp://arxiv.org/abs/2408.14345v4Guard Analysis and Safe Erasure Gradual Typing: a Type System for Elixir2026-06-02T13:24:13ZWe formalize a new type system for Elixir, a dynamically typed functional programming language of growing popularity that runs on the Erlang virtual machine. Our system combines gradual typing with semantic subtyping to enable precise, sound, and practical static type analysis, without requiring any changes to Elixir's compilation pipeline or runtime. Type soundness is ensured by leveraging runtime checks -- both implicit, from the Erlang VM, and explicit, via developer-written guards.
Central to our approach are two key innovations: the notion of "strong functions", which can be assigned precise types even when applied to inputs that may fall outside their intended domain; and a fine-grained analysis of guards that enables accurate type refinement for case expressions and guarded function definitions. While type information is erased before execution and not used by the compiler, our "safe erasure" gradual typing strategy maintains soundness and expressiveness without compromising compatibility or performance. This work lays the theoretical foundation for Elixir's new type system, outlines its integration into recent versions of the language, and demonstrates its effectiveness on large-scale industrial codebases.2024-08-26T15:20:09ZGiuseppe CastagnaGuillaume Dubochttp://arxiv.org/abs/2606.04056v1Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study2026-06-02T10:46:57ZLLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.2026-06-02T10:46:57Z26 pages. Artifact (catalog CSV, Rust crate, formal proofs): https://github.com/sajjadanwar0/token-budgetsSajjad Khanhttp://arxiv.org/abs/2507.15017v4Polynomial Invariant Generation for Floating-Point Programs2026-06-02T06:54:48ZIn numeric-intensive computations, it is well known that the execution of floating-point programs is imprecise as floating-point arithmetic incurs round-off errors. Although round-off errors are small for a single floating-point operation, the aggregation of such errors may be dramatic and cause catastrophic program failures. Therefore, to ensure the correctness of floating-point programs, round-off error needs to be carefully taken into account. In this work, we consider polynomial invariant generation for floating-point programs, aiming at generating tight invariants under the perturbation of round-off errors. Our contribution is a novel framework for applying polynomial constraint solving to address the invariant generation problem, which is also the first polynomial constraint solving based approach that handles floating-point errors to our best knowledge.
In our framework, we propose a novel combination of round-off error analysis and polynomial constraint solving, aiming to circumvent the cost of handling a large number of error variables in the floating-point model. Experimental results over a variety of challenging benchmarks show that our framework outperforms SOTA approaches in both time efficiency and the precision of generated invariants.2025-07-20T16:02:17ZXuran CaiLiqian ChenHongfei Fuhttp://arxiv.org/abs/2604.15713v3Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints2026-06-02T05:13:23ZType annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the problem of complete and minimal type annotations for rank-one polymorphic $λ$-calculus terms, as used in Isabelle. Building on prior work by Smolka, Blanchette et al., we give a metatheoretical account of the problem, with a full formal specification and proofs, and formalize it in Isabelle/HOL. Our development is a series of experiments featuring human-driven and AI-driven formalization workflows: a human and an LLM-powered AI agent independently produce pen-and-paper proofs, and the AI agent autoformalizes both in Isabelle, with further human-hinted AI interventions refining and generalizing the development.2026-04-17T05:34:49ZKevin KappelmannMaximilian SchäffelerLukas StevensMohammad AbdulazizAndrei PopescuDmitriy Traytelhttp://arxiv.org/abs/2606.04034v1Climbing Up the Semantic Tower -- at Runtime2026-06-01T21:32:40ZSoftware exists at multiple levels of abstraction, where each more concrete level is an implementation of the more abstract level above, in a semantic tower of compilers and/or interpreters. First-class implementations are a reflection protocol to navigate this tower *at runtime*: they enable changing the underlying implementation of a computation *while it is running*. Key is a generalized notion of *safe points* that enable observing a computation at a higher-level than that at which it runs, and therefore to climb up the semantic tower, when at runtime most existing systems only ever allow but to go further down. The protocol was obtained by extracting the computational content of a formal specification for implementations and some of their properties. This approach reconciles two heretofore mutually exclusive fields: Semantics and Runtime Reflection.2026-06-01T21:32:40ZPresented at Off the Beaten Track Workshop (OBT) at POPL 2018 in Los AngelesFrançois-René Rideauhttp://arxiv.org/abs/2606.02854v1Fixed-Point Scaffolding in the Clef Programming Language2026-06-01T20:19:10ZFor fans of Gabriel's "Worse is Better" it may be ironic that C++, by way of MLIR, serves as the scaffold for compiling an ML-family language whose correctness properties are structural. A crucial intersection in our Composer compiler initiates its lowering with a fixed-point combinator that preserves the dimensional, grade, escape, and numeric-representation structure from the Program Semantic Graph. And the MLIR that's witnessed from the PSG is no passive host. Its use of static single assignment, attribute system and dialects carry that structure materially. We show that our compiler middle end uses categorical construction for lowering code with companion verification to that strata: a functor from the compilation poset to a target category, subject to the compositionality equation. The grounding of our approach comes from three sources, each on its own algebraic object: Ohori's machine-code proof theory grounds the compilation axis, parametricity grounds the content at the base, and adjoint mode logic grounds the traversal between our verification tiers. To extend the thesis we introduce compact-closed negative and fractional types, and show the type machinery can be carried with preserved structure and realized through tooling MLIR provides. More broadly, the same fixed-point primitive that preserves types through compilation also supplies proof terms that can continue to be exercised in MLIR to verify its integrity as lowering proceeds through the pipeline. We argue that this foundation is a unique additional point anticipated by our framework that includes dimensional types, Tarau's groupoid, and cellular sheaves. Throughout, the formalism is instrumented as an internal scaffold: the abstractions support the compiler's mechanics, where a developer is never required to reach for category theory in order to rely on the guarantees the compiler provides.2026-06-01T20:19:10Z16 pages, 2 figuresHouston Hayneshttp://arxiv.org/abs/2606.02394v1From Time to Space: The Impact of Linearity in Higher-Order Datalog2026-06-01T15:43:37ZWe consider a fragment of Higher-Order Datalog with negation and argue that it generalizes the familiar and important fragment of Linear Datalog. We investigate the expressive power of this fragment, establishing a tight connection with the hierarchy of space complexity classes. In particular, we demonstrate that for all $k \ge 1$, the $(k+1)$-order fragment of Stratified Linear Higher-Order Datalog$^\neg$ captures $(k-1)$-EXPSPACE. This result suggests that restricting programs to linear recursion shifts the expressive power of the corresponding fragments from time to space, generalizing the classical result that (Stratified) Linear Datalog captures NL. Unlike the first-order setting where an ordering assumption is required to capture NL, our results hold without any such assumption on the input database. The proof relies on simulating space-bounded Turing machines using Stratified Linear Higher-Order Datalog$^\neg$ programs and providing a space-efficient evaluation of the query program. We argue that identifying such computationally well-behaved fragments is a crucial step towards paving the way for practical implementations of Higher-Order Datalog.2026-06-01T15:43:37ZAngelos CharalambidisBabis KostopoulosPanos Rondogiannishttp://arxiv.org/abs/2606.01974v1Toka: A Systems Programming Language with Explicit Resource Semantics2026-06-01T09:35:41ZSystems programming languages traditionally struggle with the tension between physical transparency and compile-time memory safety. C++ provides direct, zero-cost hardware access but lacks strict safety boundaries, whereas Rust guarantees safety at the cost of complex lifetime annotations and implicit dereferencing chains.
In this paper, we present Toka, a native systems programming language that establishes physical transparency in resource management via Explicit Resource Semantics. At the core of Toka's design is the Handle-Soul Duality (informally referred to as the Hat-Soul model), which cleanly dissociates pointer identities (Handles) from their underlying values (Souls) at the syntactic level. By enforcing that bare identifiers always represent values (Souls) and explicit sigils represent pointer handles, Toka eliminates the semantic ambiguity between rebind operations and value mutations. We detail Toka's resource morphology (supporting unique, shared, borrowed, and raw semantics), its lifetime checking mechanism, and its implementation of a prototype compiler. Our evaluation demonstrates that Toka achieves competitive runtime performance and minimal binary size while drastically reducing the cognitive overhead of lifetime annotations.2026-06-01T09:35:41Z12 pages. A Design and Implementation Experience Report on Explicit Resource SemanticsZhonghua Yi