https://arxiv.org/api/XKn0+Qn5gPRTDVwvrZmkiA4WiMk 2026-06-14T09:24:10Z 9885 315 15 http://arxiv.org/abs/2510.08726v2 Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs 2026-04-20T02:38:04Z

Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. This paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. Applying Neptune's advanced operator fusion to a plain attention operator generates operators equivalent to FlashAttention and FlashDecoding. On ten attention-based benchmarks, Neptune, starting from a plain attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have an average speedup of $1.35\times$ over the next best alternative, with up to $2.65\times$ speedup on Nvidia GPUs and up to $3.32\times$ on AMD GPUs, demonstrating its effectiveness for deep learning workloads.

2025-10-09T18:33:52Z Yifan Zhao Egan Johnson Prasanth Chatarasi Vikram Adve Sasa Misailovic http://arxiv.org/abs/2604.14550v2 VeriGraphi: A Multi-Agent Framework of Hierarchical RTL Generation for Large Hardware Designs 2026-04-19T19:11:10Z

Generating synthesizable Verilog for large, hierarchical hardware designs remains a significant challenge for large language models (LLMs), which struggle to replicate the structured reasoning that human experts employ when translating complex specifications into RTL. When tasked with producing hierarchical Verilog, LLMs frequently lose context across modules, hallucinate interfaces, fabricate inter-module wiring, and fail to maintain structural coherence - failures that intensify as design complexity grows and specifications involve informal prose, figures, and tables that resist direct operationalization. To address these challenges, we present VeriGraphi, a framework that introduces a spec-anchored Knowledge Graph as the architectural substrate driving the RTL generation pipeline. VeriGraphi constructs a HDA, a structured knowledge graph that explicitly encodes module hierarchy, port-level interfaces, wiring semantics, and inter-module dependencies as first-class graph entities and relations. Built through iterative multi-agent analysis of the specification, this Knowledge Graph provides a deterministic, machine-checkable structural scaffold before code generation. Guided by the KG, a progressive coding module incrementally generates pseudo-code and synthesizable RTL while enforcing interface consistency and dependency correctness at each submodule stage. We evaluate VeriGraphi on a benchmark of three representative specification documents from the National Institute of Standards and Technology and their corresponding implementations, and we present a RV32I processor as a detailed case study to illustrate the full pipeline. The results demonstrate that VeriGraphi enables reliable hierarchical RTL generation with minimal human intervention for RISC-V, marking a significant milestone for LLM-generated hardware design while maintaining strong functional correctness.

2026-04-16T02:30:57Z 9 pages, 2 figures, Case studies, v2 Sazzadul Islam Tasnim Tabassum Hao Zheng http://arxiv.org/abs/2603.14628v2 s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs 2026-04-19T15:46:53Z

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In \textit{s2n-bignum-bench}, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, \textit{s2n-bignum-bench} is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: \href{https://github.com/kings-crown/s2n-bignum-bench}{s2n-bignum-bench}.

2026-03-15T21:55:59Z Accepted as a Workshop paper at AIPV 2026 Balaji Rao John Harrison Soonho Kong Juneyoung Lee Carlo Lipizzi http://arxiv.org/abs/2603.25414v3 Decidable By Construction: Design-Time Verification for Trustworthy AI 2026-04-19T15:17:01Z

A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff's universal prior, placing the framework's type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.

2026-03-26T13:09:36Z 21 pages, 1 figure Houston Haynes http://arxiv.org/abs/2603.17627v3 The Program Hypergraph: Multi-Way Relational Structure for Geometric Algebra, Spatial Compute, and Physics-Aware Compilation 2026-04-19T15:10:23Z

The Program Semantic Graph (PSG) introduced in prior work on Dimensional Type Systems and Deterministic Memory Management encodes compilation-relevant properties as binary edge relations between computation nodes. This representation is adequate for scalar and tensor computations, but becomes structurally insufficient for two classes of problems central to heterogeneous compute: tile co-location and routing constraints in spatial dataflow architectures, which are inherently multi-way; and geometric algebra computation, where graded multi-way products cannot be faithfully represented as sequences of binary operations without loss of algebraic identity. This paper introduces the Program Hypergraph (PHG) as a principled generalization of the PSG that promotes binary edges to hyperedges of arbitrary arity. We demonstrate that grade in Clifford algebra is a natural dimension axis within the existing DTS abelian group framework, that grade inference derives geometric product sparsity eliminating the primary performance objection to Clifford algebra neural networks without manual specialization, and that the k-simplex structure of mesh topology is a direct instance of the hyperedge formalism. We assess the existing geometric algebra library ecosystem, identify the consistent type-theoretic gap that no current system addresses, and show that the PHG closes it within the Fidelity compilation framework. The result is a compilation framework where geometric correctness, memory placement, numerical precision selection, and hardware partitioning are jointly derivable from a single graph structure exposed as interactive design-time feedback.

2026-03-18T11:47:15Z 31 pages, 1 figure, 2 tables Houston Haynes http://arxiv.org/abs/2604.17364v1 LLM-Guided Strategy Synthesis for Scalable Equality Saturation 2026-04-19T10:21:26Z

Equality saturation (EqSat) is a powerful optimization paradigm that compactly represents many equivalent programs in an e-graph and delays commitment until extraction selects a lowest-cost program. Making EqSat effective, therefore, requires not only domain-specific rewrite rules but also domain-specific strategies. Today, much of this strategy design is still manual, making it a major obstacle to automating e-graph-based compilers. Recent rule-synthesis frameworks can automatically infer large rewrite vocabularies from semantic specifications, but they also enlarge the rewrite space and further exacerbate e-graph explosion. Although large language models (LLMs) make automated strategy synthesis plausible, directly evolving backend code remains ineffective in practice. The search lacks reusable strategy abstractions and actionable feedback, and can easily trigger e-graph explosion or converge to poor designs. We present EggMind, an LLM-guided, end-to-end framework for synthesizing reusable EqSat strategies. At its core, EggMind introduces a domain-specific language, EqSatL, to represent EqSat strategies as explicit and inspectable artifacts. It then proposes an LLM-guided agentic workflow, equipped with novel techniques including proof-derived rewrite motif caching and tractability guidance, to search efficiently for high-quality strategies while keeping synthesis stable under e-graph growth. Evaluation shows that EggMind substantially improves the resource-quality trade-off on vectorization benchmarks, reducing final cost by 45.1% and peak RAM by 69.1% relative to full EqSat. We further show that the same methodology transfers effectively to an XLA-based tensor compiler, and demonstrate its practical potential in a logic-synthesis case study with augmented rewrite spaces.

2026-04-19T10:21:26Z Chenyun Yin Youwei Xiao Yuze Luo Yuyang Zou Yun Liang http://arxiv.org/abs/2604.17290v1 Probabilistic Programs of Thought 2026-04-19T07:09:12Z

LLMs are widely used for code generation and mathematical reasoning tasks where they are required to generate structured output. They either need to reason about code, generate code for a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and generate samples until an appropriate program is obtained. Within this process, sampling $n$ programs from the language model requires $n$ GPU compute-intensive generations which becomes prohibitively expensive for larger values of $n$. In this work, we address this limitation by exposing the LLM's distribution within the generated programs themselves. We propose a novel test-time framework we dub probabilistic programs of thought to obtain more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Since performing probabilistic reasoning in this probabilistic program is much cheaper, our approach allows sampling new programs without any additional GPU compute and little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding and mathematical reasoning and report improvements in performance with fewer generations from the LLM.

2026-04-19T07:09:12Z 26 pages Poorva Garg Renato Lui Geh Daniel Israel Todd Millstein Kyle Richardson Guy Van den Broeck http://arxiv.org/abs/2509.01082v3 RefineStat: Efficient Exploration for Probabilistic Program Synthesis 2026-04-19T01:13:14Z

Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models are tasked with generating probabilistic programs, they frequently produce outputs that suffer from both syntactic and semantic errors, such as flawed inference constructs. Motivated by probabilistic programmers' domain expertise and debugging strategies, we introduce RefineStat, a language model--driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail. We evaluate RefineStat on multiple probabilistic-programming code-generation tasks using smaller language models (SLMs) and find that it produces programs that are both syntactically sound and statistically reliable, often matching or surpassing those from closed-source large language models (e.g., OpenAI o3).

2025-09-01T03:13:36Z RefineStat constrains LM decoding with statistical validity checks and uses diagnostic-guided resampling (priors/likelihoods) to transform small LMs' drafts into correct, reliable probabilistic programs that can match or surpass closed-source models ICLR 2026 (Oral) Madhav Kanda Shubham Ugare Sasa Misailovic http://arxiv.org/abs/2505.04852v3 Raw Pointer Rewriting with LLMs for Translating C to Safer Rust 2026-04-18T19:10:00Z

There has been a growing interest in translating C code to Rust due to Rust's robust memory and thread safety guarantees. Tools such as C2RUST enable syntax-guided transpilation from C to semantically equivalent Rust code. However, the resulting Rust programs often rely heavily on unsafe constructs, particularly raw pointers, which undermines Rust's safety guarantees. This paper aims to improve the memory safety of Rust programs generated by C2RUST by eliminating raw pointers. Specifically, we propose a raw pointer rewriting technique that lifts raw pointers in individual functions to appropriate Rust data structures. Technically, PR2 employs decision-tree-based prompting to guide the pointer lifting process. It also leverages code change analysis to guide the repair of errors introduced during rewriting, effectively addressing errors encountered during compilation and test case execution. We implement PR2 and evaluate it using gpt-4o-mini on 28 real-world C projects. It is shown that PR2 successfully eliminates 18.57% of local raw pointers across these projects, significantly enhancing the safety of the translated Rust code. On average, PR2 completes the transformation of a project in 5.02 hours, at a cost of $1.13.

2025-05-07T23:30:27Z Yifei Gao Chengpeng Wang Pengxiang Huang Xuwei Liu Mingwei Zheng Xiangyu Zhang http://arxiv.org/abs/2510.06296v3 VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code 2026-04-18T14:30:50Z

Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.

2025-10-07T13:19:05Z Lingfei Zeng Fengdi Che Xuhan Huang Fei Ye Xu Xu Binhang Yuan Jie Fu http://arxiv.org/abs/2604.16986v1 Shift schema drift left: policy-aware compile-time contracts for typed JVM and Spark pipelines 2026-04-18T13:18:18Z

Schema drift in data pipelines is often caught only when a job touches real data. Typed-Dataset layers close part of this gap but require wholesale adoption; table-level enforcement systems close another part but operate at write time against a stored schema. We present a small Scala 3 framework that occupies the seam: it proves producer-to-contract structural compatibility under explicit policies at compile time, derives Spark schemas from the same contract types, and re-checks the actual DataFrame schema at the sink boundary before write. The artifact fuses the compile-time witness with a policy-aware runtime comparator that adds a nested-collection-optionality check Spark's built-in comparators omit and implements structural subset semantics for backward- and forward-compatible field sets. Evaluation covers compile-time proofs, runtime policy tests, builder-path end-to-end tests, and reproducible benchmarks on two environments. This is a narrow, honest mechanism artifact; the broader claim that compile-time structural contracts deliver measurable productivity or reliability in practice is stated as motivation and left for future work.

2026-04-18T13:18:18Z 7 pages, 2 figures, 1 table. Mechanism artifact paper with reproducible benchmarks. Code at https://github.com/vim89/compile-time-data-contracts Vittal Mirji http://arxiv.org/abs/2412.09869v4 A Practical Quantum Hoare Logic with Classical Variables, I 2026-04-18T07:13:19Z

In this paper, we present a Hoare-style logic for reasoning about quantum programs with classical variables. Our approach offers several improvements over previous work: (1) Enhanced expressivity of the programming language: Our logic applies to quantum programs with classical variables that incorporate quantum arrays and parameterised quantum gates, which have not been addressed in previous research on quantum Hoare logic, either with or without classical variables. (2) Intuitive correctness specifications: In our logic, preconditions and postconditions for quantum programs with classical variables are specified as a pair consisting of a classical first-order logical formula and a quantum predicate formula (possibly parameterised by classical variables). These specifications offer greater clarity and align more closely with the programmer's intuitive understanding of quantum and classical interactions. (3) Simplified proof system: By introducing a novel idea in formulating a proof rule for reasoning about quantum measurements, along with (2), we develop a proof system for quantum programs that requires only minimal modifications to classical Hoare logic. Furthermore, this proof system can be effectively and conveniently combined with classical first-order logic to verify quantum programs with classical variables. As a result, the learning curve for quantum program verification techniques is significantly reduced for those already familiar with classical program verification techniques, and existing tools for verifying classical programs can be more easily adapted for quantum program verification.

2024-12-13T05:28:19Z Information and Computation (2026) Mingsheng Ying 10.1016/j.ic.2026.105417 http://arxiv.org/abs/2604.16832v1 DALC-CT: Dynamic Analysis of Low-Level Code Traces for Constant-Time Verification 2026-04-18T04:47:54Z

Timing side-channel attacks exploit variations in program execution time to recover sensitive information. Cryptographic implementations are especially vulnerable to these attacks, since even small timing differences in operations such as modular exponentiation or key comparisons can be exploited to extract highly sensitive information, such as secret keys. To mitigate this threat, implementations of programs that handle sensitive information are often expected to adhere to constant-time principles, ensuring that execution behavior does not depend on secret inputs. However, validating the constant-time property of programs remains a major challenge in cryptography development. Formal method approaches to verify constant-time implementations rely on abstractions that often fail to capture real execution behavior, while timing-based measurement techniques are highly sensitive to noise from other programs and even hardware environments. In this work, we propose a novel approach for verifying constant-time programs based on dynamic analysis of low-level execution traces. Our method measures instruction sequences across multiple input values for any given binary and targeted function. Any variations in the instruction mix distribution for any given pair of traces indicate a deviation from the constant-time principle and behavior. We developed an open-source tool called DALC-CT, for the constant-time verification of programs using this approach. We evaluated it on a set of well-known constant-time and non-constant-time examples, achieving a perfect detection of issues. Our results demonstrate that analyzing the logical execution of programs via instruction trace comparisons provides a lightweight and reliable way to verify the constant-time property of programs.

2026-04-18T04:47:54Z 9 pages Nges Brian Njungle Edwin P. Kayang Mishel J. Paul Michel A. Kinsy http://arxiv.org/abs/2602.11092v2 MerLin: A Discovery Engine for Photonic and Hybrid Quantum Machine Learning 2026-04-17T23:13:30Z

Identifying where quantum models may offer practical benefits in near term quantum machine learning (QML) requires moving beyond isolated algorithmic proposals toward systematic and empirical exploration across models, datasets, and hardware constraints. We introduce MerLin, an open-source framework designed as a discovery engine for photonic and hybrid quantum machine learning. MerLin integrates optimized strong simulation of linear optical circuits into standard PyTorch and scikit learn workflows, enabling end-to-end differentiable training of quantum layers. MerLin is designed around systematic benchmarking and reproducibility. As an initial contribution, we reproduce eighteen state-of-the-art photonic and hybrid QML works spanning kernel methods, reservoir computing, convolutional and recurrent architectures, generative models, and modern training paradigms. These reproductions are released as reusable, modular experiments that can be directly extended and adapted, establishing a shared experimental baseline consistent with empirical benchmarking methodologies widely adopted in modern artificial intelligence. By embedding photonic quantum models within established machine learning ecosystems, MerLin allows practitioners to leverage existing tooling for ablation studies, cross-modality comparisons, and hybrid classical-quantum workflows. The framework already implements hardware-aware features, allowing tests on available quantum hardware while enabling exploration beyond its current capabilities, positioning MerLin as a forward-looking co-design tool linking algorithms, benchmarks, and hardware.

2026-02-11T18:00:01Z Accepted to IJCNN SS07 Quantum Machine Learning Algorithms and Applications, IEE WCCI 2026 IJCNN SS07 Quantum Machine Learning Algorithms and Applications, IEEE WCCI 2026 Cassandre Notton Benjamin Stott Philippe Schoeb Anthony Walsh Grégoire Leboucher Vincent Espitalier Vassilis Apostolou Louis-Félix Vigneux Alexia Salavrakos Jean Senellart http://arxiv.org/abs/2604.11021v2 Emulation-Completeness of Programming Languages 2026-04-17T16:18:15Z

We study when a programming language can emulate programs written in that same language without delegating the guest program back to the host evaluator or compiler. We call this property emulation-completeness. The central observation is that Turing-completeness by itself is not enough: a self-emulator must not only compute the guest program's result, but must also account for the guest-visible state on which realistic programs depend, including control flow, exceptions, callbacks, timing, memory usage, and runtime metadata such as stack traces or line numbers. This paper is a systematization paper. Its contribution is not a new emulator implementation, but a precise vocabulary and a structured taxonomy for reasoning about self-emulation. We distinguish source-level evaluation from compiled-code emulation, define syntactic and compiled-code emulation-completeness, and separate weak from strong emulation-completeness according to how much observable runtime behavior must be preserved. We then organize the requirements into two classes: language-side requirements, which determine whether the guest semantics can be represented explicitly inside the language, and emulator-side requirements, which determine whether the resulting emulator can faithfully mask or reproduce relevant observations. The discussion is grounded by concrete examples, including publicly documented details from Erlang, where argument limits, bitstring pattern matching, and message reception expose subtle mismatches between direct execution and self-emulation. The resulting framework is intended as guidance for language designers, implementers of evaluators and emulators, and researchers interested in secure sandboxing, decompilation, and reflective execution.

2026-04-13T05:34:39Z 13 pages, 7 tables Gregory Morse Tamás Kozsik