Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

2026-04-27T03:24:25Z

We present Arch (AI-native Register-transfer Clocked Hardware), a hardware description language for micro-architecture specification and AI-assisted code generation. Arch provides first-class constructs for pipelines, FSMs, FIFOs, arbiters, register files, buses with handshake channels, clock-domain crossings, and multi-cycle threads -- structures that existing HDLs express only as user-defined patterns prone to subtle errors. A central design choice is that clocks and resets are parameterized types (Clock, Reset) rather than ordinary nets, converting CDC and reset-domain analysis from external linter passes into compile-time typing rules. Bit widths, port directions, single-driver ownership, and combinational acyclicity are tracked in the same pass, catching latches, width mismatches, loops, and unsynchronized crossings before simulation. A guard clause on reg declarations captures the valid-data pattern declaratively, catching the producer bug where a valid flag asserts before data is written. Every syntactic choice is governed by an AI-generatability contract: an LL(1) grammar, no preprocessor, a uniform declaration schema, named block endings, and a todo! escape hatch let LLMs produce structurally correct, type-safe Arch from natural-language specs without fine-tuning. The compiler emits lint-clean IEEE 1800-2017 SystemVerilog and auto-generates safety properties (FIFO no-overflow, counter range, FSM legal-state, handshake protocol) verified with Verilator -- assert and EBMC, plus direct AST-to-SMT-LIB2 bounded model checking via arch formal. An integrated simulator compiles designs to native C++ with Python cocotb support. Case studies: L1 cache and AXI DMA (Yosys/OpenSTA, Sky130); 428/431 tests pass on VerilogEval and CVDP.

SWE-QA: Can Language Models Answer Repository-level Code Questions?

2026-04-26T17:37:00Z

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.

Compile-Time Tensor Shape Checking via Staged Shape-Dependent Types

2026-04-26T17:06:38Z

When writing programs involving matrices or tensors in general, it is desirable to rule out the inconsistency of tensor shapes (i.e., the generalization of matrix sizes) before actual computation. For this purpose, some languages provide dependent types such as Mat m n, and others offer refinement types to track predicates for shapes. Despite the theoretical maturity, however, such methods are often unhandy for continuous software development due to the requirement of proofs for judging type equality or subtyping; even automated proving is often unsuitable due to its unforeseeable time consumption. To remedy this, our study provides an alternative formalization by using staging. Based on the observation that conditions for the shape consistency can be extracted before running the actual tensor computations in many typical cases, we ensure such consistency by assertions evaluated as compile-time computations, not by proofs. Under this formalization, we can verify the consistency virtually statically in the sense that inconsistencies as to tensor shapes will be immediately detected as assertion failures during compile-time computation. Our work achieves a mathematical guarantee that successfully generated code is always consistent with respect to tensor shapes. Furthermore, to vastly lessen the burden of adding shape- or stage-related descriptions, we (1) allow shape-related arguments to be implicit and infer them in a best-effort manner, and (2) offer a non-staged surface language that seemingly resembles ordinary dependently-typed languages and translate its programs into the staged core language. By a prototype implementation, we confirm that our language is expressive enough to verify a number of programs, including several examples offered by ocaml-torch.

Compiling OpenSCENARIO 2.1 for Scenario-Based Testing in CARLA

2026-04-26T15:48:31Z

While the ASAM OpenSCENARIO 2.1 Domain-Specific Language (DSL) enables declarative, intent-driven authoring for Scenario-Based Testing (SBT), its integration into open-source simulators like CARLA remains limited by legacy parsers. We propose a multi-pass modern compiler architecture that translates the OpenSCENARIO 2.1 DSL directly into executable CARLA behaviors. The pipeline features an ANTLR4 frontend for Abstract Syntax Tree (AST) generation, a semantic middle-end, and a runtime backend that synthesizes deterministic py_trees behavior trees. Mapping the standardized domain ontology directly to CARLA's procedural API via a custom method registry eliminates the need for external logic solvers. A demonstrative multi-actor cut-in and evasive maneuver, selected from a wider suite of validated scenarios, confirms the compiler's ability to process concurrent actions, dynamic mathematical expressions, and asynchronous signaling. This framework establishes a functional baseline for reproducible, large-scale SBT, paving the way for future C++ optimizations to mitigate current Python-based computational overhead.

Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation

2026-04-25T21:50:26Z

LLM-driven web agents operating through continuous inference loops -- repeatedly querying a model to evaluate browser state and select actions -- exhibit a fundamental scalability constraint for repetitive tasks. We characterize this as the Rerun Crisis: the linear growth of token expenditure and API latency relative to execution frequency. For a 5-step workflow over 500 iterations, a continuous agent incurs approximately 150.00 USD in inference costs; even with aggressive caching, this remains near 15.00 USD. We propose a Compile-and-Execute architecture that decouples LLM reasoning from browser execution, reducing per-workflow inference cost to under 0.10 USD. A one-shot LLM invocation processes a token-efficient semantic representation from a DOM Sanitization Module (DSM) and emits a deterministic JSON workflow blueprint. A lightweight runtime then drives the browser without further model queries. We formalize this cost reduction from O(M x N) to amortized O(1) inference scaling, where M is the number of reruns and N is the sequential actions. Empirical evaluation across data extraction, form filling, and fingerprinting tasks yields zero-shot compilation success rates of 80-94%. Crucially, the modularity of the JSON intermediate representation allows minimal Human-in-the-Loop (HITL) patching to elevate execution reliability to near-100%. At per-compilation costs between 0.002 USD and 0.092 USD across five frontier models, these results establish deterministic compilation as a paradigm enabling economically viable automation at scales previously infeasible under continuous architectures.

Evolving Abstract Transformers for Gradient-Guided, Adaptable Abstract Interpretation

2026-04-25T09:05:03Z

Current numerical abstract interpretation relies on fixed, hand-crafted, instruction-specific transformers tailored to each domain, causing three key limitations: transformers cannot be reused across domains; precise compositional reasoning over instruction sequences is difficult; and all downstream tasks must use the same fixed transformer regardless of precision or efficiency needs.To address this, we propose the Evolving Abstract Transformer, which replaces the fixed single-output design with an adaptable search over a parametric space of sound outputs via two algorithms. First, the Universal Parametric Output Space Encoder (UPOSE) constructs a compact parametric space of sound outputs for any polyhedral numerical domain and any operator in the Quadratic-Bounded Guarded Operators (QGO) class, covering both individual instructions and structured sequences. Second, the Adaptive Gradient Guidance (AGG) algorithm leverages the differentiable structure of UPOSE's output space to efficiently search it according to downstream objectives and available runtime, continually evolving the output as more time is provided. We implement these ideas in the AbsEvolve framework and evaluate across three numerical abstract domains: Zones, Octagons, and Polyhedra. Results show the evolving transformer works across domains and instructions, enables efficient precision-efficiency tradeoffs by adjusting number of gradient steps in the search, and reaches the most precise invariants up to 3.2x faster than existing baselines.

Typing Fallback Functions: A Semantic Approach to Type Safe Smart Contracts

2026-04-25T08:49:02Z

This paper develops semantic typing in a smart-contract setting to ensure type safety of code that uses statically untypable language constructs, such as the fallback function. The idea is that the creator of a contract on the blockchain equips code containing such constructs with a formal proof of its type safety, given in terms of the semantics of types. Then, a user of the contract only needs to check the validity of the provided 'proof certificate' of type safety. This is a form of proof-carrying code, which naturally fits with the immutable nature of the blockchain environment. As a concrete application of our approach, we focus on ensuring information flow control and non-interference for TinySol, a distilled version of the Solidity language, through security types. We provide the semantics of types in terms of a typed operational semantics of TinySol and we express the proofs of safety as coinductively-defined typing interpretations, which can be represented compactly via up-to techniques, similar to those used for bisimilarity. We also show how our machinery can be used to type the typical pointer-to-implementation pattern based on the fallback function and to reject a distilled version of the infamous Parity Multisig Wallet Attack.

Annotating and Auditing the Safety Properties of Unsafe Rust

2026-04-25T03:57:56Z

In Rust, unsafe code is the sole source of potential undefined behaviors. To avoid misuse, Rust developers should clarify the safety properties for each unsafe API. However, the community currently lacks a key standard for safety documentation: existing safety comments in the source code and safety documentation can be ad hoc and incomplete. This paper presents a tag-centric methodology for auditing the consistency and completeness of safety documentation. We first derive a taxonomy of Safety Tags to formalize natural-language requirements. Second, because API soundness frequently relies on struct invariants, we propose a set of empirical rules to systematically audit the structural consistency of safety documentation. We implemented this methodology in safety-tool, a static linter that automatically enforces structural consistency between local safety annotations and callee requirements. Our approach was applied to the Rust standard library, fixing documentation issues on 27 APIs with 61 safety tags and identifying safety tags that are applicable to 96.1% of the public unsafe APIs in libstd. Furthermore, we have formalized the tagging idea through a Rust RFC to the wider community. We believe that the approach establishes a standardized practice of safety documentation and helps significantly reduce safety perils.

SEVerA: Verified Synthesis of Self-Evolving Agents

2026-04-24T19:01:59Z

Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($τ^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

2026-04-24T18:22:58Z

Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code

State Space Estimation for DPOR-based Model Checkers(Extended Version)

2026-04-24T13:21:57Z

We study the estimation problem for concurrent programs: given a bounded program $P$, estimate the number of Mazurkiewicz trace-equivalence classes induced by its interleavings. This quantity informs two practical questions for enumeration-based model checking: how long a model checking run is likely to take, and what fraction of the search space has been covered so far. We first show the counting problem is #P-hard even for restricted programs and, unless $P=NP$, inapproximable within any subexponential factor, ruling out efficient exact or randomized approximation algorithms. We give a Monte Carlo approach to obtain a poly-time unbiased estimator: we convert a stateless optimal DPOR algorithm into an unbiased estimator by viewing its exploration as a bounded-depth, bounded-width tree whose leaves are the maximal Mazurkiewicz traces. A classical estimator by Knuth, when run on this tree, yields an unbiased estimate. To control the variance, we apply stochastic enumeration by maintaining a small population of partial paths per depth whose evolution is coupled. We have implemented our estimator in the JMC model checker and evaluated it on shared-memory benchmarks. With modest budgets, our estimator yields stable estimates, typically within a 20% band, within a few hundred trials, even when the state space has $10^5$--$10^6$ classes. We also show how the same machinery estimates model-checking cost by weighting all explored graphs, not only complete traces. Our algorithms provide the first provable poly-time unbiased estimators for counting traces, a problem of considerable importance when allocating model checking resources.

Language-Integrated Recursive Queries (Full Version)

2026-04-24T10:59:05Z

Performance-critical industrial applications, including large-scale program, network, and distributed system analyses, rely on fixed-point computations. The introduction of recursive common table expressions (CTEs) using the WITH RECURSIVE keyword in SQL:1999 extended the ability of relational database systems to handle fixed-point computations, unlocking significant performance advantages by allowing computation to move closer to the data. Yet with recursion, SQL becomes a Turing-complete programming language and, with that, unrecoverable safety and correctness risks. SQL itself lacks a fixed semantics, as the SQL specification is written in natural language, full of ambiguities that database vendors resolve in divergent ways. As a result, reasoning about the correctness of recursive SQL programs must rely on isolated mathematical properties of queries rather than wrestling a unified formal model out of a language with notoriously inconsistent semantics. To address these challenges, we propose a calculus that automatically derives mathematical properties from embedded recursive queries and, depending on the database backend, rejects queries that may lead to the three classes of recursive query errors - database errors, incorrect results, and non-termination. We introduce TyQL, a practical implementation in Scala for safe, recursive language-integrated query. Using Named-Tuples and type-level pattern matching, TyQL ensures query portability and safety, showing no performance penalty compared to raw SQL strings while unlocking a three-orders-of-magnitude speedup over non-recursive SQL queries.

Ownership Refinement Types for Pointer Arithmetic and Nested Arrays

2026-04-24T08:53:09Z

Tanaka et al. proposed a type system for verifying functional correctness properties of programs that use arrays and pointer arithmetic. Their system extends ConSORT -- a type system combining fractional ownership and refinement types for imperative program verification -- with support for pointer arithmetic. Their idea was to extend fractional ownership so that it can depend on an array index. Their formulation, however, does not handle nested arrays, which are essential for representing practical data structures such as matrices. We extend Tanaka et al.'s type system to support nested arrays by generalizing the notion of ownership to be able to refer to the indices of the outer arrays and prove the soundness of the extended type system. We have implemented a verifier based on the proposed type system and demonstrated that it can verify the correctness of programs that manipulate nested arrays, which were beyond the reach of Tanaka et al.

BLAST: Benchmarking LLMs with ASP-based Structured Testing

2026-04-24T07:39:09Z

Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP), to date. In this paper we introduce BLAST: The first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.

QCP: A Practical Separation Logic-based C Program Verification Tool

2026-04-24T04:36:22Z

As software systems increase in size and complexity dramatically, ensuring their correctness, security, and reliability becomes an increasingly formidable challenge. Despite significant advancements in verification techniques and tools, their practical application to complex, real-world systems is often hindered by critical gaps in both automation and expressiveness. To address these difficulties, this paper presents \textbf{Qualified C Programming Verifier (QCP)}, a novel verification tool that integrates annotation-based automatic verification with interactive proving using Rocq. QCP employs symbolic execution and a separation logic entailment solver to automatically discharge many verification obligations, while deferring more complex obligations to Rocq for manual proof. Furthermore, QCP includes a VS Code extension designed to enhance proof efficiency and support a deeper understanding of both the program behavior and verification outcomes.