https://arxiv.org/api/q2FSFzNU0PoCZ9wYVmOLsY9/CzY2026-06-14T21:32:32Z988549515http://arxiv.org/abs/2603.17613v1VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation2026-03-18T11:25:40ZLLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a \textit{Programmer Agent}, a \textit{Correctness Agent}, and a \textit{PPA Agent}, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an \textit{Evolved Memory Mechanism} that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.2026-03-18T11:25:40ZYaoxiang WangQi ShiShangZhan LiQingguo HuXinyu YinBo GuoXu HanMaosong SunJinsong Suhttp://arxiv.org/abs/2603.17208v1SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization2026-03-17T23:15:24ZRegister-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.2026-03-17T23:15:24ZEACL 2026Prashanth VijayaraghavanApoorva NitsureLuyao ShiCharles MackinAshutosh JadhavDavid BeymerEhsan DeganVandana Mukherjeehttp://arxiv.org/abs/2603.17204v1CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization2026-03-17T23:10:07ZOptimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.2026-03-17T23:10:07ZEACL 2026Che-Ming ChangPrashanth VijayaraghavanAshutosh JadhavCharles MackinVandana MukherjeeHsinyu TsaiEhsan Deganhttp://arxiv.org/abs/2603.17170v1PAuth - Precise Task-Scoped Authorization For Agents2026-03-17T22:05:03ZThe emerging agentic web envisions AI agents that reliably fulfill users' natural-language (NL)-based tasks by interacting with existing web services. However, existing authorization models are misaligned with this vision. In particular, today's operator-scoped authorization, exemplified by OAuth, grants broad permissions tied to operators (e.g., the transfer operator) rather than to the specific operations (e.g., transfer $100 to Bob) implied by a user's task. This will inevitably result in overprivileged agents.
We introduce Precise Task-Scoped Implicit Authorization (PAuth), a fundamentally different model in which submitting an NL task implicitly authorizes only the concrete operations required for its faithful execution. To make this enforceable at servers, we propose NL slices: symbolic specifications of the calls each service expects, derived from the task and upstream results. Complementing this, we also propose envelopes: special data structure to bind each operand's concrete value to its symbolic provenance, enabling servers to verify that all operands arise from legitimate computations.
PAuth is prototyped in the agent-security evaluation framework AgentDojo. We evaluate it in both benign settings and attack scenarios where a spurious operation is injected into an otherwise normal task. In all benign tests, PAuth executes the tasks successfully without requiring any additional permissions. In all attack tests, PAuth correctly raises warnings about missing permissions. These results demonstrate that PAuth's reasoning about permissions is indeed precise. We further analyze the characteristics of these tasks and measure the associated token costs.2026-03-17T22:05:03ZReshabh K SharmaLinxi JiangZhiqiang LinShuo Chenhttp://arxiv.org/abs/2603.17150v1Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents2026-03-17T21:28:59ZAgentic AI systems can now generate code with remarkable fluency, but a fundamental question remains: \emph{does the generated code actually do what the user intended?} The gap between informal natural language requirements and precise program behavior -- the \emph{intent gap} -- has always plagued software engineering, but AI-generated code amplifies it to an unprecedented scale. This article argues that \textbf{intent formalization} -- the translation of informal user intent into a set of checkable formal specifications -- is the key challenge that will determine whether AI makes software more reliable or merely more abundant. Intent formalization offers a tradeoff spectrum suitable to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically. The central bottleneck is \emph{validating specifications}: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests. We survey early research that demonstrates the \emph{potential} of this approach: interactive test-driven formalization that improves program correctness, AI-generated postconditions that catch real-world bugs missed by prior methods, and end-to-end verified pipelines that produce provably correct code from informal specifications. We outline the open research challenges -- scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, designing human-AI specification interactions -- that define a research agenda spanning AI, programming languages, formal methods, and human-computer interaction.2026-03-17T21:28:59Z10 pagesShuvendu K. Lahirihttp://arxiv.org/abs/2603.18049v1Conditional Execution of Transpiler Passes Based on Per-Script Feature Detection2026-03-17T19:54:22ZAs the ECMAScript specification evolves, industrial-scale JavaScript compilers face the challenge of supporting modern language syntax while maintaining compatibility for diverse execution environments. Traditionally, compilers solve this by running transpilation passes in a monolithic pipeline, where the transpilation passes are chosen to execute strictly based on a target language level. This results in significant computational waste, as compilers perform expensive Abstract Syntax Tree (AST) traversals to lower features that may not exist in the actual input source code. We present a compiler improvement that conditionally executes transpiler passes based on accurately tracking and dynamically maintaining the exact set of language features present in the compilation unit throughout the transpilation process. It is implemented in the production Google Closure Compiler. By populating and maintaining a FeatureSet at every JavaScript script-level, it dynamically skips running unnecessary lowering passes. We detail the architectural safeguards -- including strategic pass ordering and dynamic validation of the transpiled code for feature-correctness. Evaluation of this improvement on large-scale production monorepos produced a considerable reduction in compilation time and saved compute and memory usage.2026-03-17T19:54:22ZPreprint. Under review at SOAP 2026. Implementation available in Google Closure CompilerRishipal Singh Bhatiahttp://arxiv.org/abs/2603.17099v1Vectorization of Verilog Designs and its Effects on Verification and Synthesis2026-03-17T19:39:56ZVectorization is a compiler optimization that replaces multiple operations on scalar values with a single operation on vector values. Although common in traditional compilers such as rustc, clang, and gcc, vectorization is not common in the Verilog ecosystem. This happens because, even though Verilog supports vector notation, the language provides no semantic guarantee that a vectorized signal behaves as a word-level entity: synthesis tools still resolve multiple individual assignments and a single vector assignment into the same set of parallel wire connections. However, vectorization brings important benefits in other domains. In particular, it reduces symbolic complexity even when the underlying hardware remains unchanged. Formal verification tools such as Cadence Jasper operates at the symbolic level: they reason about Boolean functions, state transitions, and equivalence classes, rather than about individual wires or gates. When these tools can treat a bus as a single symbolic entity, they scale more efficiently. This paper supports this observation by introducing a Verilog vectorizer. The vectorizer, built on top of the CIRCT compilation infrastructure, recognizes several vectorization patterns, including inverted assignments, assignments involving complex expressions, and inter-module assignments. It has been experimented with some Electronic design automation (EDA) tools, and for Jasper tool, it improves elaboration time by 28.12% and reduces memory consumption by 51.30% on 1,157 designs from the ChiBench collection.2026-03-17T19:39:56Z12 pages, 16 figures, 4 algorithms, 4 theoremsMaria Fernanda Oliveira GuimarãesUlisses RosaIan TrudelJoão Victor Amorim VieiraAugusto Amaral MafraMirlaine CrepaldeFernando Magno Quintão Pereirahttp://arxiv.org/abs/2603.16650v1FAlCon: A unified framework for algorithmic control of quantum dot devices2026-03-17T15:20:28ZAs spin-based quantum systems scale, their setup and control complexity increase sharply. In semiconductor quantum dot (QD) experiments, device-to-device variability, heterogeneous control-electronics stacks, and differing operational modalities make it difficult to reuse characterization, calibration, and control logic across laboratories. We present FAlCon, an open-source software ecosystem for portable, automated characterization and tuning measurement workflows. FAlCon provides (i) a lightweight domain-specific language for expressing state-based tuning logic in a hardware-agnostic form; (ii) specialized transmittable libraries of physics-informed QD data structures (``tuning vernacula''); and (iii) extensible libraries of shared measurement protocols enabling an interoperable lab-agnostic measurement stack. By separating algorithm intent from instrument realization, while preserving traceability and supporting typed scripting, FAlCon enables researchers and engineers to exchange, adapt, and deploy characterization and autotuning routines across heterogeneous QD setups. The framework supports all users, ranging from end users running prebuilt algorithms with custom initial conditions to developers extending instrumentation support and contributing new tuning strategies. Although the present release targets QD experiments, other qubit modalities and scientific experiments could reuse FAlCon's modular abstractions by providing new tuning data types and instrument control templates.2026-03-17T15:20:28Z19 pages, 3 figuresTyler J. KovachDaniel SchugZach D. MerinoMark FriesenMark A. ErikssonJustyna P. Zwolakhttp://arxiv.org/abs/2308.09481v5Types, equations, dimensions and the Pi theorem2026-03-17T11:58:04ZThe languages of mathematical physics and modelling are endowed with a rich ``grammar of dimensions'' that common abstractions of programming languages fail to represent. We propose a dependently typed domain-specific language (embedded in Idris) that captures this grammar. We apply it to formalize basic notions of dimensional analysis: those of dimension function, physical quantity, homomorphic measurement, the covariance principle and Buckingham's Pi theorem. We hope that the language makes mathematical physics more accessible to computer scientists and functional programming more palatable to modellers and physicists.2023-08-16T14:33:18ZNicola BottaPatrik Janssonhttp://arxiv.org/abs/2505.23135v3VERINA: Benchmarking Verifiable Code Generation2026-03-16T23:40:56ZLarge language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o3, achieves a 72.6\% code correctness rate, 52.3\% for specification soundness and completeness, and a mere 4.9\% proof success rate (based on one trial per task). We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.2025-05-29T06:12:52ZZhe YeZhengxu YanJingxuan HeTimothe KasrielKaiyu YangDawn Songhttp://arxiv.org/abs/2512.14805v2Sharing State Between Prompts and Programs2026-03-16T18:57:56ZThe rise of large language models (LLMs) has introduced a new type of programming: natural language programming. Users write prompts, which are instructions in natural language, to direct LLMs to perform tasks such as natural language processing, code generation, reasoning, etc.
An emerging area of research enables interoperability between prompts and programs. We present a novel programming abstraction, shared program state, that removes the manual work required to enable interoperability between prompts and program states. With shared program state, programmers can write prompts that directly access program variables, compute with program objects, and implement control flow in the program. We present a schema for specifying natural function interfaces that extend programming systems to support programs with prompts and leverage this schema to specify shared program state as a natural function interface.
We implement shared program state in the Nightjar programming system. Nightjar enables programmers to write Python programs containing prompts that share the Python program state. We show that Nightjar programs achieve comparable or higher task accuracy than manually written implementations (+4-19%), while decreasing the lines of code by 39.6% on average. The tradeoff is that Nightjar may incur runtime overhead (0.4-4.3x manual implementations).2025-12-16T18:41:50ZICLR 2026Ellie Y. ChengLogan WeberTian JinMichael Carbinhttp://arxiv.org/abs/2505.14744v2Beyond Either-Or Reasoning: Transduction and Induction as Cooperative Problem-Solving Paradigms2026-03-16T13:59:17ZTraditionally, in Programming-by-example (PBE) the goal is to synthesize a program from a small set of input-output examples. Lately, PBE has gained traction as a few-shot reasoning benchmark, relaxing the requirement to produce a program artifact altogether which allows transductive methods to directly the missing output sample. Transduction and induction are complementary reasoning modes--where induction derives general rules from examples, transduction leverages the examples directly to infer specific outputs without intermediate generalization. Yet existing approaches either treat them as mutually exclusive or couple them in hybrid structures where one paradigm dictates a fixed trajectory for the other -- undermining the latter's reasoning potential and creating cascading errors. We move away from these hierarchical models and introduce cooperative transductive-inductive problem solving: by interleaving both reasoning modes and ensuring neither unconditionally dominates the other, we preserve the search autonomy and reasoning capacity of each paradigm. We instantiate this concept in TIIPS. Across three PBE domains, TIIPS consistently outperforms state-of-the-art baselines and generates programs that more closely mirror ground-truth trajectories in both syntax and semantics, indicating a better match to the intended program behavior. Our findings highlight cooperative reasoning as a promising new direction for harnessing the full power of symbolic, inductive and neural, transductive reasoning.2025-05-20T08:23:46ZJanis ZenknerTobias SesterhennChristian Bartelthttp://arxiv.org/abs/2603.15096v1Generation of Programming Exam Question and Answer Using ChatGPT Based on Prompt Engineering2026-03-16T10:49:30ZIn computer science, students are encouraged to learn various programming languages such as Python, C++, and Java, equipping them with a broad range of technical skills and problem-solving capabilities. Nevertheless, the design of objective examination questions to assess students' creativity, problem-solving abilities, and domain knowledge remains a significant challenge. This paper proposes a methodology to address these challenges by leveraging prompt engineering techniques with ChatGPT. Prompt engineering is an efficient technique that optimizes the performance of language models, enabling the automatic generation of high-quality exam questions with varying types and difficulty levels, all without requiring additional fine-tuning of the model. This study applies diverse patterns and templates to generate exam questions that incorporate both theoretical and practical components, thereby facilitating a comprehensive evaluation of students' theoretical understanding and hands-on programming proficiency. A survey was conducted to validate the proposed method, and although certain areas indicated room for improvement, the overall results confirmed its significance and relevance. The generated questions and model answers exhibit quality comparable to, or even surpassing, manually crafted questions while significantly reducing the time and effort required for question preparation. This research demonstrates that automated exam question generation through prompt engineering enhances the quality and efficiency of assessment tools in education, establishing it as a valuable asset for future educational environments.2026-03-16T10:49:30ZJongwook SiSungyoung Kimhttp://arxiv.org/abs/2508.16125v2LPO: Discovering Missed Peephole Optimizations with Large Language Models2026-03-16T02:44:30ZPeephole optimization is an essential class of compiler optimizations that targets small, inefficient instruction sequences within programs. By replacing such suboptimal instructions with refined and more optimal sequences, these optimizations not only directly optimize code size and performance, but also enable more transformations in the subsequent optimization pipeline. Despite their importance, discovering new and effective peephole optimizations remains challenging due to the complexity and breadth of instruction sets. Prior approaches either lack scalability or have significant restrictions on the peephole optimizations that they can find. This paper introduces LPO, a novel automated framework to discover missed peephole optimizations. Our key insight is that, Large Language Models (LLMs) are effective at creative exploration but susceptible to hallucinations; conversely, formal verification techniques provide rigorous guarantees but struggle with creative discovery. By synergistically combining the strengths of LLMs and formal verifiers in a closed-loop feedback mechanism, LPO can effectively discover verified peephole optimizations that were previously missed. We comprehensively evaluated LPO within LLVM ecosystems. Our evaluation shows that LPO can successfully identify up to 22 out of 25 previously reported missed optimizations in LLVM. In contrast, the recently proposed superoptimizers for LLVM, Souper and Minotaur detected 15 and 3 of them, respectively. More importantly, within eleven months of development and intermittent testing, LPO found 62 missed peephole optimizations, of which 28 were confirmed and an additional 13 had already been fixed in LLVM. These results demonstrate LPO's strong potential to continuously uncover new optimizations as LLMs' reasoning improves.2025-08-22T06:36:42ZAccepted at ASPLOS 2026Proc. 31st ACM Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2026)Zhenyang XuHongxu XuYongqiang TianXintong ZhouChengnian Sun10.1145/3779212.3790184http://arxiv.org/abs/2603.14538v1Reversible Lifetime Semantics for Quantum Programs2026-03-15T18:25:59ZReversible computation requires that intermediate data be explicitly undone rather than discarded. In quantum programming, this principle appears as uncomputation, usually treated as a technical cleanup mechanism. We instead present uncomputation as a semantic foundation. In the Qutes language, we introduce a formal model of \emph{Scope-Bounded Liveness-Guided Uncomputation}, where lexical scope bounds variable lifetime and static liveness and entanglement analysis determine the earliest safe reclamation point. We define semantic lifetime and a Restoration Invariant ensuring that temporary quantum information disappears once it becomes semantically irrelevant. We prove compositional correctness under nested scopes and show that early reclamation can reduce circuit depth by avoiding critical-path overhead and can bound peak live qubits through disciplined ancilla reuse. Finally, we show that parameter passing semantics emerges from the same lifetime discipline, with pass-by-value and pass-by-reference corresponding to different lifetime boundaries, and we characterize the constraints (irreversibility, persistent entanglement, and aliasing) under which automatic uncomputation must be restricted.2026-03-15T18:25:59ZSimone FaroFrancesco Pio MarinoGabriele Messina