https://arxiv.org/api/3Ine/tDH4hg7LPVxIwsvcgpqpkw 2026-06-15T04:31:05Z 9888 585 15 http://arxiv.org/abs/2602.22240v1 From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation 2026-02-24T09:49:10Z

Large Language Models (LLM) show strong abilities in code generation, but their skill in creating efficient parallel programs is less studied. This paper explores how LLMs generate task-based parallel code from three kinds of input prompts: natural language problem descriptions, sequential reference implementations, and parallel pseudo code. We focus on three programming frameworks: OpenMP Tasking, C++ standard parallelism, and the asynchronous many-task runtime HPX. Each framework offers different levels of abstraction and control for task execution. We evaluate LLM-generated solutions for correctness and scalability. Our results reveal both strengths and weaknesses of LLMs with regard to problem complexity and framework. Finally, we discuss what these findings mean for future LLM-assisted development in high-performance and scientific computing.

2026-02-24T09:49:10Z 12 pages, 4 figures, 2 tables, Workshop on Asynchronous Many-Task Systems and Applications 2026 Linus Bantel Moritz Strack Alexander Strack Dirk Pflüger http://arxiv.org/abs/2602.20082v1 Machine-Generated, Machine-Checked Proofs for a Verified Compiler (Experience Report) 2026-02-23T17:48:24Z

We report on using an agentic coding assistant (Claude Code, powered by Claude Opus 4.6) to mechanize a substantial Rocq correctness proof from scratch, with human guidance but without human proof writing. The proof establishes semantic preservation for the administrative normal form (ANF) transformation in the CertiCoq verified compiler for Rocq. The closely related continuation-passing style (CPS) transformation in CertiCoq was previously proved correct by human experts over several months. We use this proof as a template and instruct the LLM to adapt the proof technique to the ANF setting, which differs in important technical ways. The resulting ANF proof comprises approximately 7,800 lines of Rocq (larger than the 5,300-line CPS proof) and was developed in approximately 96 hours. We describe the proof technique and report on the experience of developing it with an LLM, discussing both the strengths and limitations of the approach and its implications for verified compiler construction.

2026-02-23T17:48:24Z Zoe Paraskevopoulou http://arxiv.org/abs/2602.20064v1 The LLMbda Calculus: AI Agents, Conversations, and Information Flow 2026-02-23T17:22:35Z

A conversation with a large language model (LLM) is a sequence of prompts and responses, with each response generated from the preceding conversation. AI agents build such conversations automatically: given an initial human prompt, a planner loop interleaves LLM calls with tool invocations and code execution. This tight coupling creates a new and poorly understood attack surface. A malicious prompt injected into a conversation can compromise later reasoning, trigger dangerous tool calls, or distort final outputs. Despite the centrality of such systems, we currently lack a principled semantic foundation for reasoning about their behaviour and safety. We address this gap by introducing an untyped call-by-value lambda calculus enriched with dynamic information-flow control and a small number of primitives for constructing prompt-response conversations. Our language includes a primitive that invokes an LLM: it serializes a value, sends it to the model as a prompt, and parses the response as a new term. This calculus faithfully represents planner loops and their vulnerabilities, including the mechanisms by which prompt injection alters subsequent computation. The semantics explicitly captures conversations, and so supports reasoning about defenses such as quarantined sub-conversations, isolation of generated code, and information-flow restrictions on what may influence an LLM call. A termination-insensitive noninterference theorem establishes integrity and confidentiality guarantees, demonstrating that a formal calculus can provide rigorous foundations for safe agentic programming.

2026-02-23T17:22:35Z Zac Garby Andrew D. Gordon David Sands http://arxiv.org/abs/2602.21257v1 Structured Prompt Language: Declarative Context Management for LLMs 2026-02-23T17:03:31Z

We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic query optimizer, EXPLAIN transparency analogous to SQL's EXPLAIN ANALYZE, and native integration of retrieval-augmented generation (RAG) and persistent memory in a single declarative framework. SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script. Five extensions demonstrate the paradigm's breadth: (1) Text2SPL (multilingual NL->SPL translation); (2) Mixture-of-Models (MoM) routing that dispatches each PROMPT to a domain-specialist model at runtime; (3) Logical Chunking, an intelligent strategy for documents exceeding a single context window--expressed naturally through SPL's existing CTE syntax with no new constructs, decomposing a large query into a Map-Reduce pipeline that reduces attention cost from O(N^2) to O(N^2/k) and runs identically on cloud (parallel) or local hardware (sequential); (4) SPL-flow, a declarative agentic orchestration layer with resilient three-tier provider fallback; and (5) BENCHMARK for parallel multi-model comparison with automatic winner persistence. We provide a formal EBNF grammar, two pip-installable Python packages (spl-llm, spl-flow), and comparison against Prompty, DSPy, and LMQL. SPL reduces prompt boilerplate by 65% on average, surfaces a 68x cost spread across model tiers as a pre-execution signal, and runs the identical .spl script at $0.002 on OpenRouter or at zero marginal cost on a local Ollama instance--without modification.

2026-02-23T17:03:31Z 44 pages, 6 figures, 14 tables, 15 code-listings Wen G. Gong http://arxiv.org/abs/2602.19951v1 Taming Scope Extrusion in Gradual Imperative Metaprogramming 2026-02-23T15:19:41Z

Metaprogramming enables the generation of performant code, while gradual typing facilitates the smooth migration from untyped scripts to robust statically typed programs. However, combining these features with imperative state - specifically mutable references - reintroduces the classic peril of scope extrusion, where code fragments containing free variables escape their defining lexical context. While static type systems utilizing environment classifiers have successfully tamed this interaction, enforcing these invariants in a gradual language remains an open challenge. This paper presents $λ^{α,\star}_{\text{Ref}}$, the first gradual metaprogramming language that supports mutable references while guaranteeing scope safety. To put $λ^{α,\star}_{\text{Ref}}$ on a firm foundation, we also develop its statically typed sister language, $λ^α_{\text{Ref}}$, that introduces unrestricted subtyping for environment classifiers. Our key innovation, however, is the dynamic enforcement of the environment classifier discipline in $λ^{α,\star}_{\text{Ref}}$, enabling the language to mediate between statically verified scopes and dynamically verified scopes. The dynamic enforcement is carried out in a novel cast calculus $\mathrm{CC}^{α,\star}_{\text{Ref}}$ that uses an extension of Henglein's Coercion Calculus to handle code types, classifier polymorphism, and subtype constraints. We prove that $λ^{α,\star}_{\text{Ref}}$ satisfies type safety and scope safety. Finally, we provide a space-efficient implementation strategy for the dynamic scope checks, ensuring that the runtime overhead remains practical.

2026-02-23T15:19:41Z 34 pages, 19 figures Tianyu Chen Darshal Shetty Jeremy G. Siek Chao-Hong Chen Weixi Ma Arnaud Venet Rocky Liu http://arxiv.org/abs/2601.13040v2 CPU-less parallel execution of lambda calculus in digital logic 2026-02-23T14:16:39Z

While transistor density is still increasing, clock speeds are not, motivating the search for new parallel architectures. One approach is to completely abandon the concept of CPU -- and thus serial imperative programming -- and instead to specify and execute tasks in parallel, compiling from programming languages to data flow digital logic. It is well-known that pure functional languages are inherently parallel, due to the Church-Rosser theorem, and CPU-based parallel compilers exist for many functional languages. However, these still rely on conventional CPUs and their von Neumann bottlenecks. An alternative is to compile functional languages directly into digital logic to maximize available parallelism. It is difficult to work with complete modern functional languages due to their many features, so we demonstrate a proof-of-concept system using lambda calculus as the source language and compiling to digital logic. We show how functional hardware can be tailored to a simplistic functional language, forming the ground for a new model of CPU-less functional computation. At the algorithmic level, we use a tree-based representation, with data localized within nodes and communicated data passed between them. This is implemented by physical digital logic blocks corresponding to nodes, and buses enabling message passing. Node types and behaviors correspond to lambda grammar forms, and beta-reductions are performed in parallel allowing branches independent from one another to perform transformations simultaneously. As evidence for this approach, we present an implementation, along with simulation results, showcasing successful execution of lambda expressions. This suggests that the approach could be scaled to larger functional languages. Successful execution of a test suite of lambda expressions suggests that the approach could be scaled to larger functional languages.

2026-01-19T13:22:13Z Harry Fitchett Charles Fox http://arxiv.org/abs/2602.19868v1 Combining Small-Step and Big-Step Semantics to Verify Loop Optimizations 2026-02-23T14:12:17Z

Verified compilers aim to guarantee that compilation preserves the observable behavior of source programs. While small-step semantics are widely used in such compilers, they are not always the most convenient framework for structural transformations such as loop optimizations. This paper proposes an approach that leverages both small-step and big-step semantics: small-step semantics are used for local transformations, while big-step semantics are employed for structural transformations. An abstract behavioral semantics is introduced as a common interface between the two styles. Coinductive big-step semantics is extended to correctly handle divergence with both finite and infinite traces, bringing it on par with the expressiveness of small-step semantics. This enables the insertion of big-step transformations into the middle of an existing small-step pipeline, thereby fully preserving all top-level semantic preservation theorems. This approach is practically demonstrated in CompCert by implementing and verifying a few new loop optimizations in big-step Cminor, including loop unswitching and, notably, full loop unrolling.

2026-02-23T14:12:17Z 18 pages, 6 figures. Submitted to ITP 2026 David Knothe Oliver Bringmann http://arxiv.org/abs/2602.19762v1 Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs) 2026-02-23T12:12:39Z

In this paper, we present Hexagon-MLIR,an open-source compilation stack that targets Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models . Built using the MLIR framework, our compiler applies a structured sequence of passes to exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0), for our target by providing automated compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by providing developers with an open-source MLIR-based compilation stack that gives them a path to advance AI compilation capabilities through a more flexible approach. Hexagon-MLIR is a work-in-progress, and we are continuing to add many more optimizations and capabilities in this effort.

2026-02-23T12:12:39Z Mohammed Javed Absar Muthu Baskaran Abhikrant Sharma Abhilash Bhandari Ankit Aggarwal Arun Rangasamy Dibyendu Das Fateme Hosseini Franck Slama Iulian Brumar Jyotsna Verma Krishnaprasad Bindumadhavan Mitesh Kothari Mohit Gupta Ravishankar Kolachana Richard Lethin Samarth Narang Sanjay Motilal Ladwa Shalini Jain Snigdha Suresh Dalvi Tasmia Rahman Venkat Rasagna Reddy Komatireddy Vivek Vasudevbhai Pandya Xiyue Shi Zachary Zipper http://arxiv.org/abs/2602.19686v1 A Flow Extension to Coroutine Types for Deadlock Detection in Go 2026-02-23T10:31:22Z

Coroutines, as an abstract programming construct, are a generalization of functions that can suspend execution part- way for later resumption. Coroutine Types are behavioral types to model interactions of coroutines with a single receiving operation followed by a single yielding operation. Coroutine Types have been applied to model-driven engineering, smart contracts, and test case generation. We contribute a Flow extension to Coroutine Types, so that coroutines with more than one receiving and yielding operation can be modeled. We accordingly revise the reduction rules of Coroutine Types. To show the usefulness of the Flow extension, we contribute a type system that maps expressions of the Go programming language to Coroutine Types. If the reduction result is 0, the two channel operations are paired properly and the program has no deadlocks. We choose Go because it is a popular programming language for distributed systems, but a frequent kind of bugs in Go is deadlocks due to the wrong use of concurrency features. We concentrate on the most commonly used semantics in Go: unbuffered channels with the keywords go and defer. Our Flow extension and the type system recognize 17 patterns of channels and goroutine interactions, including mismatched receivers and senders, nested goroutines, etc. We also integrate the Z3 SMT solver to take account of conditional execution and type inheritance. Other static or dynamic deadlock detectors crashed or gave wrong predictions in some patterns. Therefore, our type-based deadlock analyzer not only fills the gap in the landscape of value-based detection, but also complements existing detectors.

2026-02-23T10:31:22Z Accepted in ICSESS 2025, Macao Qiqi Jason Gu Lixue Liu Wei Ke http://arxiv.org/abs/2505.24183v5 QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation 2026-02-23T09:24:51Z

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.

2025-05-30T03:51:06Z Yaoyu Zhu Di Huang Hanqi Lyu Xiaoyun Zhang Chongxiao Li Wenxuan Shi Yutong Wu Jianan Mu Jinghua Wang Yang Zhao Pengwei Jin Shuyao Cheng Shengwen Liang Xishan Zhang Rui Zhang Zidong Du Qi Guo Xing Hu Yunji Chen http://arxiv.org/abs/2501.05616v5 Validating Quantum State Preparation Programs (Extended Version) 2026-02-23T00:13:41Z

One of the key steps in quantum algorithms is to prepare an initial quantum superposition state with different kinds of features. These so-called state preparation algorithms are essential to the behavior of quantum algorithms, and complicated state preparation algorithms are difficult to develop correctly and effectively. This paper presents Pqasm: a high-assurance framework implemented with the Coq proof assistant, allowing us to certify our Pqasm tool to correctly reflect quantum program behaviors. The key in the framework is to reduce the program correctness assurance of a program containing a quantum superposition state to the program correctness assurance for the program state without superposition. The reduction allows the development of an effective testing framework for testing quantum state preparation algorithm implementations on a classical computer - considered to be a hard problem with no clear solution until this point. We utilize the QuickChick property-based testing framework to test state preparation programs. We evaluated the effectiveness of our approach over 5 case studies implemented using Pqasm; such cases are not even simulatable in the current quantum simulators.

2025-01-09T23:35:26Z Version 5 Liyi Li Anshu Sharma Zoukarneini Difaizi Tagba Sean Frett Alex Potanin http://arxiv.org/abs/2602.20204v1 Analyzing Latency Hiding and Parallelism in an MLIR-based AI Kernel Compiler 2026-02-22T19:14:23Z

AI kernel compilation for edge devices depends on the compiler's ability to exploit parallelism and hide memory latency in the presence of hierarchical memory and explicit data movement. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline: vectorization (Vec), multi-threading (MT) across hardware contexts, and double buffering (DB) using ping--pong scratchpad buffers to overlap DMA transfers with compute. Using Triton/Inductor-generated kernels, we present an ablation ladder that separates the contribution of Vec, MT, and DB, and we quantify how MT speedup scales with problem size using GELU as a representative activation kernel. The results show that vectorization provides the primary gain for bandwidth-sensitive kernels, MT delivers substantial improvements once scheduling overhead is amortized, and DB provides additional benefit when transfers and compute can be overlapped (i.e., outside the extremes of purely memory-bound or purely compute-bound behavior).

2026-02-22T19:14:23Z Accepted at MLBench workshop as part of ASPLOS'26 Javed Absar Samarth Narang Muthu Baskaran http://arxiv.org/abs/2512.09412v2 Simple Modal Types for Functional Reactive Programming 2026-02-22T18:32:32Z

Functional reactive programming (FRP) is a declarative programming paradigm for implementing reactive programs at a high level of abstraction. It applies functional programming principles to construct and manipulate time-varying values, also known as signals. However, for this programming paradigm to work in practice, an FRP language must ensure that programs are causal, productive, and free of space leaks. Over the past fifteen years, several modal type systems to enforce these operational properties have been developed. We present a new FRP language with a significantly simplified modal type system that imposes fewer restrictions than previous modal FRP languages while still guaranteeing the central operational properties of causality, productivity, and absence of space leaks. The key enabling idea is to alter the semantics of signals so that the type system can safely allow more programs to type-check, thereby making the language more expressive, too. With this new semantics, signals are modelled as mutable references whose mutability is tightly controlled by the 'later' type modality. This disciplined form of mutability also enables more efficient in-place updates of signals, all while preserving a functional programming style.

2025-12-10T08:14:19Z Patrick Bahr http://arxiv.org/abs/2602.21251v1 AgenticTyper: Automated Typing of Legacy Software Projects Using Agentic AI 2026-02-21T17:53:30Z

Legacy JavaScript systems lack type safety, making maintenance risky. While TypeScript can help, manually adding types is expensive. Previous automated typing research focuses on type inference but rarely addresses type checking setup, definition generation, bug identification, or behavioral correctness at repository scale. We present AgenticTyper, a Large Language Model (LLM)-based agentic system that addresses these gaps through iterative error correction and behavior preservation via transpilation comparison. Evaluation on two proprietary repositories (81K LOC) shows that AgenticTyper resolves all 633 initial type errors in 20 minutes, reducing manual effort from one working day.

2026-02-21T17:53:30Z Accepted at ICSE 2026 Student Research Competition (SRC) Clemens Pohle http://arxiv.org/abs/2602.17037v2 Wink: Recovering from Misbehaviors in Coding Agents 2026-02-20T18:13:08Z

Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale.

2026-02-19T03:15:00Z Rahul Nanda Chandra Maddila Smriti Jha Euna Mehnaz Khan Matteo Paltenghi Satish Chandra