https://arxiv.org/api/qvd4s/o3fXwX8MbCL5eojdvufKs2026-06-23T08:31:40Z993493015http://arxiv.org/abs/2511.19764v1Understanding Accelerator Compilers via Performance Profiling2025-11-24T22:40:11ZAccelerator design languages (ADLs), high-level languages that compile to hardware units, help domain experts quickly design efficient application-specific hardware. ADL compilers optimize datapaths and convert software-like control flow constructs into control paths. Such compilers are necessarily complex and often unpredictable: they must bridge the wide semantic gap between high-level semantics and cycle-level schedules, and they typically rely on advanced heuristics to optimize circuits. The resulting performance can be difficult to control, requiring guesswork to find and resolve performance problems in the generated hardware. We conjecture that ADL compilers will never be perfect: some performance unpredictability is endemic to the problem they solve.
In lieu of compiler perfection, we argue for compiler understanding tools that give ADL programmers insight into how the compiler's decisions affect performance. We introduce Petal, a cycle-level Petal for the Calyx intermediate language (IL). Petal instruments the Calyx code with probes and then analyzes the trace from a register-transfer-level simulation. It maps the events in the trace back to high-level control constructs in the Calyx code to track the clock cycles when each construct was active. Using case studies, we demonstrate that Petal's cycle-level profiles can identify performance problems in existing accelerator designs. We show that these insights can also guide developers toward optimizations that the compiler was unable to perform automatically, including a reduction by 46.9\% of total cycles for one application.2025-11-24T22:40:11ZAyaka YorihiroGriffin BerlsteinPedro Pontes GarcíaKevin LaeuferAdrian Sampsonhttp://arxiv.org/abs/2511.19711v1CrypTorch: PyTorch-based Auto-tuning Compiler for Machine Learning with Multi-party Computation2025-11-24T21:21:55ZMachine learning (ML) involves private data and proprietary model parameters. MPC-based ML allows multiple parties to collaboratively run an ML workload without sharing their private data or model parameters using multi-party computing (MPC). Because MPC cannot natively run ML operations such as Softmax or GELU, existing frameworks use different approximations. Our study shows that, on a well-optimized framework, these approximations often become the dominating bottleneck. Popular approximations are often insufficiently accurate or unnecessarily slow, and these issues are hard to identify and fix in existing frameworks. To tackle this issue, we propose a compiler for MPC-based ML, CrypTorch. CrypTorch disentangles these approximations with the rest of the MPC runtime, allows easily adding new approximations through its programming interface, and automatically selects approximations to maximize both performance and accuracy. Built as an extension to PyTorch 2's compiler, we show that CrypTorch's auto-tuning alone provides 1.20--1.7$\times$ immediate speedup without sacrificing accuracy, and 1.31--1.8$\times$ speedup when some accuracy degradation is allowed, compared to our well-optimized baseline. Combined with better engineering and adoption of state-of-the-art practices, the entire framework brings 3.22--8.6$\times$ end-to-end speedup compared to the popular framework, CrypTen.2025-11-24T21:21:55Z28 pages, 17 figures. Submitted to PLDI 2026Jinyu LiuGang TanKiwan Maenghttp://arxiv.org/abs/2511.19422v1SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning2025-11-24T18:56:47ZRecent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contains syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs to improve the quality of LLM-generated programs for domain-specific languages (DSLs). In specific, we applied RL on the SLM for the program repair task using a reward calculated using both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving more than 95% pass rate on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms supervised finetuning approach even for 7B models on a LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.2025-11-24T18:56:47ZDavid Jiahao FuAryan GuptaAaron CouncilmanDavid GroveYu-Xiong WangVikram Advehttp://arxiv.org/abs/2503.04763v2MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study2025-11-24T18:41:20ZIn this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: https://github.com/LLM4Rocq/miniF2F-rocq.2025-02-11T09:32:55ZJules ViennotGuillaume BaudartEmilio Jesùs Gallego AriasMarc Lelargehttp://arxiv.org/abs/2511.19521v1Mechanizing a Proof-Relevant Logical Relation for Timed Message-Passing Protocols2025-11-24T05:07:11ZSemantic typing has become a powerful tool for program verification, applying the technique of logical relations as not only a proof method, but also a device for prescribing program behavior. In recent work, Yao et al. scaled semantic typing to the verification of timed message-passing protocols, which are prevalent in, e.g., IoT and real-time systems applications. The appeal of semantic typing in this context is precisely because of its ability to support typed and untyped program components alike -- including physical objects -- which caters to the heterogeneity of these applications. Another demand inherent to these applications is timing: constraining the time or time window within which a message exchange must happen. Yao et al. equipped their logical relation not only with temporal predicates, but also with computable trajectories, to supply the evidence that an inhabitant can step from one time point to another one. While Yao et al. provide the formalization for such a verification tool, it lacks a mechanization. Mechanizing the system would not only provide a machine proof for it, but also facilitate scalability for future extensions and applications.
This paper tackles the challenge of mechanizing the resulting proof-relevant logical relation in a proof assistant. allowing trajectories to be interleaved, partitioned, and concatenated, while the intended equality on trajectories is the equality of their graphs when seen as processes indexed by time. Unfortunately, proof assistants based on intensional type theory only have modest support for such equations, forcing a prolific use of transports. This paper reports on the process of mechanizing Yao et al.'s results, comprising the logical relation, the algebra of computable trajectories with supporting lemmas, and the fundamental theorem of the logical relation, in the Rocq theorem prover.2025-11-24T05:07:11Z15 pages, 9 figuresTesla ZhangAsher KornfeldRui LiSonya SimkinYue YaoStephanie Balzerhttp://arxiv.org/abs/2506.06544v2Reasoning about External Calls2025-11-23T22:23:39ZIn today's complex software, internal trusted code is tightly intertwined with external untrusted code. To reason about internal code, programmers must reason about the potential effects of calls to external code, even though that code is not trusted and may not even be available. The effects of external calls can be limited, if internal code is programmed defensively, limiting potential effects by limiting access to the capabilities necessary to cause those effects.
This paper addresses the specification and verification of internal code that relies on encapsulation and object capabilities to limit the effects of external calls. We propose new assertions for access to capabilities, new specifications for limiting effects, and a Hoare logic to verify that a module satisfies its specification, even while making external calls. We illustrate the approach though a running example with mechanised proofs, and prove soundness of the Hoare logic.2025-06-06T21:24:26Z86 pages, 25 main paper, and 58 pages of appendices, many diagrams and figuresSophia DrossopoulouJulian MackaySusan EisenbachJames Noblehttp://arxiv.org/abs/2503.22688v4CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation2025-11-23T02:21:33ZLarge Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions. They offer limited insight into LLMs' abilities to generate code that strictly follows users' instructions in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. In both Static Conversation and Dynamic Conversation settings, we evaluate the performance of 6 state-of-the-art LLMs and summarize the important factors, additional repository context and gradually increasing interaction history influencing the instruction-following ability of LLMs in multi-turn interactions. Furthermore, we identify the potential direction for improvement: context management. The code and data are available at \href{https://github.com/zhu-zhu-ding/CodeIF-Bench}{https://github.com/zhu-zhu-ding/CodeIF-Bench}.2025-03-05T09:47:02ZPeiding WangLi ZhangFang LiuLin ShiMinxiao LiBo ShenAn Fuhttp://arxiv.org/abs/2410.01215v4From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging2025-11-22T16:20:24ZWhile large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger by isolating, identifying, and resolving bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.2024-10-02T03:57:21ZAccepted to ICSE 2026. Code and data available at https://github.com/YerbaPage/MGDebuggerYuling ShiSongsong WangChengcheng WanMin WangXiaodong Guhttp://arxiv.org/abs/2511.17838v1TensorRight: Automated Verification of Tensor Graph Rewrites2025-11-21T23:20:01ZTensor compilers, essential for generating efficient code for deep learning models across various applications, employ tensor graph rewrites as one of the key optimizations. These rewrites optimize tensor computational graphs with the expectation of preserving semantics for tensors of arbitrary rank and size. Despite this expectation, to the best of our knowledge, there does not exist a fully automated verification system to prove the soundness of these rewrites for tensors of arbitrary rank and size. Previous works, while successful in verifying rewrites with tensors of concrete rank, do not provide guarantees in the unbounded setting.
To fill this gap, we introduce TensorRight, the first automatic verification system that can verify tensor graph rewrites for input tensors of arbitrary rank and size. We introduce a core language, TensorRight DSL, to represent rewrite rules using a novel axis definition, called aggregated-axis, which allows us to reason about an unbounded number of axes. We achieve unbounded verification by proving that there exists a bound on tensor ranks, under which bounded verification of all instances implies the correctness of the rewrite rule in the unbounded setting. We derive an algorithm to compute this rank using the denotational semantics of TensorRight DSL. TensorRight employs this algorithm to generate a finite number of bounded-verification proof obligations, which are then dispatched to an SMT solver using symbolic execution to automatically verify the correctness of the rewrite rules. We evaluate TensorRight's verification capabilities by implementing rewrite rules present in XLA's algebraic simplifier. The results demonstrate that TensorRight can prove the correctness of 115 out of 175 rules in their full generality, while the closest automatic, bounded-verification system can express only 18 of these rules.2025-11-21T23:20:01Z61 pages, 13 figures, published in POPL 2025Volume 9, Issue POPL, 2025, Pages 832-863Jai AroraSirui LuDevansh JainTianfan XuFarzin HoushmandPhitchaya Mangpo PhothilimthanaMohsen LesaniPraveen NarayananKarthik Srinivasa MurthyRastislav BodikAmit SabneCharith Mendis10.1145/3704865http://arxiv.org/abs/2507.13290v2Towards Formal Verification of LLM-Generated Code from Natural Language Prompts2025-11-20T21:09:31ZIn the past few years LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code based on it. However, the reliability of LLM code generation and current validation techniques for it are far from strong enough to be used for mission-critical or safety-critical applications. In this work we explore ways to offer formal guarantees of correctness to LLM generated code; such guarantees could improve the quality of general AI Code Assistants and support their use for critical applications. To address this challenge we propose to incorporate a Formal Query Language that can represent a user's intent in a formally defined but natural language-like manner that a user can confirm matches their intent. We then have a formal specification of the user intent which we can use to verify that LLM-generated code matches the user's intent. We implement these ideas in our system, Astrogator, for the Ansible programming language, widely used for system administration, including for critical systems. The system includes an intuitive formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter and a unification algorithm which together are used for the verification. A key innovation in Astrogator is the use of a Knowledge Base to capture system-specific implementation dependencies that greatly reduce the need for system knowledge in expressing formal queries. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.2025-07-17T16:54:42Z28 pages, 10 figuresAaron CouncilmanDavid Jiahao FuAryan GuptaChengxiao WangDavid GroveYu-Xiong WangVikram Advehttp://arxiv.org/abs/2511.16395v1CorrectHDL: Agentic HDL Design with LLMs Leveraging High-Level Synthesis as Reference2025-11-20T14:13:38ZLarge Language Models (LLMs) have demonstrated remarkable potential in hardware front-end design using hardware description languages (HDLs). However, their inherent tendency toward hallucination often introduces functional errors into the generated HDL designs. To address this issue, we propose the framework CorrectHDL that leverages high-level synthesis (HLS) results as functional references to correct potential errors in LLM-generated HDL designs.The input to the proposed framework is a C/C++ program that specifies the target circuit's functionality. The program is provided to an LLM to directly generate an HDL design, whose syntax errors are repaired using a Retrieval-Augmented Generation (RAG) mechanism. The functional correctness of the LLM-generated circuit is iteratively improved by comparing its simulated behavior with an HLS reference design produced by conventional HLS tools, which ensures the functional correctness of the result but can lead to suboptimal area and power efficiency. Experimental results demonstrate that circuits generated by the proposed framework achieve significantly better area and power efficiency than conventional HLS designs and approach the quality of human-engineered circuits. Meanwhile, the correctness of the resulting HDL implementation is maintained, highlighting the effectiveness and potential of agentic HDL design leveraging the generative capabilities of LLMs and the rigor of traditional correctness-driven IC design flows.2025-11-20T14:13:38Z7 pages, 15 figures, 2 tablesKangwei XuGrace Li ZhangUlf SchlichtmannBing Lihttp://arxiv.org/abs/2511.16080v1Operon: Incremental Construction of Ragged Data via Named Dimensions2025-11-20T06:16:31ZModern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language where users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon's incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system's explicit modeling of partially-known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation demonstrates that Operon outperforms an existing workflow engine with 14.94x baseline overhead reduction while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.2025-11-20T06:16:31ZSungbin MoonJiho ParkSuyoung HwangDonghyun KohSeunghyun MoonMinhyeong Leehttp://arxiv.org/abs/2509.13128v2Try-Mopsa: Relational Static Analysis in Your Pocket2025-11-19T23:04:30ZStatic analyzers are complex pieces of software with large dependencies. They can be difficult to install, which hinders adoption and creates barriers for students learning static analysis. This work introduces Try-Mopsa: a scaled-down version of the Mopsa static analysis platform, compiled into JavaScript to run purely as a client-side application in web browsers. Try-Mopsa provides a responsive interface that works on both desktop and mobile devices. Try-Mopsa features all the core components of Mopsa. In particular, it supports relational numerical domains. We present the interface, changes and adaptations required to have a pure JavaScript version of Mopsa. We envision Try-Mopsa as a convenient platform for onboarding or teaching purposes.2025-09-16T14:38:19ZRaphaël Monathttp://arxiv.org/abs/2511.15821v1BlueScript: A Disaggregated Virtual Machine for Microcontrollers2025-11-19T19:23:15ZVirtual machines (VMs) are highly beneficial for microcontroller development.
In particular, interactive programming environments greatly facilitate iterative development processes,
and higher execution speeds expand the range of applications that can be developed.
However, due to their limited memory size, microcontroller VMs provide a limited set of features.
Widely used VMs for microcontrollers often lack interactive responsiveness and/or high execution speed.
While researchers have investigated offloading certain VM components to other machines,the types of components that can be offloaded are still restricted.
In this paper, we propose a disaggregated VM that offloads as many components as possible to a host machine.
This makes it possible to exploit the abundant memory of the host machine and its powerful processing capability to provide rich features through the VM.
As an instance of a disaggregated VM, we design and implement a BlueScript VM.
The BlueScript VM is a virtual machine for microcontrollers that provides an interactive development environment.
We offload most of the components of the BlueScript VM to a host machine.
To reduce communication overhead between the host machine and the microcontroller,
we employed a data structure called a shadow machine on the host machine,
which mirrors the execution state of the microcontroller.
Through our experiments, we confirmed that offloading components does not seriously compromise their expected benefits.
We assess that an offloaded incremental compiler results in faster execution speed than MicroPython and Espruino,
while keeping interactivity comparable with MicroPython.
In addition, our experiments observe that the offloaded dynamic compiler improves VM performance.
Through this investigation, we demonstrate the feasibility of providing rich features even on VMs for memory-limited microcontrollers.2025-11-19T19:23:15ZThe Art, Science, and Engineering of Programming, 2025, Vol. 10, Issue 3, Article 21Fumika MochizukiUniversity of Tokyo, Tokyo, JapanTetsuro YamazakiUniversity of Tokyo, Tokyo, JapanShigeru ChibaUniversity of Tokyo, Tokyo, Japan10.22152/programming-journal.org/2025/10/21http://arxiv.org/abs/2511.15820v1Chorex: Restartable, Language-Integrated Choreographies2025-11-19T19:23:00ZWe built Chorex, a language that brings choreographic programming to Elixir as a path toward robust distributed applications. Chorex is unique among choreographic languages because it tolerates failure among actors: when an actor crashes, Chorex spawns a new process, restores state using a checkpoint, and updates the network configuration for all actors. Chorex also proves that full-featured choreographies can be implemented via metaprogramming, and that doing so achieves tight integration with the host language. For example, mismatches between choreography requirements and an actor implementation are reported statically and in terms of source code rather than macro-expanded code. This paper illustrates Chorex on several examples, ranging from a higher-order bookseller to a secure remote password protocol, details its implementation, and measures the overhead of checkpointing. We conjecture that Chorex's projection strategy, which outputs sets of stateless functions, is a viable approach for other languages to support restartable actors.2025-11-19T19:23:00ZThe Art, Science, and Engineering of Programming, 2025, Vol. 10, Issue 3, Article 20Ashton WiersdorfUniversity of Utah, USABen GreenmanUniversity of Utah, USA10.22152/programming-journal.org/2025/10/20