https://arxiv.org/api/RkvxCcN9XL7qsBlCOecjk0vSxzU 2026-06-14T15:32:26Z 9885 405 15 http://arxiv.org/abs/2604.05246v1 A Gradual Probabilistic Lambda Calculus 2026-04-06T23:13:56Z

Probabilistic programming languages have recently gained a lot of attention, in particular due to their applications in domains such as machine learning and differential privacy. To establish invariants of interest, many such languages include some form of static checking in the form of type systems. However, adopting such a type discipline can be cumbersome or overly conservative. Gradual typing addresses this problem by supporting a smooth transition between static and dynamic checking, and has been successfully applied for languages with different constructs and type abstractions. Nevertheless, its benefits have never been explored in the context of probabilistic languages. In this work, we present and formalize GPLC, a gradual source probabilistic lambda calculus. GPLC includes a binary probabilistic choice operator and allows programmers to gradually introduce/remove static type -- and probability -- annotations. The static semantics of GPLC heavily relies on the notion of probabilistic couplings, as required for defining several relations, such as consistency, precision, and consistent transitivity. The dynamic semantics of GPLC is given via elaboration to the target language TPLC, which features a distribution-based semantics interpreting programs as probability distributions over final values. Regarding the language metatheory, we establish that TPLC -- and therefore also GPLC -- is type safe and satisfies two of the so-called refined criteria for gradual languages, namely, that it is a conservative extension of a fully static variant and that it satisfies the gradual guarantee, behaving smoothly with respect to type precision.

2026-04-06T23:13:56Z Wenjia Ye Matías Toro Federico Olmedo http://arxiv.org/abs/2604.06258v1 Accurate Residues for Floating-Point Debugging 2026-04-06T22:46:06Z

Floating-point arithmetic is error-prone and unintuitive. Floating-point debuggers instrument programs to monitor floating-point arithmetic at run time and flag numerical issues. They estimate residues, i.e., the difference between actual floating-point and ideal real values, for every floating-point value in the program. Prior work explores various approaches for computing these residues accurately and efficiently. Unfortunately, the most efficient methods, based on "error-free transformations", have a high rate of false reports, while the most accurate methods, based on high-precision arithmetic, are very slow. This paper builds on error-free-transformations-based approaches and aims to improve their accuracy while preserving efficiency. To more accurately compute residues, this paper divides residue computation into two steps (rounding error computation and residue function evaluation) and shows how to perform each step accurately via careful improvements to the current state of the art. We evaluate on 44 large scientific computing workloads, focusing on the 14 benchmarks where prior tools produce false reports: our approach eliminates false reports on 10 benchmarks and substantially reduces them on the remaining 3 benchmarks. Moreover, complex numerical issues require additional care due to absorption, where two machine-precision residues cannot both be computed accurately in a single execution. This paper introduces residue override, which re-executes the program multiple times, computing different residues in different executions and assembling a final "patchwork" execution. We evaluate on 169 standard benchmarks drawn from numerical analysis papers and textbooks, requiring only 3.6 re-executions on average. Among 34 benchmarks with false reports in the initial run, residue override is triggered on 29 of them and reduces false reports on 25 of them, averaging 7.1 re-executions.

2026-04-06T22:46:06Z Yumeng He Pavel Panchekha http://arxiv.org/abs/2410.19940v4 Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification 2026-04-06T22:27:51Z

Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.

2024-10-25T19:25:00Z 11 pages, 13 figures, Appearing at ICSE '26, Rio de Janeiro, Brazil Saketh Ram Kasibatla Arpan Agarwal Yuriy Brun Sorin Lerner Talia Ringer Emily First 10.1145/3744916.3773178 http://arxiv.org/abs/2604.05137v1 EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback 2026-04-06T19:56:26Z

Large language models (LLMs) often generate code that is functionally correct but inefficient in runtime and memory. Prior approaches to improving code efficiency typically rely on absolute execution feedback, such as profiling a single program's runtime or memory usage, which is costly and provides weak guidance for refinement. We propose Relative Contrastive Feedback (RCF), an inference-time feedback mechanism that requires no model fine-tuning or parameter updates. RCF compares two structurally similar programs for the same task and highlights the differences associated with better efficiency. Building on this idea, we introduce EffiPair, an inference-time iterative refinement framework that operates entirely at test time by generating multiple candidate solutions, identifying informative program pairs with large efficiency gaps, summarizing their execution differences into lightweight feedback, and using this signal to produce more efficient solutions. By replacing isolated scalar feedback with pairwise contrastive comparisons, EffiPair provides more direct guidance while reducing profiling and prompting overhead. Experiments on code-efficiency benchmarks show that EffiPair consistently improves efficiency while preserving correctness. For instance, with DeepSeek-Chat V3.2, EffiPair achieves up to 1.5x speedup over generation without performance feedback, while reducing token usage by more than 90% compared to prior work.

2026-04-06T19:56:26Z Samira Hajizadeh Suman Jana http://arxiv.org/abs/2604.06253v1 FLeX: Fourier-based Low-rank EXpansion for multilingual transfer 2026-04-06T19:26:13Z

Cross-lingual code generation is critical in enterprise environments where multiple programming languages coexist. However, fine-tuning large language models (LLMs) individually for each language is computationally prohibitive. This paper investigates whether parameter-efficient fine-tuning methods and optimizer enhancements can improve cross-lingual transfer from Python to languages like Java. We fine-tune the Code Llama 7B model using low-rank adaptation (LoRA) to optimize a small subset of parameters and compare Adam and Sophia optimizers, while exploring a novel Fourier-based regularization technique. Our contributions include: (1)demonstrating that LoRA fine-tuning on a small, high-quality dataset (MBPP) can exceed the pass@1 performance of the more broadly fine-tuned Code Llama-Python-7B model (40.1% vs. 38.4%); (2) showing that while Sophia achieves faster convergence than Adam, final pass@1 scores show marginal differences; and (3) presenting evidence that Fourier-based regularization during fine-tuning significantly improves cross-lingual transfer, achieving 42.1% pass@1 on Java tasks compared to the 34.2% baseline. These findings suggest that combining LoRA, optimized training methods, and frequency-domain regularization can efficiently adapt single-language LLMs to perform well across multiple programming languages.

2026-04-06T19:26:13Z 19 pages, 25 figures, Stanford CS224N Custom Project Gaurav Narasimhan http://arxiv.org/abs/2604.05066v1 AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels 2026-04-06T18:12:39Z

Data movement is the primary bottleneck in modern computing systems. For loop-based programs common in high-performance computing (HPC) and AI workloads, including matrix multiplication, tensor contraction, stencil computation, and einsum operations, the cost of moving data through the memory hierarchy often exceeds the cost of arithmetic. This paper presents AutoLALA, an open-source tool that analyzes data locality in affine loop programs. The tool accepts programs written in a small domain-specific language (DSL), lowers them to polyhedral sets and maps, and produces closed-form symbolic formulas for reuse distance and data movement complexity. AutoLALA implements the fully symbolic locality analysis of Zhu et al. together with the data movement distance (DMD) framework of Smith et al. In particular, it computes reuse distance as the image of the access space under the access map, avoiding both stack simulation and Denning's recursive working-set formulation. We describe the DSL syntax and its formal semantics, the polyhedral lowering pipeline that constructs timestamp spaces and access maps via affine transformations, and the sequence of Barvinok counting operations used to derive symbolic reuse-interval and reuse-distance distributions. The system is implemented in Rust as a modular library spanning three crates, with safe bindings to the Barvinok library. We provide both a command-line interface and an interactive web playground with LaTeX rendering of the output formulas. The tool handles arbitrary affine loop nests, covering workloads such as tensor contractions, einsum expressions, stencil computations, and general polyhedral programs.

2026-04-06T18:12:39Z Yifan Zhu Yekai Pan Yanghui Wu Chen Ding http://arxiv.org/abs/2604.05006v1 Guidelines for Producing Concise LNT Models, Illustrated with Formal Models of the Algorand Consensus Protocol 2026-04-06T09:09:52Z

LNT is a modern language for the formal description of concurrent systems. It generalizes traditional process calculi and overcomes their known limitations by incorporating features such as an imperative programming style with direct assignments to variables, symmetric sequential composition, and explicit loop operators. The present article examines how these features can be taken advantage of to obtain LNT models as concise and readable as possible. The study is illustrated with a running example, the consensus protocol of the Algorand blockchain, a formal model of which was recently developed at the University of Urbino. It is shown that, using well-chosen transformations, the number of lines of LNT code can be divided by three, while improving readability. Also, various properties of the formal model are expressed and verified using visual checking, equivalence checking, and model checking.

2026-04-06T09:09:52Z In Proceedings MARS 2026, arXiv:2604.03053 EPTCS 443, 2026, pp. 43-83 Hubert Garavel 10.4204/EPTCS.443.5 http://arxiv.org/abs/2604.04527v1 ENCRUST: Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation 2026-04-06T08:46:14Z

We present Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation, a two-phase pipeline for translating real-world C projects to safe Rust. Existing approaches either produce unsafe output without memory-safety guarantees or translate functions in isolation, failing to detect cross-unit type mismatches or handle unsafe constructs requiring whole-program reasoning. Furthermore, function-level LLM pipelines require coordinated caller updates when type signatures change, while project-scale systems often fail to produce compilable output under real-world dependency complexity. Encrust addresses these limitations by decoupling boundary adaptation from function logic via an Application Binary Interface (ABI)-preserving wrapper pattern and validating each intermediate state against the integrated codebase. Phase 1 (Encapsulated Substitution) translates each function using an ABI-preserving wrapper that splits it into two components: a caller-transparent shim retaining the original raw-pointer signature, and a safe inner function targeted by the LLM with a clean, scope-limited prompt. This enables independent per-function type changes with automatic rollback on failure, without coordinated caller updates. A deterministic, type-directed wrapper elimination pass then removes wrappers after successful translation. Phase 2 (Agentic Refinement) resolves unsafe constructs beyond per-function scope, including static mut globals, skipped wrapper pairs, and failed translations, using an LLM agent operating on the whole codebase under a baseline-aware verification gate. We evaluate Encrust on 7 GNU Coreutils programs and 8 libraries from the Laertes benchmark, showing substantial unsafe-construct reduction across all 15 programs while maintaining full test-vector correctness.

2026-04-06T08:46:14Z Hohyun Sim Hyeonjoong Cho Ali Shokri Zhoulai Fu Binoy Ravindran http://arxiv.org/abs/2505.15858v3 Search-Based Multi-Trajectory Refinement for Safe C-to-Rust Translation with Large Language Models 2026-04-06T08:10:41Z

The C programming language has been foundational in building system-level software. However, its manual memory management model frequently leads to memory safety issues. In response, Rust has emerged as a memory-safe alternative. Moreover, automating the C-to-Rust translation empowered by the rapid advancements of the generative capabilities of LLMs is gaining growing interest for large volumes of legacy C code. Leveraging LLM for the C-to-Rust translation introduces distinct challenges, unlike the math or commonsense QA domains where the LLMs have been predominantly applied. First, the scarcity of parallel C-to-Rust datasets hinders the retrieval of suitable code translation exemplars for in-context learning. Second, unlike math or commonsense QA problems, the intermediate steps required for C-to-Rust are not well-defined. Third, it remains unclear how to organize and cascade these intermediate steps to construct a correct translation trajectory. While existing LLM-based approaches have achieved some success, they have relied on iterative code refinement along a single search trajectory on a C-to-Rust problem space and have not explored the use of systematic search mechanisms to navigate the space of possible refinement trajectories. To address these challenges in the C-to-Rust translation, we propose the MCTS-Guided LLM refinement technique for automated C-to-safe-Rust translation (LAC2R). LAC2R uses MCTS to systematically explore multiple refinement trajectories and organize the LLM-induced intermediate steps for correct translation. We experimentally demonstrated that LAC2R effectively conducts C-to-Rust translation on large-scale, real-world benchmarks. On small-scale benchmarks, LAC2R is the only method that simultaneously attains the highest safety ratio, perfect project-level correctness, and the fewest linter warnings among the compared methods.

2025-05-21T01:26:23Z HoHyun Sim Hyeonjoong Cho Yeonghyeon Go Sadegh AlMahdi Kazemi Zarkouei Zhoulai Fu Ali Shokri Binoy Ravindran http://arxiv.org/abs/2604.04345v1 Trace-Guided Synthesis of Effectful Test Generators 2026-04-06T01:38:25Z

Several recently proposed program logics have incorporated notions of underapproximation into their design, enabling them to reason about reachability rather than safety. In this paper, we explore how similar ideas can be integrated into an expressive type and effect system. We use the resulting underapproximate type specifications to guide the synthesis of test generators that probe the behavior of effectful black-box systems. A key novelty of our type language is its ability to capture underapproximate behaviors of effectful operations using symbolic traces that expose latent data and control dependencies, constraints that must be preserved by the test sequences the generator outputs. We implement this approach in a tool called Clouseau, and evaluate it on a diverse range of applications by integrating Clouseau's synthesized generators into property-based testing frameworks like QCheck and model-checking tools like P. In both settings, the generators synthesized by Clouseau are significantly more effective than the default testing strategy, and are competitive with state-of-the-art, handwritten solutions.

2026-04-06T01:38:25Z Zhe Zhou Ankush Desai Benjamin Delaware Suresh Jagannathan http://arxiv.org/abs/2604.04238v1 Agentic Code Optimization via Compiler-LLM Cooperation 2026-04-05T19:44:02Z

Generating performant executables from high level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high-level reasoning about a program's purpose. Recent work has proposed using LLMs to fill this gap. While LLMs can achieve large speedups on some programs, they frequently generate code that is incorrect. In this work, we propose a method to balance the correctness of conventional compiler optimizations with the ``creativity'' of LLM-based code generation: compiler-LLM cooperation. Our approach integrates existing compiler optimization passes with LLM-based code generation at multiple levels of abstraction, retaining the best features of both types of code optimization. We realize our approach with a multi-agent system that includes (1) LLM-based optimization agents for each level of abstraction, (2) individual compiler constituents as tools, (3) an LLM-based test generation agent that probes the correctness and performance of generated code, and (4) a guiding LLM that orchestrates the other components. The strategy enables LLM-based optimization of input programs at multiple levels of abstraction and introduces a method for distributing computational budget between levels. Our extensive evaluation shows that compiler-LLM cooperation outperforms both existing compiler optimizations and level-specific LLM-based baselines, producing speedups up to 1.25x.

2026-04-05T19:44:02Z Benjamin Mikek Danylo Vashchilenko Bryan Lu Panpan Xu http://arxiv.org/abs/2604.04236v1 NEURA: A Unified and Retargetable Compilation Framework for Coarse-Grained Reconfigurable Architectures 2026-04-05T19:36:51Z

Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising and versatile accelerator platform, offering a balance between the performance and efficiency of specialized accelerators and the software programmability. However, their full potential is severely hindered by control flow in accelerated kernels, as the control flow (e.g., loops, branches) is fundamentally incompatible with the parallel, data-driven CGRA fabric. Prior strategies to resolve this mismatch in CGRA kernel acceleration are either inefficient, sacrificing performance for generality, or lack generality due to the difficulty of adapting them across different execution models. Thus, a general and unified solution for efficient CGRA kernel acceleration remains elusive. This paper introduces NEURA, a unified and retargetable compilation framework that systematically resolves the control-dataflow mismatch in CGRAs. NEURA's core innovation is a novel, pure dataflow intermediate representation (IR) built on a predicated type system. In this IR, control contexts are embedded as a predicate within each data, making control an intrinsic property of data. This mechanism enables NEURA to systematically flatten complex control flow into a single unified dataflow graph. This unified representation decouples kernel representation from hardware, empowering NEURA to retarget diverse CGRAs with different execution models and microarchitectural features. When targeted to a high-performance spatio-temporal CGRA, NEURA delivers a 2.20x speedup on kernel benchmarks and up to 2.71x geometric mean speedup on real-world applications over state-of-the-art (SOTA) high-performance baselines. It also provides a competitive solution against the SOTA low-power CGRA when retargeted to a spatial-only CGRA. NEURA is open-source and available at https://github.com/coredac/neura.

2026-04-05T19:36:51Z Accepted by PLDI 2026 Shangkun Li Jinming Ge Diyuan Tao Zeyu Li Jiawei Liang Linfeng Du Jiang Xu Wei Zhang Cheng Tan http://arxiv.org/abs/2604.03986v1 COBOL-Coder: Domain-Adapted Large Language Models for COBOL Code Generation and Translation 2026-04-05T06:12:49Z

COBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL-Java translation). In our experiments, COBOL-Coder achieves up to a 73.95 percent compilation success rate and 49.33 Pass-1 on COBOLEval, compared to 41.8 percent and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches 34.93 Pass-1, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, has more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.

2026-04-05T06:12:49Z Anh T. V. Dau Shin Hwei Tan Jinqiu Yang Nghi D. Q. Bui Anh Tuan Nguyen http://arxiv.org/abs/2604.03978v1 COBOLAssist: Analyzing and Fixing Compilation Errors for LLM-Powered COBOL Code Generation 2026-04-05T05:51:54Z

Legacy programming languages such as COBOL (Common Business-Oriented Language) remain critical in business computing. However, maintaining legacy COBOL systems is increasingly challenging due to a declining pool of skilled developers and the persistence of COBOL errors that require deep domain expertise to resolve. This paper investigates the challenges of COBOL compilation errors and introduces a framework leveraging large language models (LLMs) to address these issues. We first categorize the common compilation errors in LLM-generated COBOL code into three groups: incomplete code errors, syntax errors, and type-related errors. We further propose COBOLAssist, a technique to enhance code correctness through iterative repairs guided by compilation feedback. Our evaluation using five LLMs including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5\% to 64.38\% for GPT-4o-mini and from 41.8\% to 95.89\% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4. Notably, while mAInframer-34B achieves the highest compilation success rate, its functional correctness remains limited. This research not only highlights the limitations in current LLMs for COBOL but also demonstrates a practical path forward for automated debugging in legacy systems.

2026-04-05T05:51:54Z Anh T. V. Dau Shin Hwei Tan Jinqiu Yang Nghi D. Q. Bui Anh Tuan Nguyen http://arxiv.org/abs/2604.03971v1 Automated Expected Cost Analysis for Quantum Programs 2026-04-05T05:36:55Z

In recent years, quantum computing has gained a substantial amount of momentum, and the capabilities of quantum devices are continually expanding and improving. Nevertheless, writing a quantum program from scratch remains tedious and error-prone work, showcasing the clear demand for automated tool support. We present Qet, a fully automated static program analysis tool that yields a precise expected cost analysis of mixed classical-quantum programs. Qet supports programs with advanced features like mid-circuit measurements and classical control flow. The methodology of our prototype implementation is based on a recently proposed quantum expectation transformer framework, generalising Dijkstra's predicate transformer and Hoare logic. The prototype implementation Qet is evaluated on a number of case studies taken from the literature and online references. Qet is able to fully automatically infer precise upper bounds on the expected costs that previously could only be derived by tedious manual calculations.

2026-04-05T05:36:55Z Georg Moser Michael Schaper