https://arxiv.org/api/RkvxCcN9XL7qsBlCOecjk0vSxzU2026-06-14T15:32:26Z988540515http://arxiv.org/abs/2604.05246v1A Gradual Probabilistic Lambda Calculus2026-04-06T23:13:56ZProbabilistic programming languages have recently gained a lot of attention, in particular due to their applications in domains such as machine learning and differential privacy. To establish invariants of interest, many such languages include some form of static checking in the form of type systems. However, adopting such a type discipline can be cumbersome or overly conservative.
Gradual typing addresses this problem by supporting a smooth transition between static and dynamic checking, and has been successfully applied for languages with different constructs and type abstractions. Nevertheless, its benefits have never been explored in the context of probabilistic languages.
In this work, we present and formalize GPLC, a gradual source probabilistic lambda calculus. GPLC includes a binary probabilistic choice operator and allows programmers to gradually introduce/remove static type -- and probability -- annotations. The static semantics of GPLC heavily relies on the notion of probabilistic couplings, as required for defining several relations, such as consistency, precision, and consistent transitivity. The dynamic semantics of GPLC is given via elaboration to the target language TPLC, which features a distribution-based semantics interpreting programs as probability distributions over final values. Regarding the language metatheory, we establish that TPLC -- and therefore also GPLC -- is type safe and satisfies two of the so-called refined criteria for gradual languages, namely, that it is a conservative extension of a fully static variant and that it satisfies the gradual guarantee, behaving smoothly with respect to type precision.2026-04-06T23:13:56ZWenjia YeMatÃas ToroFederico Olmedohttp://arxiv.org/abs/2604.06258v1Accurate Residues for Floating-Point Debugging2026-04-06T22:46:06ZFloating-point arithmetic is error-prone and unintuitive. Floating-point debuggers instrument programs to monitor floating-point arithmetic at run time and flag numerical issues. They estimate residues, i.e., the difference between actual floating-point and ideal real values, for every floating-point value in the program. Prior work explores various approaches for computing these residues accurately and efficiently. Unfortunately, the most efficient methods, based on "error-free transformations", have a high rate of false reports, while the most accurate methods, based on high-precision arithmetic, are very slow. This paper builds on error-free-transformations-based approaches and aims to improve their accuracy while preserving efficiency. To more accurately compute residues, this paper divides residue computation into two steps (rounding error computation and residue function evaluation) and shows how to perform each step accurately via careful improvements to the current state of the art. We evaluate on 44 large scientific computing workloads, focusing on the 14 benchmarks where prior tools produce false reports: our approach eliminates false reports on 10 benchmarks and substantially reduces them on the remaining 3 benchmarks. Moreover, complex numerical issues require additional care due to absorption, where two machine-precision residues cannot both be computed accurately in a single execution. This paper introduces residue override, which re-executes the program multiple times, computing different residues in different executions and assembling a final "patchwork" execution. We evaluate on 169 standard benchmarks drawn from numerical analysis papers and textbooks, requiring only 3.6 re-executions on average. Among 34 benchmarks with false reports in the initial run, residue override is triggered on 29 of them and reduces false reports on 25 of them, averaging 7.1 re-executions.2026-04-06T22:46:06ZYumeng HePavel Panchekhahttp://arxiv.org/abs/2410.19940v4Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification2026-04-06T22:27:51ZFormal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.2024-10-25T19:25:00Z11 pages, 13 figures, Appearing at ICSE '26, Rio de Janeiro, BrazilSaketh Ram KasibatlaArpan AgarwalYuriy BrunSorin LernerTalia RingerEmily First10.1145/3744916.3773178http://arxiv.org/abs/2604.05137v1EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback2026-04-06T19:56:26ZLarge language models (LLMs) often generate code that is functionally correct but inefficient in runtime and memory. Prior approaches to improving code efficiency typically rely on absolute execution feedback, such as profiling a single program's runtime or memory usage, which is costly and provides weak guidance for refinement. We propose Relative Contrastive Feedback (RCF), an inference-time feedback mechanism that requires no model fine-tuning or parameter updates. RCF compares two structurally similar programs for the same task and highlights the differences associated with better efficiency. Building on this idea, we introduce EffiPair, an inference-time iterative refinement framework that operates entirely at test time by generating multiple candidate solutions, identifying informative program pairs with large efficiency gaps, summarizing their execution differences into lightweight feedback, and using this signal to produce more efficient solutions. By replacing isolated scalar feedback with pairwise contrastive comparisons, EffiPair provides more direct guidance while reducing profiling and prompting overhead. Experiments on code-efficiency benchmarks show that EffiPair consistently improves efficiency while preserving correctness. For instance, with DeepSeek-Chat V3.2, EffiPair achieves up to 1.5x speedup over generation without performance feedback, while reducing token usage by more than 90% compared to prior work.2026-04-06T19:56:26ZSamira HajizadehSuman Janahttp://arxiv.org/abs/2604.06253v1FLeX: Fourier-based Low-rank EXpansion for multilingual transfer2026-04-06T19:26:13ZCross-lingual code generation is critical in enterprise environments where multiple programming languages coexist. However, fine-tuning large language models (LLMs) individually for each language is computationally prohibitive. This paper investigates whether parameter-efficient fine-tuning methods and optimizer enhancements can improve cross-lingual transfer from Python to languages like Java. We fine-tune the Code Llama 7B model using low-rank adaptation (LoRA) to optimize a small subset of parameters and compare Adam and Sophia optimizers, while exploring a novel Fourier-based regularization technique. Our contributions include: (1)demonstrating that LoRA fine-tuning on a small, high-quality dataset (MBPP) can exceed the pass@1 performance of the more broadly fine-tuned Code Llama-Python-7B model (40.1% vs. 38.4%); (2) showing that while Sophia achieves faster convergence than Adam, final pass@1 scores show marginal differences; and (3) presenting evidence that Fourier-based regularization during fine-tuning significantly improves cross-lingual transfer, achieving 42.1% pass@1 on Java tasks compared to the 34.2% baseline. These findings suggest that combining LoRA, optimized training methods, and frequency-domain regularization can efficiently adapt single-language LLMs to perform well across multiple programming languages.2026-04-06T19:26:13Z19 pages, 25 figures, Stanford CS224N Custom ProjectGaurav Narasimhanhttp://arxiv.org/abs/2604.05066v1AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels2026-04-06T18:12:39ZData movement is the primary bottleneck in modern computing systems. For loop-based programs common in high-performance computing (HPC) and AI workloads, including matrix multiplication, tensor contraction, stencil computation, and einsum operations, the cost of moving data through the memory hierarchy often exceeds the cost of arithmetic.
This paper presents AutoLALA, an open-source tool that analyzes data locality in affine loop programs. The tool accepts programs written in a small domain-specific language (DSL), lowers them to polyhedral sets and maps, and produces closed-form symbolic formulas for reuse distance and data movement complexity. AutoLALA implements the fully symbolic locality analysis of Zhu et al. together with the data movement distance (DMD) framework of Smith et al. In particular, it computes reuse distance as the image of the access space under the access map, avoiding both stack simulation and Denning's recursive working-set formulation.
We describe the DSL syntax and its formal semantics, the polyhedral lowering pipeline that constructs timestamp spaces and access maps via affine transformations, and the sequence of Barvinok counting operations used to derive symbolic reuse-interval and reuse-distance distributions. The system is implemented in Rust as a modular library spanning three crates, with safe bindings to the Barvinok library.
We provide both a command-line interface and an interactive web playground with LaTeX rendering of the output formulas. The tool handles arbitrary affine loop nests, covering workloads such as tensor contractions, einsum expressions, stencil computations, and general polyhedral programs.2026-04-06T18:12:39ZYifan ZhuYekai PanYanghui WuChen Dinghttp://arxiv.org/abs/2604.05006v1Guidelines for Producing Concise LNT Models, Illustrated with Formal Models of the Algorand Consensus Protocol2026-04-06T09:09:52ZLNT is a modern language for the formal description of concurrent systems. It generalizes traditional process calculi and overcomes their known limitations by incorporating features such as an imperative programming style with direct assignments to variables, symmetric sequential composition, and explicit loop operators. The present article examines how these features can be taken advantage of to obtain LNT models as concise and readable as possible. The study is illustrated with a running example, the consensus protocol of the Algorand blockchain, a formal model of which was recently developed at the University of Urbino. It is shown that, using well-chosen transformations, the number of lines of LNT code can be divided by three, while improving readability. Also, various properties of the formal model are expressed and verified using visual checking, equivalence checking, and model checking.
2026-04-06T09:09:52ZIn Proceedings MARS 2026, arXiv:2604.03053EPTCS 443, 2026, pp. 43-83Hubert Garavel10.4204/EPTCS.443.5http://arxiv.org/abs/2604.04527v1ENCRUST: Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation2026-04-06T08:46:14ZWe present Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation, a two-phase pipeline for translating real-world C projects to safe Rust. Existing approaches either produce unsafe output without memory-safety guarantees or translate functions in isolation, failing to detect cross-unit type mismatches or handle unsafe constructs requiring whole-program reasoning. Furthermore, function-level LLM pipelines require coordinated caller updates when type signatures change, while project-scale systems often fail to produce compilable output under real-world dependency complexity. Encrust addresses these limitations by decoupling boundary adaptation from function logic via an Application Binary Interface (ABI)-preserving wrapper pattern and validating each intermediate state against the integrated codebase. Phase 1 (Encapsulated Substitution) translates each function using an ABI-preserving wrapper that splits it into two components: a caller-transparent shim retaining the original raw-pointer signature, and a safe inner function targeted by the LLM with a clean, scope-limited prompt. This enables independent per-function type changes with automatic rollback on failure, without coordinated caller updates. A deterministic, type-directed wrapper elimination pass then removes wrappers after successful translation. Phase 2 (Agentic Refinement) resolves unsafe constructs beyond per-function scope, including static mut globals, skipped wrapper pairs, and failed translations, using an LLM agent operating on the whole codebase under a baseline-aware verification gate. We evaluate Encrust on 7 GNU Coreutils programs and 8 libraries from the Laertes benchmark, showing substantial unsafe-construct reduction across all 15 programs while maintaining full test-vector correctness.2026-04-06T08:46:14ZHohyun SimHyeonjoong ChoAli ShokriZhoulai FuBinoy Ravindranhttp://arxiv.org/abs/2505.15858v3Search-Based Multi-Trajectory Refinement for Safe C-to-Rust Translation with Large Language Models2026-04-06T08:10:41ZThe C programming language has been foundational in building system-level software. However, its manual memory management model frequently leads to memory safety issues. In response, Rust has emerged as a memory-safe alternative. Moreover, automating the C-to-Rust translation empowered by the rapid advancements of the generative capabilities of LLMs is gaining growing interest for large volumes of legacy C code. Leveraging LLM for the C-to-Rust translation introduces distinct challenges, unlike the math or commonsense QA domains where the LLMs have been predominantly applied. First, the scarcity of parallel C-to-Rust datasets hinders the retrieval of suitable code translation exemplars for in-context learning. Second, unlike math or commonsense QA problems, the intermediate steps required for C-to-Rust are not well-defined. Third, it remains unclear how to organize and cascade these intermediate steps to construct a correct translation trajectory. While existing LLM-based approaches have achieved some success, they have relied on iterative code refinement along a single search trajectory on a C-to-Rust problem space and have not explored the use of systematic search mechanisms to navigate the space of possible refinement trajectories. To address these challenges in the C-to-Rust translation, we propose the MCTS-Guided LLM refinement technique for automated C-to-safe-Rust translation (LAC2R). LAC2R uses MCTS to systematically explore multiple refinement trajectories and organize the LLM-induced intermediate steps for correct translation. We experimentally demonstrated that LAC2R effectively conducts C-to-Rust translation on large-scale, real-world benchmarks. On small-scale benchmarks, LAC2R is the only method that simultaneously attains the highest safety ratio, perfect project-level correctness, and the fewest linter warnings among the compared methods.2025-05-21T01:26:23ZHoHyun SimHyeonjoong ChoYeonghyeon GoSadegh AlMahdi Kazemi ZarkoueiZhoulai FuAli ShokriBinoy Ravindranhttp://arxiv.org/abs/2604.04345v1Trace-Guided Synthesis of Effectful Test Generators2026-04-06T01:38:25ZSeveral recently proposed program logics have incorporated notions of underapproximation into their design, enabling them to reason about
reachability rather than safety. In this paper, we explore how similar ideas can be integrated into an expressive type and effect system. We
use the resulting underapproximate type specifications to guide the synthesis of test generators that probe the behavior of effectful
black-box systems. A key novelty of our type language is its ability to capture underapproximate behaviors of effectful operations using
symbolic traces that expose latent data and control dependencies, constraints that must be preserved by the test sequences the generator
outputs. We implement this approach in a tool called Clouseau, and evaluate it on a diverse range of applications by integrating Clouseau's
synthesized generators into property-based testing frameworks like QCheck and model-checking tools like P. In both settings, the generators
synthesized by Clouseau are significantly more effective than the default testing strategy, and are competitive with state-of-the-art,
handwritten solutions.2026-04-06T01:38:25ZZhe ZhouAnkush DesaiBenjamin DelawareSuresh Jagannathanhttp://arxiv.org/abs/2604.04238v1Agentic Code Optimization via Compiler-LLM Cooperation2026-04-05T19:44:02ZGenerating performant executables from high level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high-level reasoning about a program's purpose. Recent work has proposed using LLMs to fill this gap. While LLMs can achieve large speedups on some programs, they frequently generate code that is incorrect. In this work, we propose a method to balance the correctness of conventional compiler optimizations with the ``creativity'' of LLM-based code generation: compiler-LLM cooperation. Our approach integrates existing compiler optimization passes with LLM-based code generation at multiple levels of abstraction, retaining the best features of both types of code optimization. We realize our approach with a multi-agent system that includes (1) LLM-based optimization agents for each level of abstraction, (2) individual compiler constituents as tools, (3) an LLM-based test generation agent that probes the correctness and performance of generated code, and (4) a guiding LLM that orchestrates the other components. The strategy enables LLM-based optimization of input programs at multiple levels of abstraction and introduces a method for distributing computational budget between levels. Our extensive evaluation shows that compiler-LLM cooperation outperforms both existing compiler optimizations and level-specific LLM-based baselines, producing speedups up to 1.25x.2026-04-05T19:44:02ZBenjamin MikekDanylo VashchilenkoBryan LuPanpan Xuhttp://arxiv.org/abs/2604.04236v1NEURA: A Unified and Retargetable Compilation Framework for Coarse-Grained Reconfigurable Architectures2026-04-05T19:36:51ZCoarse-Grained Reconfigurable Architectures (CGRAs) are a promising and versatile accelerator platform, offering a balance between the performance and efficiency of specialized accelerators and the software programmability. However, their full potential is severely hindered by control flow in accelerated kernels, as the control flow (e.g., loops, branches) is fundamentally incompatible with the parallel, data-driven CGRA fabric. Prior strategies to resolve this mismatch in CGRA kernel acceleration are either inefficient, sacrificing performance for generality, or lack generality due to the difficulty of adapting them across different execution models. Thus, a general and unified solution for efficient CGRA kernel acceleration remains elusive.
This paper introduces NEURA, a unified and retargetable compilation framework that systematically resolves the control-dataflow mismatch in CGRAs. NEURA's core innovation is a novel, pure dataflow intermediate representation (IR) built on a predicated type system. In this IR, control contexts are embedded as a predicate within each data, making control an intrinsic property of data. This mechanism enables NEURA to systematically flatten complex control flow into a single unified dataflow graph. This unified representation decouples kernel representation from hardware, empowering NEURA to retarget diverse CGRAs with different execution models and microarchitectural features. When targeted to a high-performance spatio-temporal CGRA, NEURA delivers a 2.20x speedup on kernel benchmarks and up to 2.71x geometric mean speedup on real-world applications over state-of-the-art (SOTA) high-performance baselines. It also provides a competitive solution against the SOTA low-power CGRA when retargeted to a spatial-only CGRA. NEURA is open-source and available at https://github.com/coredac/neura.2026-04-05T19:36:51ZAccepted by PLDI 2026Shangkun LiJinming GeDiyuan TaoZeyu LiJiawei LiangLinfeng DuJiang XuWei ZhangCheng Tanhttp://arxiv.org/abs/2604.03986v1COBOL-Coder: Domain-Adapted Large Language Models for COBOL Code Generation and Translation2026-04-05T06:12:49ZCOBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL-Java translation). In our experiments, COBOL-Coder achieves up to a 73.95 percent compilation success rate and 49.33 Pass-1 on COBOLEval, compared to 41.8 percent and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches 34.93 Pass-1, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, has more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.2026-04-05T06:12:49ZAnh T. V. DauShin Hwei TanJinqiu YangNghi D. Q. BuiAnh Tuan Nguyenhttp://arxiv.org/abs/2604.03978v1COBOLAssist: Analyzing and Fixing Compilation Errors for LLM-Powered COBOL Code Generation2026-04-05T05:51:54ZLegacy programming languages such as COBOL (Common Business-Oriented Language) remain critical in business computing. However, maintaining legacy COBOL systems is increasingly challenging due to a declining pool of skilled developers and the persistence of COBOL errors that require deep domain expertise to resolve. This paper investigates the challenges of COBOL compilation errors and introduces a framework leveraging large language models (LLMs) to address these issues. We first categorize the common compilation errors in LLM-generated COBOL code into three groups: incomplete code errors, syntax errors, and type-related errors. We further propose COBOLAssist, a technique to enhance code correctness through iterative repairs guided by compilation feedback. Our evaluation using five LLMs including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5\% to 64.38\% for GPT-4o-mini and from 41.8\% to 95.89\% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4. Notably, while mAInframer-34B achieves the highest compilation success rate, its functional correctness remains limited. This research not only highlights the limitations in current LLMs for COBOL but also demonstrates a practical path forward for automated debugging in legacy systems.2026-04-05T05:51:54ZAnh T. V. DauShin Hwei TanJinqiu YangNghi D. Q. BuiAnh Tuan Nguyenhttp://arxiv.org/abs/2604.03971v1Automated Expected Cost Analysis for Quantum Programs2026-04-05T05:36:55ZIn recent years, quantum computing has gained a substantial amount of momentum, and the capabilities of quantum devices are continually expanding and improving. Nevertheless, writing a quantum program from scratch remains tedious and error-prone work, showcasing the clear demand for automated tool support. We present Qet, a fully automated static program analysis tool that yields a precise expected cost analysis of mixed classical-quantum programs. Qet supports programs with advanced features like mid-circuit measurements and classical control flow. The methodology of our prototype implementation is based on a recently proposed quantum expectation transformer framework, generalising Dijkstra's predicate transformer and Hoare logic. The prototype implementation Qet is evaluated on a number of case studies taken from the literature and online references. Qet is able to fully automatically infer precise upper bounds on the expected costs that previously could only be derived by tedious manual calculations.2026-04-05T05:36:55ZGeorg MoserMichael Schaper