https://arxiv.org/api/3DlJ6PbEAw0h5snamQ9OWEmxPUY 2026-06-21T22:57:33Z 9918 840 15 http://arxiv.org/abs/2512.13185v1 Probabilistic Programming Meets Automata Theory: Exact Inference using Weighted Automata 2025-12-15T10:49:52Z

Probabilistic programs encode stochastic models as ordinary-looking programs with primitives for sampling numbers from predefined distributions and conditioning. Their applications include, among many others, machine learning and modeling of autonomous systems. The analysis of probabilistic programs is often quantitative - it involves reasoning about numerical properties like probabilities and expectations. A particularly important quantitative property of probabilistic programs is their posterior distribution, i.e., the distribution over possible outputs for a given input (or prior) distribution. Computing the posterior distribution exactly is known as exact inference. We present our current research using weighted automata, a generalization of the well-known finite automata, for performing exact inference in a restricted class of discrete probabilistic programs. This is achieved by encoding distributions over program variables - possibly with infinite support - as certain weighted automata. The semantics of our programming language then corresponds to common automata-theoretic constructions, such as product, concatenation, and others.

2025-12-15T10:49:52Z Dominik Geißler Tobias Winkler http://arxiv.org/abs/2507.00057v2 Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation 2025-12-13T19:27:44Z

Generating code from a natural language programming task is one of the most successful applications of Large Language Models (LLMs). Yet, the generated program may be buggy. Without an oracle, such as an existing, correct implementation or a formal specification, can we somehow estimate how likely the generated program is correct? In this paper, we propose a measure of incorrectness, called *incoherence*, that can be estimated efficiently in the absence of an oracle and allows us to establish a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. In our experiments, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives for the average task. In fact, *an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation*. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via incoherence.

2025-06-26T22:00:50Z Accepted at AAAI'26 (extended version). 8 pages + refs and appendix 40th Annual AAAI Conference on Artificial Intelligence (AAAI), 2026 Thomas Valentin Ardi Madadi Gaetano Sapia Marcel Böhme http://arxiv.org/abs/2512.11762v1 The Relative Monadic Metalanguage 2025-12-12T18:17:25Z

Relative monads provide a controlled view of computation. We generalise the monadic metalanguage to a relative setting and give a complete semantics with strong relative monads. Adopting this perspective, we generalise two existing program calculi from the literature. We provide a linear-non-linear language for graded monads, LNL-RMM, along with a semantic proof that it is a conservative extension of the graded monadic metalanguage. Additionally, we provide a complete semantics for the arrow calculus, showing it is a restricted relative monadic metalanguage. This motivates the introduction of ARMM, a computational lambda calculus-style language for arrows that conservatively extends the arrow calculus.

2025-12-12T18:17:25Z 41 pages. Published in Proceedings of the ACM on Programming Languages (POPL 2026) Jack Liell-Cock Zev Shirazi Sam Staton 10.1145/3776702 http://arxiv.org/abs/2512.11577v1 Context-Dependent Effects and Concurrency in Guarded Interaction Trees 2025-12-12T14:04:34Z

Guarded Interaction Trees are a structure and a fully formalized framework for representing higher-order computations with higher-order effects in Rocq. We present an extension of Guarded Interaction Trees to support formal reasoning about context-dependent effects. That is, effects whose behaviors depend on the evaluation context, e.g., call/cc, shift and reset. Using and reasoning about such effects is challenging since certain compositionality principles no longer hold in the presence of such effects. For example, the so-called ``bind rule'' in modern program logics is no longer valid. The goal of our extension is to support representation and reasoning about context-dependent effects in the most painless way possible. To that end, our extension is conservative: the reasoning principles for context-independent effects remain the same. We use it to give direct-style denotational semantics for higher-order programming languages with call/cc and with delimited continuations. We extend the program logic for Guarded Interaction Trees to account for context-dependent effects, and we use the program logic to prove that the denotational semantics is adequate with respect to the operational semantics. Additionally, we retain the ability to combine multiple effects in a modular way, which we demonstrate by showing type soundness for safe interoperability of a programming language with delimited continuations and a programming language with higher-order store. Furthermore, as another contribution, in addition to context-dependent effects, we show how to extend Guarded Interaction Trees with preemptive concurrency. To support implementation and verification of concurrent data structures and algorithms in the presence of preemptive concurrency one requires atomic state modification operations, e.g., compare-and-exchange.

2025-12-12T14:04:34Z Sergei Stepanenko Emma Nardino Virgil Marionneau Dan Frumin Amin Timany Lars Birkedal http://arxiv.org/abs/2509.05504v2 Comparing Methods for the Cross-Level Verification of SystemC Peripherals with Symbolic Execution 2025-12-12T11:20:58Z

Virtual Prototypes (VPs) are important tools in modern hardware development. At high abstractions, they are often implemented in SystemC and offer early analysis of increasingly complex designs. These complex designs often combine one or more processors, interconnects, and peripherals to perform tasks in hardware or interact with the environment. Verifying these subsystems is a well-suited task for VPs, as they allow reasoning across different abstraction levels. While modern verification techniques like symbolic execution can be seamlessly integrated into VP-based workflows, they require modifications in the SystemC kernel. Hence, existing approaches modify and replace the SystemC kernel, or ignore the opportunity of cross-level scenarios completely, and would not allow focusing on special challenges of particular subsystems like peripherals. We propose CrosSym and SEFOS, two opposing approaches for a versatile symbolic execution of peripherals. CrosSym modifies the SystemC kernel, while SEFOS instead modifies a modern symbolic execution engine. Our extensive evaluation applies our tools to various peripherals on different levels of abstractions. Both tools' extensive sets of features are demonstrated for (1) different verification scenarios, and (2) identifying 300+ mutants. In comparison with each other, SEFOS convinces with the unmodified SystemC kernel and peripheral, while CrosSym offers slightly better runtime and memory usage. In comparison to the state-of-the-art, that is limited to Transaction Level Modelling (TLM), our tools offered comparable runtime, while enabling cross-level verification with symbolic execution.

2025-09-05T21:28:26Z Karl Aaron Rudkowski Sallar Ahmadi-Pour Rolf Drechsler http://arxiv.org/abs/2512.15766v1 LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models 2025-12-12T11:09:48Z

Loop transformations are semantics-preserving optimization techniques, widely used to maximize objectives such as parallelism. Despite decades of research, applying the optimal composition of loop transformations remains challenging due to inherent complexities, including cost modeling for optimization objectives. Recent studies have explored the potential of Large Language Models (LLMs) for code optimization. However, our key observation is that LLMs often struggle with effective loop transformation optimization, frequently leading to errors or suboptimal optimization, thereby missing opportunities for performance improvements. To bridge this gap, we propose LOOPRAG, a novel retrieval-augmented generation framework designed to guide LLMs in performing effective loop optimization on Static Control Part. We introduce a parameter-driven method to harness loop properties, which trigger various loop transformations, and generate diverse yet legal example codes serving as a demonstration source. To effectively obtain the most informative demonstrations, we propose a loop-aware algorithm based on loop features, which balances similarity and diversity for code retrieval. To enhance correct and efficient code generation, we introduce a feedback-based iterative mechanism that incorporates compilation, testing and performance results as feedback to guide LLMs. Each optimized code undergoes mutation, coverage and differential testing for equivalence checking. We evaluate LOOPRAG on PolyBench, TSVC and LORE benchmark suites, and compare it against compilers (GCC-Graphite, Clang-Polly, Perspective and ICX) and representative LLMs (DeepSeek and GPT-4). The results demonstrate average speedups over base compilers of up to 11.20$\times$, 14.34$\times$, and 9.29$\times$ for PolyBench, TSVC, and LORE, respectively, and speedups over base LLMs of up to 11.97$\times$, 5.61$\times$, and 11.59$\times$.

2025-12-12T11:09:48Z Accepted to ASPLOS 2026 Yijie Zhi Yayu Cao Jianhua Dai Xiaoyang Han Jingwen Pu Qingran Wu Sheng Cheng Ming Cai http://arxiv.org/abs/2512.11200v1 Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration 2025-12-12T01:14:36Z

Current AI code generation systems suffer from significant latency bottlenecks due to CPU-GPU data transfers during compilation, execution, and testing phases. We establish theoretical foundations for three complementary approaches to GPU-native compilation that eliminate these transfers: (1) parallel traditional compilation adapted for GPU execution, (2) neural compilation using learned sequence-to-sequence translation with probabilistic verification, and (3) hybrid architectures combining both strategies. We derive latency and energy bounds demonstrating potential speedups of 10-100x for code iteration cycles. Our analysis shows that traditional GPU compilation provides 2-5x improvements through transfer elimination, neural compilation achieves 10-100x speedups via massive parallelism, and hybrid approaches offer practical deployment paths with guaranteed correctness. We formalize the probabilistic verification framework that enables trading compilation accuracy for parallel exploration, and discuss implications for self-improving AI systems and future analog computing substrates.

2025-12-12T01:14:36Z 9 pages , 2 tables Adilet Metinov Gulida M. Kudakeeva Gulnara D. Kabaeva http://arxiv.org/abs/2512.10799v1 Zorya: Automated Concolic Execution of Single-Threaded Go Binaries 2025-12-11T16:43:51Z

Go's adoption in critical infrastructure intensifies the need for systematic vulnerability detection, yet existing symbolic execution tools struggle with Go binaries due to runtime complexity and scalability challenges. In this work, we build upon Zorya, a concolic execution framework that translates Go binaries to Ghidra's P-Code intermediate representation to address these challenges. We added the detection of bugs in concretely not taken paths and a multi-layer filtering mechanism to concentrate symbolic reasoning on panic-relevant paths. Evaluation on five Go vulnerabilities demonstrates that panic-reachability gating achieves 1.8-3.9x speedups when filtering 33-70% of branches, and that Zorya detects all panics while existing tools detect at most two. Function-mode analysis proved essential for complex programs, running roughly two orders of magnitude faster than starting from main. This work establishes that specialized concolic execution can achieve practical vulnerability detection in language ecosystems with runtime safety checks.

2025-12-11T16:43:51Z Karolina Gorna Nicolas Iooss Yannick Seurin Rida Khatoun 10.1145/3748522.3779940 http://arxiv.org/abs/2512.10779v1 Lax Modal Lambda Calculi 2025-12-11T16:15:39Z

Intuitionistic modal logics (IMLs) extend intuitionistic propositional logic with modalities such as the box and diamond connectives. Advances in the study of IMLs have inspired several applications in programming languages via the development of corresponding type theories with modalities. Until recently, IMLs with diamonds have been misunderstood as somewhat peculiar and unstable, causing the development of type theories with diamonds to lag behind type theories with boxes. In this article, we develop a family of typed-lambda calculi corresponding to sublogics of a peculiar IML with diamonds known as Lax logic. These calculi provide a modal logical foundation for various strong functors in typed-functional programming. We present possible-world and categorical semantics for these calculi and constructively prove normalization, equational completeness and proof-theoretic inadmissibility results. Our main results have been formalized using the proof assistant Agda.

2025-12-11T16:15:39Z To appear at CSL 2026 Nachiappan Valliappan http://arxiv.org/abs/2505.14690v3 SGL: A Structured Graphics Language 2025-12-10T16:44:45Z

This paper introduces SGL, a graphics language that is aesthetically similar to SQL. As a graphical counterpart to SQL, SGL enables specification of statistical graphics within SQL query interfaces. SGL is based on a grammar of graphics that has been customized to support a SQL aesthetic. This paper presents the fundamental components of the SGL language alongside examples, and describes SGL's underlying grammar of graphics via comparison to its closest predecessor, the layered grammar of graphics.

2025-04-19T02:15:06Z Jon Chapman http://arxiv.org/abs/2503.05507v2 Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs? 2025-12-10T10:47:02Z

Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.

2025-03-07T15:23:13Z Qingyuan Liang Zhao Zhang Zeyu Sun Zheng Lin Qi Luo Yueyi Xiao Yizhou Chen Yuqun Zhang Haotian Zhang Lu Zhang Bin Chen Yingfei Xiong http://arxiv.org/abs/2510.14719v2 Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References 2025-12-10T03:26:05Z

Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines--a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.

2025-10-16T14:20:00Z Accepted to CGO'26 Hongzheng Chen Bin Fan Alexander Collins Bastian Hagedorn Evghenii Gaburov Masahiro Masuda Matthew Brookhart Chris Sullivan Jason Knight Zhiru Zhang Vinod Grover http://arxiv.org/abs/2510.16809v2 When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation 2025-12-09T12:20:58Z

Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.

2025-10-19T12:29:13Z Accepted to ICSE 2026 (RECODE workshop) Amirkia Rafiei Oskooei Kaan Baturalp Cosdan Husamettin Isiktas Mehmet S. Aktas http://arxiv.org/abs/2304.03393v3 Polymorphic Coverage Types 2025-12-08T18:19:24Z

Test input generators are an important part of property-based testing (PBT) frameworks, and a key expectation is that they be capable of producing all acceptable elements that satisfy both the function's input type and the generator-provided constraints. However, it is not readily apparent how to validate whether a particular generator's output satisfies this coverage requirement. In practice, developers typically rely on manual inspection and post-mortem analysis of test runs to determine whether a generator provides sufficient coverage; these approaches are error-prone and difficult to scale as generators grow more complex. To address this problem, we present a new refinement-type-based verification procedure for validating the coverage provided by input test generators, based on a novel interpretation of types that embeds "must-style" underapproximate reasoning principles as a fundamental part of the type system. In our formulation, the types associated with expressions capture the set of values guaranteed to be produced by the expression, rather than the usual interpretation in which types represent the set of values an expression may produce. We formalize the notion of coverage types in a rich core language supporting higher-order functions and inductive datatypes. To better support real-world test generators, we extend this type system with type and qualifier polymorphism, enabling static verification of coverage guarantees for test input generators constructed using the monadic combinators found in most PBT frameworks. Finally, we have implemented a coverage type checker for OCaml programs based on this core calculus and present a detailed evaluation of its utility using a corpus of benchmarks drawn from both the PBT literature and open-source projects.

2023-04-06T21:46:54Z This submission is an extended version of our PLDI 2023 conference paper: Covering All the Bases: Type-Based Verification of Test Input Generators Zhe Zhou Ashish Mishra Benjamin Delaware Suresh Jagannathan http://arxiv.org/abs/2512.07511v1 Canonical bidirectional typechecking 2025-12-08T12:47:25Z

We demonstrate that the checkable/synthesisable split in bidirectional typechecking coincides with existing dualities in polarised System L, also known as polarised $μ\tildeμ$-calculus. Specifically, positive terms and negative coterms are checkable, and negative terms and positive coterms are synthesisable. This combines a standard formulation of bidirectional typechecking with Zeilberger's `cocontextual' variant. We extend this to ordinary `cartesian' System L using Mc Bride's co-de Bruijn formulation of scopes, and show that both can be combined in a linear-nonlinear style, where linear types are positive and cartesian types are negative. This yields a remarkable 3-way coincidence between the shifts of polarised System L, LNL calculi, and bidirectional calculi.

2025-12-08T12:47:25Z Zanzi Mihejevs Jules Hedges