https://arxiv.org/api/3DlJ6PbEAw0h5snamQ9OWEmxPUY2026-06-21T22:57:33Z991884015http://arxiv.org/abs/2512.13185v1Probabilistic Programming Meets Automata Theory: Exact Inference using Weighted Automata2025-12-15T10:49:52ZProbabilistic programs encode stochastic models as ordinary-looking programs with primitives for sampling numbers from predefined distributions and conditioning. Their applications include, among many others, machine learning and modeling of autonomous systems. The analysis of probabilistic programs is often quantitative - it involves reasoning about numerical properties like probabilities and expectations. A particularly important quantitative property of probabilistic programs is their posterior distribution, i.e., the distribution over possible outputs for a given input (or prior) distribution. Computing the posterior distribution exactly is known as exact inference. We present our current research using weighted automata, a generalization of the well-known finite automata, for performing exact inference in a restricted class of discrete probabilistic programs. This is achieved by encoding distributions over program variables - possibly with infinite support - as certain weighted automata. The semantics of our programming language then corresponds to common automata-theoretic constructions, such as product, concatenation, and others.2025-12-15T10:49:52ZDominik GeißlerTobias Winklerhttp://arxiv.org/abs/2507.00057v2Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation2025-12-13T19:27:44ZGenerating code from a natural language programming task is one of the most successful applications of Large Language Models (LLMs). Yet, the generated program may be buggy. Without an oracle, such as an existing, correct implementation or a formal specification, can we somehow estimate how likely the generated program is correct?
In this paper, we propose a measure of incorrectness, called *incoherence*, that can be estimated efficiently in the absence of an oracle and allows us to establish a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. In our experiments, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives for the average task. In fact, *an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation*. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via incoherence.2025-06-26T22:00:50ZAccepted at AAAI'26 (extended version). 8 pages + refs and appendix40th Annual AAAI Conference on Artificial Intelligence (AAAI), 2026Thomas ValentinArdi MadadiGaetano SapiaMarcel Böhmehttp://arxiv.org/abs/2512.11762v1The Relative Monadic Metalanguage2025-12-12T18:17:25ZRelative monads provide a controlled view of computation. We generalise the monadic metalanguage to a relative setting and give a complete semantics with strong relative monads. Adopting this perspective, we generalise two existing program calculi from the literature. We provide a linear-non-linear language for graded monads, LNL-RMM, along with a semantic proof that it is a conservative extension of the graded monadic metalanguage. Additionally, we provide a complete semantics for the arrow calculus, showing it is a restricted relative monadic metalanguage. This motivates the introduction of ARMM, a computational lambda calculus-style language for arrows that conservatively extends the arrow calculus.2025-12-12T18:17:25Z41 pages. Published in Proceedings of the ACM on Programming Languages (POPL 2026)Jack Liell-CockZev ShiraziSam Staton10.1145/3776702http://arxiv.org/abs/2512.11577v1Context-Dependent Effects and Concurrency in Guarded Interaction Trees2025-12-12T14:04:34ZGuarded Interaction Trees are a structure and a fully formalized framework for representing higher-order computations with higher-order effects in Rocq. We present an extension of Guarded Interaction Trees to support formal reasoning about context-dependent effects. That is, effects whose behaviors depend on the evaluation context, e.g., call/cc, shift and reset. Using and reasoning about such effects is challenging since certain compositionality principles no longer hold in the presence of such effects. For example, the so-called ``bind rule'' in modern program logics is no longer valid. The goal of our extension is to support representation and reasoning about context-dependent effects in the most painless way possible. To that end, our extension is conservative: the reasoning principles for context-independent effects remain the same. We use it to give direct-style denotational semantics for higher-order programming languages with call/cc and with delimited continuations. We extend the program logic for Guarded Interaction Trees to account for context-dependent effects, and we use the program logic to prove that the denotational semantics is adequate with respect to the operational semantics. Additionally, we retain the ability to combine multiple effects in a modular way, which we demonstrate by showing type soundness for safe interoperability of a programming language with delimited continuations and a programming language with higher-order store. Furthermore, as another contribution, in addition to context-dependent effects, we show how to extend Guarded Interaction Trees with preemptive concurrency. To support implementation and verification of concurrent data structures and algorithms in the presence of preemptive concurrency one requires atomic state modification operations, e.g., compare-and-exchange.2025-12-12T14:04:34ZSergei StepanenkoEmma NardinoVirgil MarionneauDan FruminAmin TimanyLars Birkedalhttp://arxiv.org/abs/2509.05504v2Comparing Methods for the Cross-Level Verification of SystemC Peripherals with Symbolic Execution2025-12-12T11:20:58ZVirtual Prototypes (VPs) are important tools in modern hardware development. At high abstractions, they are often implemented in SystemC and offer early analysis of increasingly complex designs. These complex designs often combine one or more processors, interconnects, and peripherals to perform tasks in hardware or interact with the environment. Verifying these subsystems is a well-suited task for VPs, as they allow reasoning across different abstraction levels. While modern verification techniques like symbolic execution can be seamlessly integrated into VP-based workflows, they require modifications in the SystemC kernel. Hence, existing approaches modify and replace the SystemC kernel, or ignore the opportunity of cross-level scenarios completely, and would not allow focusing on special challenges of particular subsystems like peripherals. We propose CrosSym and SEFOS, two opposing approaches for a versatile symbolic execution of peripherals. CrosSym modifies the SystemC kernel, while SEFOS instead modifies a modern symbolic execution engine. Our extensive evaluation applies our tools to various peripherals on different levels of abstractions. Both tools' extensive sets of features are demonstrated for (1) different verification scenarios, and (2) identifying 300+ mutants. In comparison with each other, SEFOS convinces with the unmodified SystemC kernel and peripheral, while CrosSym offers slightly better runtime and memory usage. In comparison to the state-of-the-art, that is limited to Transaction Level Modelling (TLM), our tools offered comparable runtime, while enabling cross-level verification with symbolic execution.2025-09-05T21:28:26ZKarl Aaron RudkowskiSallar Ahmadi-PourRolf Drechslerhttp://arxiv.org/abs/2512.15766v1LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models2025-12-12T11:09:48ZLoop transformations are semantics-preserving optimization techniques, widely used to maximize objectives such as parallelism. Despite decades of research, applying the optimal composition of loop transformations remains challenging due to inherent complexities, including cost modeling for optimization objectives. Recent studies have explored the potential of Large Language Models (LLMs) for code optimization. However, our key observation is that LLMs often struggle with effective loop transformation optimization, frequently leading to errors or suboptimal optimization, thereby missing opportunities for performance improvements. To bridge this gap, we propose LOOPRAG, a novel retrieval-augmented generation framework designed to guide LLMs in performing effective loop optimization on Static Control Part. We introduce a parameter-driven method to harness loop properties, which trigger various loop transformations, and generate diverse yet legal example codes serving as a demonstration source. To effectively obtain the most informative demonstrations, we propose a loop-aware algorithm based on loop features, which balances similarity and diversity for code retrieval. To enhance correct and efficient code generation, we introduce a feedback-based iterative mechanism that incorporates compilation, testing and performance results as feedback to guide LLMs. Each optimized code undergoes mutation, coverage and differential testing for equivalence checking. We evaluate LOOPRAG on PolyBench, TSVC and LORE benchmark suites, and compare it against compilers (GCC-Graphite, Clang-Polly, Perspective and ICX) and representative LLMs (DeepSeek and GPT-4). The results demonstrate average speedups over base compilers of up to 11.20$\times$, 14.34$\times$, and 9.29$\times$ for PolyBench, TSVC, and LORE, respectively, and speedups over base LLMs of up to 11.97$\times$, 5.61$\times$, and 11.59$\times$.2025-12-12T11:09:48ZAccepted to ASPLOS 2026Yijie ZhiYayu CaoJianhua DaiXiaoyang HanJingwen PuQingran WuSheng ChengMing Caihttp://arxiv.org/abs/2512.11200v1Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration2025-12-12T01:14:36ZCurrent AI code generation systems suffer from significant latency bottlenecks due to CPU-GPU data transfers during compilation, execution, and testing phases. We establish theoretical foundations for three complementary approaches to GPU-native compilation that eliminate these transfers: (1) parallel traditional compilation adapted for GPU execution, (2) neural compilation using learned sequence-to-sequence translation with probabilistic verification, and (3) hybrid architectures combining both strategies. We derive latency and energy bounds demonstrating potential speedups of 10-100x for code iteration cycles. Our analysis shows that traditional GPU compilation provides 2-5x improvements through transfer elimination, neural compilation achieves 10-100x speedups via massive parallelism, and hybrid approaches offer practical deployment paths with guaranteed correctness. We formalize the probabilistic verification framework that enables trading compilation accuracy for parallel exploration, and discuss implications for self-improving AI systems and future analog computing substrates.2025-12-12T01:14:36Z9 pages , 2 tablesAdilet MetinovGulida M. KudakeevaGulnara D. Kabaevahttp://arxiv.org/abs/2512.10799v1Zorya: Automated Concolic Execution of Single-Threaded Go Binaries2025-12-11T16:43:51ZGo's adoption in critical infrastructure intensifies the need for systematic vulnerability detection, yet existing symbolic execution tools struggle with Go binaries due to runtime complexity and scalability challenges. In this work, we build upon Zorya, a concolic execution framework that translates Go binaries to Ghidra's P-Code intermediate representation to address these challenges. We added the detection of bugs in concretely not taken paths and a multi-layer filtering mechanism to concentrate symbolic reasoning on panic-relevant paths. Evaluation on five Go vulnerabilities demonstrates that panic-reachability gating achieves 1.8-3.9x speedups when filtering 33-70% of branches, and that Zorya detects all panics while existing tools detect at most two. Function-mode analysis proved essential for complex programs, running roughly two orders of magnitude faster than starting from main. This work establishes that specialized concolic execution can achieve practical vulnerability detection in language ecosystems with runtime safety checks.2025-12-11T16:43:51ZKarolina GornaNicolas IoossYannick SeurinRida Khatoun10.1145/3748522.3779940http://arxiv.org/abs/2512.10779v1Lax Modal Lambda Calculi2025-12-11T16:15:39ZIntuitionistic modal logics (IMLs) extend intuitionistic propositional logic with modalities such as the box and diamond connectives. Advances in the study of IMLs have inspired several applications in programming languages via the development of corresponding type theories with modalities. Until recently, IMLs with diamonds have been misunderstood as somewhat peculiar and unstable, causing the development of type theories with diamonds to lag behind type theories with boxes. In this article, we develop a family of typed-lambda calculi corresponding to sublogics of a peculiar IML with diamonds known as Lax logic. These calculi provide a modal logical foundation for various strong functors in typed-functional programming. We present possible-world and categorical semantics for these calculi and constructively prove normalization, equational completeness and proof-theoretic inadmissibility results. Our main results have been formalized using the proof assistant Agda.2025-12-11T16:15:39ZTo appear at CSL 2026Nachiappan Valliappanhttp://arxiv.org/abs/2505.14690v3SGL: A Structured Graphics Language2025-12-10T16:44:45ZThis paper introduces SGL, a graphics language that is aesthetically similar to SQL. As a graphical counterpart to SQL, SGL enables specification of statistical graphics within SQL query interfaces. SGL is based on a grammar of graphics that has been customized to support a SQL aesthetic.
This paper presents the fundamental components of the SGL language alongside examples, and describes SGL's underlying grammar of graphics via comparison to its closest predecessor, the layered grammar of graphics.2025-04-19T02:15:06ZJon Chapmanhttp://arxiv.org/abs/2503.05507v2Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?2025-12-10T10:47:02ZGrammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.2025-03-07T15:23:13ZQingyuan LiangZhao ZhangZeyu SunZheng LinQi LuoYueyi XiaoYizhou ChenYuqun ZhangHaotian ZhangLu ZhangBin ChenYingfei Xionghttp://arxiv.org/abs/2510.14719v2Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References2025-12-10T03:26:05ZModern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines--a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.2025-10-16T14:20:00ZAccepted to CGO'26Hongzheng ChenBin FanAlexander CollinsBastian HagedornEvghenii GaburovMasahiro MasudaMatthew BrookhartChris SullivanJason KnightZhiru ZhangVinod Groverhttp://arxiv.org/abs/2510.16809v2When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation2025-12-09T12:20:58ZLarge Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.2025-10-19T12:29:13ZAccepted to ICSE 2026 (RECODE workshop)Amirkia Rafiei OskooeiKaan Baturalp CosdanHusamettin IsiktasMehmet S. Aktashttp://arxiv.org/abs/2304.03393v3Polymorphic Coverage Types2025-12-08T18:19:24ZTest input generators are an important part of property-based testing (PBT) frameworks, and a key expectation is that they be capable of producing all acceptable elements that satisfy both the function's input type and the generator-provided constraints. However, it is not readily apparent how to validate whether a particular generator's output satisfies this coverage requirement. In practice, developers typically rely on manual inspection and post-mortem analysis of test runs to determine whether a generator provides sufficient coverage; these approaches are error-prone and difficult to scale as generators grow more complex. To address this problem, we present a new refinement-type-based verification procedure for validating the coverage provided by input test generators, based on a novel interpretation of types that embeds "must-style" underapproximate reasoning principles as a fundamental part of the type system. In our formulation, the types associated with expressions capture the set of values guaranteed to be produced by the expression, rather than the usual interpretation in which types represent the set of values an expression may produce. We formalize the notion of coverage types in a rich core language supporting higher-order functions and inductive datatypes. To better support real-world test generators, we extend this type system with type and qualifier polymorphism, enabling static verification of coverage guarantees for test input generators constructed using the monadic combinators found in most PBT frameworks. Finally, we have implemented a coverage type checker for OCaml programs based on this core calculus and present a detailed evaluation of its utility using a corpus of benchmarks drawn from both the PBT literature and open-source projects.2023-04-06T21:46:54ZThis submission is an extended version of our PLDI 2023 conference paper: Covering All the Bases: Type-Based Verification of Test Input GeneratorsZhe ZhouAshish MishraBenjamin DelawareSuresh Jagannathanhttp://arxiv.org/abs/2512.07511v1Canonical bidirectional typechecking2025-12-08T12:47:25ZWe demonstrate that the checkable/synthesisable split in bidirectional typechecking coincides with existing dualities in polarised System L, also known as polarised $μ\tildeμ$-calculus. Specifically, positive terms and negative coterms are checkable, and negative terms and positive coterms are synthesisable. This combines a standard formulation of bidirectional typechecking with Zeilberger's `cocontextual' variant. We extend this to ordinary `cartesian' System L using Mc Bride's co-de Bruijn formulation of scopes, and show that both can be combined in a linear-nonlinear style, where linear types are positive and cartesian types are negative. This yields a remarkable 3-way coincidence between the shifts of polarised System L, LNL calculi, and bidirectional calculi.2025-12-08T12:47:25ZZanzi MihejevsJules Hedges