https://arxiv.org/api/Jnv1v4YBAAnj936MNRmcZ6Qv9/s2026-06-28T10:42:21Z9951154515http://arxiv.org/abs/2502.11844v3BaxBench: Can LLMs Generate Correct and Secure Backends?2025-05-30T13:01:16ZAutomatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third-parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.2025-02-17T14:37:47ZMark VeroNiels MündlerVictor ChibotaruVeselin RaychevMaximilian BaaderNikola JovanovićJingxuan HeMartin Vechevhttp://arxiv.org/abs/2505.24324v1SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation2025-05-30T08:06:30ZIn recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming languages, such as Swift, with high quality. By examining widely established multilingual benchmarks like HumanEval-XL and MultiPL-E, we identified critical issues specific to their Swift components, making them insufficient or even irrelevant for assessing LLM coding capabilities on Swift. Unlike these existing approaches, which prioritize rapid scaling and generalization by automatically translating Python-centric benchmarks with LLMs, we adopt a quality-over-quantity methodology. We present SwiftEval, the first Swift-oriented benchmark consisting of 28 carefully hand-crafted problems, and evaluate 44 popular Code LLMs on it. Our results show significant LLM scores drop for problems requiring language-specific features, most noticeable in the models of smaller sizes.2025-05-30T08:06:30ZAccepted to FORGE'25 Benchmarking on 15.01.2025, to be published by IEEE under the CC BY-NC-ND 4.0 license. This is the accepted version of the article (5 pages, 2 figures, 1 table). DOI will be added upon publicationIvan PetrukhaYana KurliakNataliia Stulovahttp://arxiv.org/abs/2501.08249v2Verifying Device Drivers with Pancake2025-05-30T00:48:15ZDevice driver bugs are the leading cause of OS compromises, and their formal verification is therefore highly desirable. To the best of our knowledge, no realistic and performant driver has been verified for a non-trivial device. We propose Pancake, an imperative language for systems programming that features a well-defined and verification-friendly semantics. Leveraging the verified compiler backend of the CakeML functional language, we develop a compiler for Pancake that guarantees that the binary retains the semantics of the source code. Usng automatic translation of Pancake to the Viper SMT front-end, we verify a performant driver for an Ethernet NIC.2025-01-14T16:40:05Z15 pages, 5 figuresJunming ZhaoMiki TanakaJohannes Åman PohjolaAlessandro LegnaniTiana Tsang UngH. TruongTsun Wang SauThomas SewellRob SisonHira SyedaMagnus MyreenMichael NorrishGernot Heiserhttp://arxiv.org/abs/2501.18160v3RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing2025-05-29T22:08:26ZCode auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability.
This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.2025-01-30T05:56:30Z18 pages, 11 tables, 6 figures, 3 listingsJinyao GuoChengpeng WangXiangzhe XuZian SuXiangyu Zhanghttp://arxiv.org/abs/2505.23296v1Is spreadsheet syntax better than numeric indexing for cell selection?2025-05-29T09:49:14ZSelecting a subset of cells is a common task in data engineering, for example, to remove errors or select only specific parts of a table. Multiple approaches to express this selection exist. One option is numeric indexing, commonly found in general programming languages, where a tuple of numbers identifies the cell. Alternatively, the separate dimensions can be referred to using different enumeration schemes like "A1" for the first cell, commonly found in software such as spreadsheet systems.
In a large-scale controlled experiment with student participants as proxy for data practitioners, we compare the two options with respect to speed and correctness of reading and writing code.
The results show that, when reading code, participants make less mistakes using spreadsheet-style syntax. Additionally, when writing code, they make fewer mistakes and are faster when using spreadsheet syntax compared to numeric syntax.
From this, a domain-specific syntax, such as spreadsheet syntax for data engineering, appears to be a promising alternative to explore in future tools to support practitioners without a software engineering background.2025-05-29T09:49:14ZPhilip HeltwegDirk RiehleGeorg-Daniel Schwarzhttp://arxiv.org/abs/2502.21026v3Artemis: Toward Accurate Detection of Server-Side Request Forgeries through LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis2025-05-29T07:34:13ZServer-side request forgery (SSRF) vulnerabilities are inevitable in PHP web applications. Existing static tools in detecting vulnerabilities in PHP web applications neither contain SSRF-related features to enhance detection accuracy nor consider PHP's dynamic type features. In this paper, we present Artemis, a static taint analysis tool for detecting SSRF vulnerabilities in PHP web applications. First, Artemis extracts both PHP built-in and third-party functions as candidate source and sink functions. Second, Artemis constructs both explicit and implicit call graphs to infer functions' relationships. Third, Artemis performs taint analysis based on a set of rules that prevent over-tainting and pauses when SSRF exploitation is impossible. Fourth, Artemis analyzes the compatibility of path conditions to prune false positives. We have implemented a prototype of Artemis and evaluated it on 250 PHP web applications. Artemis reports 207 true vulnerable paths (106 true SSRFs) with 15 false positives. Of the 106 detected SSRFs, 35 are newly found and reported to developers, with 24 confirmed and assigned CVE IDs.2025-02-28T13:14:58ZFull version of paper accepted by OOPSLA '25Yuchen JiTing DaiZhichao ZhouYutian TangJingzhu He10.1145/3720488http://arxiv.org/abs/2503.04779v4Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference2025-05-29T06:07:32ZLarge Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.2025-02-22T13:27:31ZAccepted to ACL 2025 (Main Conference)Thanh Le-CongBach LeToby Murrayhttp://arxiv.org/abs/2505.23061v1DINGO: Constrained Inference for Diffusion LLMs2025-05-29T04:04:54ZDiffusion LLMs have emerged as a promising alternative to conventional autoregressive LLMs, offering significant potential for improved runtime efficiency. However, existing diffusion models lack the ability to provably enforce user-specified formal constraints, such as regular expressions, which makes them unreliable for tasks that require structured outputs, such as fixed-schema JSON generation. Unlike autoregressive models that generate tokens sequentially, diffusion LLMs predict a block of tokens in parallel. This parallelism makes traditional constrained decoding algorithms, which are designed for sequential token prediction, ineffective at preserving the true output distribution. To address this limitation, we propose DINGO, a dynamic programming-based constrained decoding strategy that is both efficient and provably distribution-preserving. DINGO enables sampling of output strings with the highest probability under the model's predicted distribution, while strictly satisfying any user-specified regular expression. On standard symbolic math and JSON generation benchmarks, DINGO achieves up to a 68 percentage point improvement over unconstrained inference2025-05-29T04:04:54ZDINGO an algorithm to provably apply constraints to diffusion LLM generationsTarun SureshDebangshu BanerjeeShubham UgareSasa MisailovicGagandeep Singhhttp://arxiv.org/abs/2505.22610v1TPDE: A Fast Adaptable Compiler Back-End Framework2025-05-28T17:25:53ZFast machine code generation is especially important for fast start-up just-in-time compilation, where the compilation time is part of the end-to-end latency. However, widely used compiler frameworks like LLVM do not prioritize fast compilation and require an extra IR translation step increasing latency even further; and rolling a custom code generator is a substantial engineering effort, especially when targeting multiple architectures.
Therefore, in this paper, we present TPDE, a compiler back-end framework that adapts to existing code representations in SSA form. Using an IR-specific adapter providing canonical access to IR data structures and a specification of the IR semantics, the framework performs one analysis pass and then performs the compilation in just a single pass, combining instruction selection, register allocation, and instruction encoding. The generated target instructions are primarily derived code written in high-level language through LLVM's Machine IR, easing portability to different architectures while enabling optimizations during code generation.
To show the generality of our framework, we build a new back-end for LLVM from scratch targeting x86-64 and AArch64. Performance results on SPECint 2017 show that we can compile LLVM-IR 8--24x faster than LLVM -O0 while being on-par in terms of run-time performance. We also demonstrate the benefits of adapting to domain-specific IRs in JIT contexts, particularly WebAssembly and database query compilation, where avoiding the extra IR translation further reduces compilation latency.2025-05-28T17:25:53Z23 pages, 10 figuresTobias SchwarzTechnical University of MunichTobias KammTechnical University of MunichAlexis EngelkeTechnical University of Munichhttp://arxiv.org/abs/2505.21225v2Custom Representations of Inductive Families2025-05-28T13:25:26ZInductive families provide a convenient way of programming with dependent types. Yet, when it comes to compilation, their default linked-tree runtime representations, as well as the need to convert between different indexed views of the same data, can lead to unsatisfactory runtime performance. In this paper, we introduce a language with dependent types, and inductive families with customisable representations. Representations are a version of Wadler's views, refined to inductive families like in Epigram, but with compilation guarantees: a represented inductive family will not leave any runtime traces behind, without relying on heuristics such as deforestation. This way, we can build a library of convenient inductive families based on a minimal set of primitives, whose re-indexing and conversion functions are erased during compilation. We show how we can express optimisation techniques such as representing Nat-like types as GMP-style big integers, without special casing in the compiler. With dependent types, reasoning about data representations is also possible through a provided modality. This yields computationally irrelevant isomorphisms between the original and represented data.2025-05-27T14:10:31ZTo appear in the proceedings of TFP 2025Constantine TheocharisEdwin Bradyhttp://arxiv.org/abs/2505.22155v1An instance of FreeCHR with refined operational semantics2025-05-28T09:19:33ZConstraint Handling Rules (CHR) is a rule-based programming language which is typically embedded into a general-purpose language. There exists a plethora of implementations of CHR for numerous host languages. However, the existing implementations often reinvent the way to embed CHR, which impedes maintenance and weakens assertions of correctness. To formalize and thereby unify the embedding of CHR into arbitrary host languages, we introduced the framework FreeCHR and proved it to be a valid representation of classical CHR. Until now, this framework only includes a translation of the very abstract operational semantics of CHR which, due to its abstract nature, introduces several practical issues. In this paper, we introduce an execution algorithm for FreeCHR. We derive it from the refined operational semantics of CHR, which resolve the issues introduced by the very abstract semantics. We also prove soundness of the algorithm with respect to the very abstract semantics of FreeCHR. Hereby we provide a unified and an easy to implement guideline for new CHR implementations, as well as an algorithmic definition of the refined operational semantics.2025-05-28T09:19:33ZThis is a preprint of a paper submitted to the 27th International Symposium on Principles and Practice of Declarative Programming (PPDP '25). arXiv admin note: text overlap with arXiv:2504.04962Sascha RechenbergerThom Frühwirthhttp://arxiv.org/abs/2505.21661v1KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads2025-05-27T18:38:15ZIn this work, we propose KPerfIR, a novel multilevel compiler-centric infrastructure to enable the development of customizable, extendable, and portable profiling tools tailored for modern artificial intelligence (AI) workloads on modern GPUs. Our approach integrates profiling capabilities directly into the compiler workflow, allowing profiling functionalities to be implemented as compiler passes, offering a programmable and reusable framework for performance analysis. This design bridges the gap between compilers and profilers, enabling fine-grained insights into complex optimization challenges such as overlapping the execution of fine-grained function units on GPUs. KPerfIR is integrated into the Triton infrastructure to highlight the power of a compiler-centric approach to advance performance analysis and optimization in the ever-evolving landscape of AI compilers. Our evaluation shows that our tool incurs low overhead (8.2%), provides accurate measurements (2% relative error), and delivers actionable insights into complicated GPU intra-kernel optimizations.2025-05-27T18:38:15ZAccepted to OSDI 2025Yue GuanYuanwei FangKeren ZhouCorbin RobeckManman RenZhongkai YuYufei DingAdnan Azizhttp://arxiv.org/abs/2307.09776v3Full LTL Synthesis over Infinite-state Arenas2025-05-27T10:28:56ZRecently, interest has increased in applying reactive synthesis to richer-than-Boolean domains. A major (undecidable) challenge in this area is to establish when certain repeating behaviour terminates in a desired state when the number of steps is unbounded. Existing approaches struggle with this problem, or can handle at most deterministic games with Büchi goals. This work goes beyond by contributing the first effectual approach to synthesis with full LTL objectives, based on Boolean abstractions that encode both safety and liveness properties of the underlying infinite arena. We take a CEGAR approach: attempting synthesis on the Boolean abstraction, checking spuriousness of abstract counterstrategies through invariant checking, and refining the abstraction based on counterexamples. We reduce the complexity, when restricted to predicates, of abstracting and synthesising by an exponential through an efficient binary encoding. This also allows us to eagerly identify useful fairness properties. Our discrete synthesis tool outperforms the state-of-the-art on linear integer arithmetic (LIA) benchmarks from literature, solving almost double as many syntesis problems as the current state-of-the-art. It also solves slightly more problems than the second-best realisability checker, in one-third of the time. We also introduce benchmarks with richer objectives that other approaches cannot handle, and evaluate our tool on them.2023-07-19T06:33:51ZExtended version of identically-titled paper to be published in the proceedings of CAV 2025Shaun AzzopardiLuca Di StefanoNir PitermanGerardo Schneider10.1007/978-3-031-98685-7_13http://arxiv.org/abs/2505.20855v1Local Type Inference for Context-Free Session Types2025-05-27T08:05:23ZWe address the problem of local type inference for a language based on System F with context-free session types. We present an algorithm that leverages the bidirectional type checking approach to propagate type information, enabling first class polymorphism while addressing the intricacies brought about by the sequential composition operator and type equivalence. The algorithm improves the language's usability by eliminating the need for type annotations at type application sites.2025-05-27T08:05:23ZIn Proceedings PLACES 2025, arXiv:2505.19078EPTCS 420, 2025, pp. 1-11Bernardo AlmeidaLASIGE, Faculty of Sciences, University of Lisbon, PortugalAndreia MordidoLASIGE, Faculty of Sciences, University of Lisbon, PortugalVasco T. VasconcelosLASIGE, Faculty of Sciences, University of Lisbon, Portugal10.4204/EPTCS.420.1http://arxiv.org/abs/2505.20850v1An Efficient Implementation of Guard-Based Synchronization for an Object-Oriented Programming Language2025-05-27T08:02:25ZIn the shared variable model of concurrency, guarded atomic actions restrict the possible interference between processes by regions of atomic execution. The guard specifies the condition for entering an atomic region. That is a convenient model for the specification and verification of concurrent programs, but has eschewed efficient execution so far. This article shows how guarded atomic actions, when attached to objects, can be implemented highly efficiently using a combination of coroutines, operating-system worker threads, and dedicated management of object queues and stacks. The efficiency of an experimental language, Lime, is shown to compare favourably with that of C/Pthreads, Go, Erlang, Java, and Haskell on synthetic benchmarks.
2025-05-27T08:02:25ZIn Proceedings PLACES 2025, arXiv:2505.19078EPTCS 420, 2025, pp. 44-53Shucai YaoHuawai Technologies CanadaEmil SekerinskiMcMaster University10.4204/EPTCS.420.5