https://arxiv.org/api/VvnuQfiFBO9/5h2VSsW63DiA2Dk2026-03-22T12:02:43Z253024515http://arxiv.org/abs/2511.12294v2ProofWright: Towards Agentic Formal Verification of CUDA2026-03-18T07:09:05ZLarge Language Models (LLMs) are increasingly used to automatically generate optimized CUDA kernels, substantially improving developer productivity. However, despite rapid generation, these kernels often contain subtle correctness bugs and lack formal safety guarantees. Runtime testing is inherently unreliable - limited input coverage and reward hacking can mask incorrect behavior - while manual formal verification is reliable but cannot scale to match LLM output rates, creating a critical validation bottleneck.
We present ProofWright, an agentic verification framework that bridges this gap by integrating automated formal verification with LLM-based code generation. ProofWright provides end-to-end guarantees of memory safety, thread safety, and semantic correctness for LLM-generated CUDA kernels. On KernelBench L1, ProofWright verifies safety properties for 74% of generated kernels, uncovers subtle correctness errors missed by conventional testing, and establishes semantic equivalence for a class of element-wise kernels. With a modest overhead of 3 minutes per kernel, ProofWright demonstrates that scalable, automated formal verification of LLM-generated GPU code is feasible - offering a path toward trustworthy high-performance code generation without sacrificing developer productivity.2025-11-15T17:06:50ZBodhisatwa ChatterjeeDrew ZagieboyloSana DamaniSiva HariChristos Kozyrakishttp://arxiv.org/abs/2603.17399v1Bootstrapping Coding Agents: The Specification Is the Program2026-03-18T06:21:06ZA coding agent can bootstrap itself. Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch. This reproduces, in the domain of AI coding agents, the classical bootstrap sequence known from compiler construction, and instantiates the meta-circular property known from Lisp. The result carries a practical implication: the specification, not the implementation, is the stable artifact of record. Improving an agent means improving its specification; the implementation is, in principle, regenerable at any time.2026-03-18T06:21:06ZTo appear in IEEE SoftwareMartin Monperrushttp://arxiv.org/abs/2603.18071v1Circumventing Platform Defenses at Scale: Automated Content Replication from YouTube to Blockchain-Based Decentralized Storage2026-03-18T04:48:36ZWe present YouTube-Synch [1], a production system for automated, large-scale content extraction and replication from YouTube to decentralized storage on Joystream. The system continuously mirrors videos from more than 10,000 creator-authorized channels while handling platform constraints such as API quotas, rate limiting, bot detection, and OAuth token churn. We report a 3.5-year longitudinal case study covering 15 releases and 144 pull requests, from early API dependence to API-free operation. A key finding is that YouTube's defense layers are operationally coupled: bypassing one control often triggers another, creating cascading failures. We analyze three incidents with measured impact: 28 duplicate on-chain objects caused by database throughput issues, loss of over 10,000 channels after OAuth mass expiration, and 719 daily errors from queue pollution. For each, we describe the architectural response. Contributions include a three-generation proxy stack with behavior variance injection, a trust-minimized ownership verification protocol that replaces OAuth for channel control, write-ahead logging with cross-system state reconciliation, and containerized deployment. Results show that sustained architectural adaptation can maintain reliable cross-platform replication at production scale.2026-03-18T04:48:36ZZeeshan Akramhttp://arxiv.org/abs/2603.17330v1MLmisFinder: A Specification and Detection Approach of Machine Learning Service Misuses2026-03-18T03:48:20ZMachine Learning (ML) cloud services, offered by leading providers such as Amazon, Google, and Microsoft, enable the integration of ML components into software systems without building models from scratch. However, the rapid adoption of ML services, coupled with the growing complexity of business requirements, has led to widespread misuses, compromising the quality, maintainability, and evolution of ML service-based systems. Though prior research has studied patterns and antipatterns in service-based and ML-based systems separately, automatic detection of ML service misuses remains a challenge. In this paper, we propose MLmisFinder, an automatic approach to detect ML service misuses in software systems, aiming to identify instances of improper use of ML services to help developers properly integrate ML components in ML service-based systems. We propose a metamodel that captures the data needed to detect misuses in ML service-based systems and apply a set of rule-based detection algorithms for seven misuse types. We evaluated MLmisFinder on 107 software systems collected from open-source GitHub repositories and compared it with a state-of-the-art baseline. Our results show that MLmisFinder effectively detects ML service misuses, achieving an average precision of 96.7\% and recall of 97\%, outperforming the state-of-the-art baseline. MLmisFinder also scaled efficiently to detect misuses across 817 ML service-based systems and revealed that such misuses are widespread, especially in areas such as data drift monitoring and schema validation.2026-03-18T03:48:20ZHadil Ben AmorNiruthiha SelvanayagamManel AbdellatifTaher A. GhalebNaouel Mohahttp://arxiv.org/abs/2603.17266v1Revisiting Vulnerability Patch Identification on Data in the Wild2026-03-18T01:45:39ZAttacks can exploit zero-day or one-day vulnerabilities that are not publicly disclosed. To detect these vulnerabilities, security researchers monitor development activities in open-source repositories to identify unreported security patches. The sheer volume of commits makes this task infeasible to accomplish manually. Consequently, security patch detectors commonly trained and evaluated on security patches linked from vulnerability reports in the National Vulnerability Database (NVD). In this study, we assess the effectiveness of these detectors when applied in-the-wild. Our results show that models trained on NVD-derived data show substantially decreased performance, with decreases in F1-score of up to 90\% when tested on in-the-wild security patches, rendering them impractical for real-world use. An analysis comparing security patches identified in-the-wild and commits linked from NVD reveals that they can be easily distinguished from each other. Security patches associated with NVD have different distribution of commit messages, vulnerability types, and composition of changes. These differences suggest that NVD may be unsuitable as the \textit{sole} source of data for training models to detect security patches. We find that constructing a dataset that combines security patches from NVD data with a small subset of manually identified security patches can improve model robustness.2026-03-18T01:45:39ZIvana Clairine IrsanRatnadira WidyasariTing ZhangHuihui HuangFerdian ThungYikun LiLwin Khin SharEng Lieh OuhHong Jin KangDavid Lohttp://arxiv.org/abs/2603.18059v1Guardrails as Infrastructure: Policy-First Control for Tool-Orchestrated Workflows2026-03-18T01:19:33ZTool-using automation systems, from scripts and CI bots to agentic assistants, fail in recurring patterns. Common failures include unsafe side effects, invalid arguments, uncontrolled retries, and leakage of sensitive outputs. Many mitigations are model-centric and prompt-dependent, so they are brittle and do not generalize to non-LLM callers. We present Policy-First Tooling, a model-agnostic permission layer that mediates tool invocation through explicit constraints, risk-aware gating, recovery controls, and auditable explanations. The paper contributes a compact policy DSL, a runtime enforcement architecture with actionable rationale and fix hints, and a reproducible benchmark based on trace replay with controlled fault and misuse injection. In 225 controlled runs across five policy packs and three fault profiles, stricter packs improve violation prevention from 0.000 in P0 to 0.681 in P4, while task success drops from 0.356 to 0.067. Retry amplification decreases from 3.774 in P0 to 1.378 in P4, and leakage recall reaches 0.875 under injected secret outputs. These results make safety to utility trade-offs explicit and measurable.2026-03-18T01:19:33ZAkshey SigdelRista Baralhttp://arxiv.org/abs/2510.04468v2Improving IR-based Bug Localization with Semantics-Driven Query Reduction2026-03-17T23:49:13ZDespite decades of research, software bug localization remains challenging due to heterogeneous content and inherent ambiguities in bug reports. Existing methods, such as Information Retrieval (IR)-based approaches, often attempt to match source documents to bug reports, overlooking the context and semantics of the source code. On the other hand, Large Language Models (LLMs) (e.g., Transformer models) show promising results in understanding both texts and code. However, they have not yet been adapted well to localize software bugs using bug reports. They could also be data or resource-intensive. To bridge this gap, we propose, IQLoc, a novel approach that capitalizes on the strengths of both IR and LLMs for bug localization. In particular, we leverage the transformer-based model's understanding of code semantics to reason about its suspiciousness and to reformulate search queries and thus enhance bug localization using Information Retrieval. To evaluate IQLoc, we refine the Bench4BL benchmark dataset and extend it by incorporating ~30% more recent bug reports, resulting in a benchmark containing ~7.5K bug reports. We evaluated IQLoc using three performance metrics and compare it against eight baseline techniques. Experimental results demonstrate its superiority, achieving up to 100.40% and 78.08% in MAP, 61.49% and 64.58% in MRR, and 76.98% and 100.90% in HIT@K for the test bug reports with random and time-wise splits, respectively. Moreover, IQLoc improves MAP by 118.70% for bug reports with stack traces, 111.87% for those that include code elements, and 127.45% for those containing only descriptions in natural language. By integrating program semantic understanding into Information Retrieval, IQLoc mitigates several longstanding challenges of traditional IR-based approaches in bug localization.2025-10-06T03:43:38Z62 pages, 18 figures, 11 tablesAsif Mohammed SamirMohammad Masudur Rahmanhttp://arxiv.org/abs/2506.10204v2Code Roulette: How Prompt Variability Affects LLM Code Generation2026-03-17T23:20:31ZCode generation is one of the most active areas of application of Large Language Models (LLMs). While LLMs lower barriers to writing code and accelerate development process, the overall quality of generated programs depends on the quality of given prompts. Specifically, functionality and quality of generated code can be sensitive to user's background and familiarity with software development. It is therefore important to quantify LLM's sensitivity to variations in the input. To this end we propose an evaluation pipeline for LLM code generation with a focus on measuring sensitivity to prompt augmentations, completely agnostic to a specific programming tasks and LLMs, and thus widely applicable. We provide extensive experimental evidence illustrating utility of our method and share our code for the benefit of the community.2025-06-11T21:43:48ZExtended version of the paper accepted to LLM4Code @ ICSE 2026Andrei PaleyesRadzim SendykaDiana RobinsonChristian CabreraNeil D. Lawrencehttp://arxiv.org/abs/2603.17193v1Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization2026-03-17T22:46:42ZFormal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would make formal verification possible or at least make automated testing much more effective. This paper presents a study on the ability of LLMs in generating formal FULL pre- and post-conditions of a program, given its natural language description. 24 state-of-the-art LLMs were evaluated on a freshly prepared dataset of 40 tasks. The paper investigates specifications of varying difficulty and discusses a set of more refined performance metrics in addition the general accept@1 performance. It also investigates the impact of using automatically generated tests for validation of the solutions proposed by LLMs. The results of the experiment reveal that, in general LLMs can produce valid pre- and post-conditions based on natural language descriptions of programs. Incorrect solutions from proprietary models are also often near correct. A closer inspection shows that open-source models tend to result in a higher proportion of erroneous results while proprietary models tend to have slightly higher false negative rates. Interestingly, the results also show that augmenting the manually prepared dataset with automatically generated tests leads to the exposure of wrong solutions, which would have otherwise been accepted as correct. In general, LLMs perform better in formalizing pre-conditions than on post-conditions and proprietary models perform better than open ones. However, none of the LLMs were able to correctly formalize all the tasks in our benchmark. Overall, the effectiveness and reliability of LLMs in formalizing pre- and post-conditions could be greatly improved by using a good test suite that checks the correctness of the LLM generated formalizations.2026-03-17T22:46:42Z18 pagesI. S. W. B. PrasetyaFitsum KifetewDavide Prandihttp://arxiv.org/abs/2603.17174v1Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning2026-03-17T22:08:45ZCode generation large language models (LLMs) are increasingly integrated into modern software development workflows. Recent work has shown that these models are vulnerable to backdoor and poisoning attacks that induce the generation of insecure code, yet effective defenses remain limited. Existing scanning approaches rely on token-level generation consistency to invert attack targets, which is ineffective for source code where identical semantics can appear in diverse syntactic forms. We present CodeScan, which, to the best of our knowledge, is the first poisoning-scanning framework tailored to code generation models. CodeScan identifies attack targets by analyzing structural similarities across multiple generations conditioned on different clean prompts. It combines iterative divergence analysis with abstract syntax tree (AST)-based normalization to abstract away surface-level variation and unify semantically equivalent code, isolating structures that recur consistently across generations. CodeScan then applies LLM-based vulnerability analysis to determine whether the extracted structures contain security vulnerabilities and flags the model as compromised when such a structure is found. We evaluate CodeScan against four representative attacks under both backdoor and poisoning settings across three real-world vulnerability classes. Experiments on 108 models spanning three architectures and multiple model sizes demonstrate 97%+ detection accuracy with substantially lower false positives than prior methods.2026-03-17T22:08:45ZPreprintShenao YanShimaa AhmedShan JinSunpreet S. AroraYiwei CaiYizhen WangYuan Honghttp://arxiv.org/abs/2603.17162v1Energy Flow Graph: Modeling Software Energy Consumption2026-03-17T21:47:10ZThe growing energy demands of computational systems necessitate a fundamental shift from performance-centric design to one that treats energy consumption as one of the primary design considerations. Current approaches treat energy consumption as an aggregate, deterministic property, overlooking the path-dependent nature of computation, where different execution paths through the same software consume dramatically different energy. We introduce the Energy Flow Graph (EFG), a formal model that represents computational processes as state-transition systems with energy costs for both states and transitions. EFG enables various applications in software engineering, including static analysis of energy-optimal execution paths and a multiplicative cascade model that predicts combined optimization effects without exhaustive testing. Our early experiments demonstrate EFG's versatility across domains: in software programs validated through 3.5 million executions, 15.6% of solutions exhibit high path-dependent variance (CV $>$ 0.1), while structural optimization reveals up to 705$\times$ energy reduction. In AI pipelines, the cascade model predicts optimization combinations within 5.1% error, enabling selection from 4.2 million possibilities using only 22 measurements. The EFG transforms energy optimization from trial-and-error to systematic analysis, providing a foundation for green software engineering across computational domains.2026-03-17T21:47:10ZSaurabhsingh RajputTushar Sharmahttp://arxiv.org/abs/2603.17150v1Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents2026-03-17T21:28:59ZAgentic AI systems can now generate code with remarkable fluency, but a fundamental question remains: \emph{does the generated code actually do what the user intended?} The gap between informal natural language requirements and precise program behavior -- the \emph{intent gap} -- has always plagued software engineering, but AI-generated code amplifies it to an unprecedented scale. This article argues that \textbf{intent formalization} -- the translation of informal user intent into a set of checkable formal specifications -- is the key challenge that will determine whether AI makes software more reliable or merely more abundant. Intent formalization offers a tradeoff spectrum suitable to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically. The central bottleneck is \emph{validating specifications}: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests. We survey early research that demonstrates the \emph{potential} of this approach: interactive test-driven formalization that improves program correctness, AI-generated postconditions that catch real-world bugs missed by prior methods, and end-to-end verified pipelines that produce provably correct code from informal specifications. We outline the open research challenges -- scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, designing human-AI specification interactions -- that define a research agenda spanning AI, programming languages, formal methods, and human-computer interaction.2026-03-17T21:28:59Z10 pagesShuvendu K. Lahirihttp://arxiv.org/abs/2603.17133v1A Longitudinal Study of Usability in Identity-Based Software Signing2026-03-17T20:53:46ZIdentity-based software signing tools aim to make software artifact provenance verifiable while reducing the operational burden of long-lived key management. However, there is limited cross-tool longitudinal evidence about which usability problems arise in practice and how those problems evolve as tools mature. This gap matters because unusable signing and verification workflows can lead to incomplete adoption, misconfiguration, or skipped verification, undermining intended integrity guarantees.
We conducted the first mining-software-repositories study of five open-source identity-based signing ecosystems: Sigstore, OpenPubKey, HashiCorp Vault, Keyfactor, and Notary v2. We analyzed approximately 3,900 GitHub issues from Nov. 2021 to Nov. 2025. We coded each issue for the reported usability concern and the implicated architectural component, and compared patterns across tools and over time. Across ecosystems, reported concerns concentrate in verification workflows, policy and configuration surfaces, and integration boundaries. Longitudinal Poisson trend analysis shows substantial declines in reported issues for most ecosystems. However, across usability themes, workflow- and documentation-related concerns decline unevenly across tools and concern types, and verification workflows and configuration surfaces remain persistent friction points. These results indicate that identity-based signing reduces some usability burdens while relocating complexity to verification semantics, policy configuration, and deployment integration. Designing future signing ecosystems therefore requires treating verification semantics and release workflows as first-class usability targets rather than peripheral integration concerns.2026-03-17T20:53:46ZKelechi G. KaluHieu TranSantiago Torres-AriasSooyeon JeongJames C. Davishttp://arxiv.org/abs/2505.13766v4Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques2026-03-17T20:50:55ZSoftware Quality Assurance (SQA) is critical for delivering reliable, secure, and efficient software products. The Software Quality Assurance Process aims to provide assurance that work products and processes comply with predefined provisions and plans. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance existing SQA processes by automating tasks like requirement analysis, code review, test generation, and compliance checks. Simultaneously, established standards such as ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001/ISO/IEC 90003, CMMI, and TMM provide structured frameworks for ensuring robust quality practices. This paper surveys the intersection of LLM-based SQA methods and these recognized standards, highlighting how AI-driven solutions can augment traditional approaches while maintaining compliance and process maturity. We first review the foundational software quality standards and the technical fundamentals of LLMs in software engineering. Next, we explore various LLM-based SQA applications, including requirement validation, defect detection, test generation, and documentation maintenance. We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard. Empirical case studies and open-source initiatives demonstrate the practical viability of these methods. At the same time, discussions on challenges (e.g., data privacy, model bias, explainability) underscore the need for deliberate governance and auditing. Finally, we propose future directions encompassing adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards for AI-driven software quality.2025-05-19T22:49:30Z16 pages, 2 Table, 7 FiguresAvinash Patilhttp://arxiv.org/abs/2603.18049v1Conditional Execution of Transpiler Passes Based on Per-Script Feature Detection2026-03-17T19:54:22ZAs the ECMAScript specification evolves, industrial-scale JavaScript compilers face the challenge of supporting modern language syntax while maintaining compatibility for diverse execution environments. Traditionally, compilers solve this by running transpilation passes in a monolithic pipeline, where the transpilation passes are chosen to execute strictly based on a target language level. This results in significant computational waste, as compilers perform expensive Abstract Syntax Tree (AST) traversals to lower features that may not exist in the actual input source code. We present a compiler improvement that conditionally executes transpiler passes based on accurately tracking and dynamically maintaining the exact set of language features present in the compilation unit throughout the transpilation process. It is implemented in the production Google Closure Compiler. By populating and maintaining a FeatureSet at every JavaScript script-level, it dynamically skips running unnecessary lowering passes. We detail the architectural safeguards -- including strategic pass ordering and dynamic validation of the transpiled code for feature-correctness. Evaluation of this improvement on large-scale production monorepos produced a considerable reduction in compilation time and saved compute and memory usage.2026-03-17T19:54:22ZPreprint. Under review at SOAP 2026. Implementation available in Google Closure CompilerRishipal Singh Bhatia