https://arxiv.org/api/RmNBbYQDcYyiwj8AJKCNEteRSYU 2026-03-22T15:53:06Z 25302 90 15 http://arxiv.org/abs/2505.23135v3 VERINA: Benchmarking Verifiable Code Generation 2026-03-16T23:40:56Z

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o3, achieves a 72.6\% code correctness rate, 52.3\% for specification soundness and completeness, and a mere 4.9\% proof success rate (based on one trial per task). We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

2025-05-29T06:12:52Z Zhe Ye Zhengxu Yan Jingxuan He Timothe Kasriel Kaiyu Yang Dawn Song http://arxiv.org/abs/2603.16011v1 Evaluating Agentic Optimization on Large Codebases 2026-03-16T23:40:19Z

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling the holistic ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io

2026-03-16T23:40:19Z Preprint version Atharva Sehgal James Hou Akanksha Sarkar Ishaan Mantripragada Swarat Chaudhuri Jennifer J. Sun Yisong Yue http://arxiv.org/abs/2603.15935v1 Test Code Review in the Era of GitHub Actions: A Replication Study 2026-03-16T21:31:30Z

Test code is indispensable in software development, ensuring the correctness of production code and supporting maintainability. Nonetheless, errors or omissions in the test code can conceal production defects. While code review is widely adopted to assess code quality and correctness, little research has examined how test code is reviewed. Spadini et al.'s research on Gerrit (a pre-commit review model) found that test code receives significantly less discussion than production code. However, the most popular review model is currently based on pull requests (PRs), in which contributors propose changes for discussion and approval, a more negotiable and flexible model compared to Gerrit. Furthermore, GitHub Actions (GHA) has become widely used to automate pre-checks and testing, potentially impacting review practices. This leads us to explore whether Spadini et al.'s findings still hold for the PR model in the era of GHA? Our work replicates and extends their work. We focus on GitHub PRs and analyze six open-source projects. We investigate the impact of the PR model and GHA on test code review. Our results show that GitHub's PR model fosters more balanced discussions between test and production files than Gerrit, albeit with lower overall comment density. However, despite cross-project heterogeneity, GHA adoption triggered a sharp pivot toward production code. Post-GHA, for PRs involving tests, both review probability and comment density reached a median of zero. These findings reveal how evolving continuous integration pipelines can marginalize test code review. The observed decline in test-centric discussion under GHA warrants concern regarding long-term software quality. Our work also presents recommendations for stakeholders involved in the software development life cycle.

2026-03-16T21:31:30Z Hui Sun Yinan Wu Wesley K. G. Assunção Kathryn T. Stolee http://arxiv.org/abs/2603.15921v1 VIBEPASS: Can Vibe Coders Really Pass the Vibe Check? 2026-03-16T21:14:28Z

As Large Language Models shift the programming toward human-guided ''vibe coding'', agentic coding tools increasingly rely on models to self-diagnose and repair their own subtle faults -- a capability central to autonomous software engineering yet never systematically evaluated. We present \name{}, the first empirical decomposition that jointly evaluates two coupled tasks: \emph{Fault-Triggering Test Generation (FT-Test)} constructing a discriminative witness that exposes a latent bug, and \emph{Fault-targeted Program Repair (FPR)}, repairing it under varying diagnostic conditions. \name{} pairs competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down. Evaluating 12 frontier LLMs, we find that fault-targeted reasoning does not scale with general coding ability. Models produce syntactically valid test inputs at near-ceiling rates yet collapse on discriminative generation, with fault hypothesis generation -- not output validation -- as the dominant bottleneck. Test-guided repair reveals a complementary insight: when self-generated tests successfully witness a fault, the resulting repair matches or outperforms repair guided by externally provided tests, but tests that fail to witness the fault actively degrade repair below unguided baselines. Together, these results reframe the challenge of autonomous debugging: the binding bottleneck is not code synthesis or test validity but fault-target reasoning, a capability that remains deficient across all frontier models. As Large Language Models shift the programming toward human-guided ''vibe coding'', agentic coding tools increasingly rely on models to self-diagnose and repair their own subtle faults -- a capability central to autonomous software engineering yet never systematically evaluated.

2026-03-16T21:14:28Z Srijan Bansal Jiao Fangkai Yilun Zhou Austin Xu Shafiq Joty Semih Yavuz http://arxiv.org/abs/2603.15911v1 Human-AI Synergy in Agentic Code Review 2026-03-16T20:56:18Z

Code review is a critical software engineering practice where developers review code changes before integration to ensure code quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is limited empirical evidence to compare the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. In our study, we aim to compare the feedback differences provided by human reviewers and AI agents. We investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes. Moreover, we analyze the adoption of code suggestions provided by human reviewers and AI agents into the codebase and how adopted suggestions change code quality. We find that human reviewers provide additional feedback than AI agents, including understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more rounds when reviewing AI-generated code than human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers. Over half of unadopted suggestions from AI agents are either incorrect or addressed through alternative fixes by developers. When adopted, suggestions provided by AI agents produce significantly larger increases in code complexity and code size than suggestions provided by human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and providing contextual feedback that AI agents lack.

2026-03-16T20:56:18Z 14 pages, 6 figures, 14 tables Suzhen Zhong Shayan Noei Ying Zou Bram Adams http://arxiv.org/abs/2603.15883v1 Self-Admitted Technical Debt in Scientific Software: Prioritization, Sentiment, and Propagation Across Artifacts 2026-03-16T20:23:07Z

Self-admitted technical debt (SATD) impairs scientific software (SSW), yet its prioritization, sentiment, persistence, and propagation remains underexplored. Understanding how SSW developers express, and address SATD is crucial for improving SSW maintenance, and tooling. This study investigates how SATD types and artifacts in SSW are prioritized, how sentiment relates to urgency, SATD removal and resolution rates, and the extent to which SATD propagates across artifacts. We analyzed nine SSW repositories using a SATD classification model and a semantic embedding-based prioritization heuristic. SATD was examined across multiple artifacts, with sentiment assessed via a fine-tuned transformer. Propagation was traced, priority scores compared to static analysis, and removal and resolution rates quantified. SATD in comments, commits, and pull requests receive higher priority than SATD in issues, with negative sentiment amplifying urgency. Resolution and removal rates lag behind open-source software (OSS) averages. Most SATD remains confined to the originating artifact, but longer propagation chains are rare and correlate with higher priority, highlighting persistent and high impact debt. Prioritization is influenced by artifact type and sentiment, while low removal and resolution rates signal persistent debt. Cross-artifact propagation marks high priority, unresolved SATD, providing empirical guidance for targeted monitoring, review prioritization, and tool supported maintenance in SSW.

2026-03-16T20:23:07Z Eric L. Melin Nasir U. Eisty Gregory R. Watson Addi Malviya-Thakur http://arxiv.org/abs/2603.15833v1 A Comparative Analysis of Backbone Algorithms for Configurable Software Systems 2026-03-16T19:04:38Z

The backbone of a Boolean formula is the set of literals that must be true in every assignment that satisfies the formula. This concept is fundamental to key operations on variability models, including propagating user configuration decisions to identify implied feature selections, detecting dead features and dead code blocks, and preprocessing formulas to accelerate knowledge compilation into tractable representations such as binary decision diagrams. Despite its importance, previous empirical studies have evaluated backbone algorithms solely on SAT competition formulas (typically engineered to test the limits of SAT solvers), leading to inconsistent conclusions about their performance. This study provides the first comprehensive evaluation of formulas derived from real-world variability models, analyzing 21 configurations of 5 state-of-the-art algorithms on 2,371 formulas from configurable systems ranging from 100 variables and 179 clauses to 186,059 variables and 527,240 clauses. The results indicate that variability model formulas are structurally distinct, with higher clause density but greater clause simplicity. Our research provides clear algorithm selection guidelines: Algorithm 2/3 (iterative with solution filtering) is recommended for formulas with 1,000 or fewer variables, while Algorithm 5 (chunked core-based) with adaptive chunk size selection provides the best practical performance for larger formulas. Also, the results show that filtering heuristics have negligible or negative effects on performance for variability models. Finally, the study identifies a research gap: while Algorithm 5 with optimal chunk size can achieve runtime reductions exceeding 50\% compared to Algorithm 2/3 (the one that product line tools implement), the optimal chunk size varies unpredictably across formulas and cannot currently be estimated, opening directions for future research.

2026-03-16T19:04:38Z Luis Cambelo Ruben Heradio Jose-Miguel Horcas Dictino Chaos David Fernandez-Amoros http://arxiv.org/abs/2603.15727v1 ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems 2026-03-16T17:55:43Z

Autonomous LLM-based agents increasingly operate as long-running processes forming densely interconnected multi-agent ecosystems, whose security properties remain largely unexplored. In particular, OpenClaw, an open-source platform with over 40{,}000 active instances, has stood out recently with its persistent configurations, tool-execution privileges, and cross-platform messaging capabilities. In this work, we present ClawWorm, the first self-replicating worm attack against a production-scale agent framework, achieving a fully autonomous infection cycle initiated by a single message: the worm first hijacks the victim's core configuration to establish persistent presence across session restarts, then executes an arbitrary payload upon each reboot, and finally propagates itself to every newly encountered peer without further attacker intervention. We evaluate the attack on a controlled testbed across three distinct infection vectors and three payload types, demonstrating high success rates in end-to-end infection, sustained multi-hop propagation, and payload independence from the worm mechanism. We analyse the architectural root causes underlying these vulnerabilities and propose defence strategies targeting each identified trust boundary. Code and samples will be released upon completion of responsible disclosure.

2026-03-16T17:55:43Z Yihao Zhang Zeming Wei Xiaokun Luan Chengcan Wu Zhixin Zhang Jiangrong Wu Haolin Wu Huanran Chen Jun Sun Meng Sun http://arxiv.org/abs/2603.15566v1 Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents 2026-03-16T17:27:30Z

As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.

2026-03-16T17:27:30Z 8 pages, 1 figure, 1 table. Preprint available at https://doi.org/10.5281/zenodo.19051840 Ivan Stetsenko 10.5281/zenodo.19051840 http://arxiv.org/abs/2603.15559v1 Probabilistic Model Checking Taken by Storm 2026-03-16T17:22:21Z

This tutorial paper presents a hands-on perspective on probabilistic model checking with the Storm model checker. Storm is a decade-old model checker that excels in performance and a rich Python-based ecosystem, which makes it easy to integrate in various workflows. This tutorial focuses on Markov decision processes (MDP), which are popular in a variety of fields. It demonstrates the basic workflow, from Python-based modeling, model checking with a variety of properties, to the extraction of policies. Further, it showcases the support for recent topics that focus on different types of uncertainty, such as interval MDP and POMDP, and the ability to quickly implement simple algorithms on top of existing data structures.

2026-03-16T17:22:21Z Matthias Volk Linus Heck Sebastian Junges Joost-Pieter Katoen Tim Quatmann http://arxiv.org/abs/2507.14642v3 Efficient Story Point Estimation With Comparative Learning 2026-03-16T16:24:42Z

Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers have collaboratively estimated story points using planning poker or other manual techniques. Machine learning can reduce this burden, but only with sufficient context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate (within-project) predictions when they are trained on data from the same project. The goal of this study is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items and asked to indicate which item requires more effort. Using these comparative judgments, a machine learning model was trained to predict the story point estimates. We empirically evaluated our technique using data from 23,313 manual estimates across 16 projects. The model trained on comparative judgments achieved, on average, a 0.34 Spearman's rank correlation coefficient between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a state-of-the-art regression model trained on ground truth story points. Through human subject experiments, the advantages of comparative judgments were validated - higher confidence, lower annotation time, and comparable agreement were observed for comparative judgments compared to direct ratings. In summary, the proposed comparative learning approach is more efficient than regression-based approaches, given its better performance, lower required annotation time, and higher training data reliability.

2025-07-19T14:36:19Z Monoshiz Mahbub Khan Xiaoyin Xi Andrew Meneely Yiming Tang Zhe Yu http://arxiv.org/abs/2603.15427v1 Formalisms for Robotic Mission Specification and Execution: A Comparative Analysis 2026-03-16T15:34:48Z

Robots are increasingly deployed across diverse domains and designed for multi-purpose operation. As robotic systems grow in complexity and operate in dynamic environments, the need for structured, expressive, and scalable mission-specification approaches becomes critical, with mission specifications often defined in the field by domain experts rather than robotics specialists. However, there is no standard or widely accepted formalism for specifying missions in single- or multi-robot systems. A variety of formalisms, such as Behavior Trees, State Machines, Hierarchical Task Networks, and Business Process Model and Notation, have been adopted in robotics to varying degrees, each providing different levels of abstraction, expressiveness, and support for integration with human workflows and external devices. This paper presents a systematic analysis of these four formalisms with respect to their suitability for robot mission specification. Our study focuses on mission-level descriptions rather than robot software development. We analyze their underlying control structures and mission concepts, evaluate their expressiveness and limitations in modeling real-world missions, and assess the extent of available tool support. By comparing the formalisms and validating our findings with experts, we provide insights into their applicability, strengths, and shortcomings in robotic system modeling. The results aim to support practitioners and researchers in selecting appropriate modeling approaches for designing robust and adaptable robot and multi-robot missions.

2026-03-16T15:34:48Z Gianluca Filippone Sara Pettinari Patrizio Pelliccione http://arxiv.org/abs/2603.15401v1 SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering? 2026-03-16T15:16:31Z

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.

2026-03-16T15:16:31Z Tingxu Han Yi Zhang Wei Song Chunrong Fang Zhenyu Chen Youcheng Sun Lijie Hu http://arxiv.org/abs/2603.15400v1 Multi-Objective Load Balancing for Heterogeneous Edge-Based Object Detection Systems 2026-03-16T15:15:56Z

The rapid proliferation of the Internet of Things (IoT) and smart applications has led to a surge in data generated by distributed sensing devices. Edge computing is a mainstream approach to managing this data by pushing computation closer to the data source, typically onto resource-constrained devices such as single-board computers (SBCs). In such environments, the unavoidable heterogeneity of hardware and software makes effective load balancing particularly challenging. In this paper, we propose a multi-objective load balancing method tailored to heterogeneous, edge-based object detection systems. We study a setting in which multiple device-model pairs expose distinct accuracy, latency, and energy profiles, while both request intensity and scene complexity fluctuate over time. To handle this dynamically varying environment, our approach uses a two-stage decision mechanism: it first performs accuracy-aware filtering to identify suitable device-model candidates that provide accuracy within the acceptable range, and then applies a weighted-sum scoring function over expected latency and energy consumption to select the final execution target. We evaluate the proposed load balancer through extensive experiments on real-world datasets, comparing against widely used baseline strategies. The results indicate that the proposed multi-objective load balancing method halves energy consumption and achieves an 80% reduction in end-to-end latency, while incurring only a modest, up to 10%, decrease in detection accuracy relative to an accuracy-centric baseline.

2026-03-16T15:15:56Z Daghash K. Alqahtani Maria A. Rodriguez Muhammad Aamir Cheema Adel N. Toosi http://arxiv.org/abs/2302.08018v3 Towards Fair Machine Learning Software: Understanding and Addressing Model Bias Through Counterfactual Thinking 2026-03-16T15:11:30Z

The increasing use of Machine Learning (ML) software can lead to unfair and unethical decisions, thus fairness bugs in software are becoming a growing concern. Addressing these fairness bugs often involves sacrificing ML performance, such as accuracy. To address this issue, we present a novel counterfactual approach that uses counterfactual thinking to tackle the root causes of bias in ML software. In addition, our approach combines models optimized for both performance and fairness, resulting in an optimal solution in both aspects. We conducted a thorough evaluation of our approach on 10 benchmark tasks using a combination of 5 performance metrics, 3 fairness metrics, and 15 measurement scenarios, all applied to 8 real-world datasets. The conducted extensive evaluations show that the proposed method significantly improves the fairness of ML software while maintaining competitive performance, outperforming state-of-the-art solutions in 84.6% of overall cases based on a recent benchmarking tool.

2023-02-16T01:27:26Z Zichong Wang Yang Zhou David Lo Wenbin Zhang