http://arxiv.org/abs/2606.15485v1 The Perils of Agency: How Developers Perceive, Prioritize, and Address Risks in Agentic AI Products 2026-06-13T21:49:32Z

Agentic AI systems act autonomously, use tools, adapt to context, and operate in complex real-world environments. However, these same characteristics can create or exacerbate product risks. We studied how industry developers (n=35) perceive, prioritize, and address the risks in their agentic AI products. We found that developers' perceptions of risk were closely tied to the qualities that made the product agentic, such as autonomy, tool use, and usage in a real-world context. Developers prioritized product and business risks before considering downstream societal risks like job displacement and end-user privacy. This prioritization also impacted developers' ability and motivation to mitigate agentic risks. Finally, developers lacked mature controls for containing agentic risks, often relying on constraining the same characteristics that make agents useful: e.g., autonomy and goal complexity. These findings reveal a capability vs. risk control tension in agentic AI development: developers need to address risks that emerge from agentic capabilities, yet they currently have limited support for doing so without constraining agentic functionality.

2026-06-13T21:49:32Z Hao-Ping Lee Jessica He David Piorkowski Thomas Serban von Davier Jodi Forlizzi Sauvik Das http://arxiv.org/abs/2606.15480v1 A Scalability Analysis of Quantitative Confidence Assessment Methods for Assurance Cases 2026-06-13T21:32:52Z

This paper proposes a model to estimate the decision complexity and effort required to apply quantitative confidence assessment methods to assurance cases. The model considers both the worst and average case for these measures and characterizes how these quantities scale with argument size. Prior work has indicated that the additional effort required to apply these methods is a barrier to their adoption by assurance case practitioners. Researchers developing new methods, or improving existing methods, can use this model to estimate the effort required to apply their method. The proposed model is parameterized using data from published case studies and is applied to three existing quantitative confidence assessment methods: the Bayesian Belief Network (BBN) method, the Dempster-Shafer Theory (DST) method, and the Certus method. The results show that, while Certus has the highest worst-case decision complexity, its average-case effort is lower than that of the BBN and DST methods.

2026-06-13T21:32:52Z Preprint. Version of Record to appear in SafeCOMP'26 Workshop Proceedings published by Springer Simon Diemert Jens H. Weber http://arxiv.org/abs/2606.19380v1 AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures 2026-06-13T18:29:53Z

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
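The abstract does not describe how the deterministic guardrails or the "3 strikes" policy are implemented, so the following is only a minimal sketch of how the two ideas might combine. The `StrikeGuard` class, the `BLOCKED_PATTERNS` deny-list, and the return values are all hypothetical names chosen for illustration, not AgentArmor's actual design.

```python
import re
from dataclasses import dataclass

# Hypothetical deny-list of destructive shell commands (an assumption;
# the paper's actual guardrail rules are not given in the abstract).
BLOCKED_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),
    re.compile(r"\bgit\s+push\s+.*--force\b"),
    re.compile(r"\bdrop\s+database\b", re.IGNORECASE),
]


@dataclass
class StrikeGuard:
    """Deterministic guardrail that halts the agent after repeated violations."""

    max_strikes: int = 3
    strikes: int = 0
    halted: bool = False

    def check(self, command: str) -> str:
        """Return 'allow', 'block', or 'halt' for a proposed shell command."""
        if self.halted:
            return "halt"
        if any(pattern.search(command) for pattern in BLOCKED_PATTERNS):
            self.strikes += 1
            if self.strikes >= self.max_strikes:
                # Third violation: stop the session instead of just blocking.
                self.halted = True
                return "halt"
            return "block"
        return "allow"


guard = StrikeGuard()
for cmd in ["ls -la", "rm -rf /", "git push origin main --force", "DROP DATABASE prod"]:
    print(cmd, "->", guard.check(cmd))
```

In this sketch the first two violations are blocked individually and the third halts the session, so a harness built this way would need a human to reset the guard before the agent could continue.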

2026-06-13T18:29:53Z Kenneth Ge Andre Assis http://arxiv.org/abs/2606.06563v2 AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps 2026-06-13T16:07:09Z

Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging due to inherent ambiguity and imprecision. Recent advances in AI, natural language processing (NLP), and large language models (LLMs) have made automating this pipeline increasingly feasible, while introducing new risks including hallucination, reduced traceability, and inconsistent evaluation. This survey addresses four research questions: what AI and NLP techniques have been proposed for generating test cases from natural language requirements; what tools and frameworks support these approaches; how generated test cases are evaluated; and what research gaps remain. Following Kitchenham and Charters' systematic review guidelines, we searched major scholarly databases spanning 2000-2025 and, after applying strict inclusion criteria, identified 21 primary studies. The literature is organized into three evolutionary eras, revealing that no existing approach simultaneously satisfies six key quality dimensions: automation, ambiguity handling, domain applicability, traceability, evaluation thoroughness, and hallucination control. The survey makes three main contributions: a three-era evolutionary synthesis of AI-based test generation; a six-criteria gap analysis showing no current approach fully addresses all quality dimensions; and four actionable research guidelines targeting hallucination, traceability, complexity sensitivity, and compliance.

2026-06-04T15:07:23Z 22 pages, 7 figures, 4 tables Orimoloye Folorunsho Hassan Reza http://arxiv.org/abs/2606.17094v1 LogCopilot: Automating Log Aggregation Analysis through Large Language Models 2026-06-13T15:47:53Z

Logs record the runtime behavior of software and are widely used in various tasks such as debugging, testing, and fault diagnosis. With the increase in system size and complexity, log analysis has gradually become a challenging task. Current industrial systems typically use log aggregation systems such as Grafana Loki and ELK to simplify the log collection and analysis process. Engineers who write queries in the DSL provided by these systems can complete a variety of log analysis tasks. However, writing these queries is often time-consuming and labor-intensive, as it requires engineers to have a thorough understanding of the DSL syntax and the detailed information contained in the logs. To address these challenges, this paper proposes LogCopilot, an automated log aggregation analysis framework based on large language models (LLMs). LogCopilot accepts natural language log analysis instructions and accomplishes automated log analysis through knowledge retrieval and tool calling. LogCopilot constructs a hierarchical knowledge base to represent and provide key knowledge in logs, and it achieves automated log aggregation analysis by generating and executing LogQL queries. The evaluation based on four log datasets confirms the effectiveness of LogCopilot, which achieves an average accuracy of 76.8% and outperforms baseline approaches. Moreover, experimental results show that LogCopilot is effective in LogQL query generation.
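To give a sense of the kind of query such a framework has to produce, here is a small sketch that assembles a LogQL metric query from structured inputs. The helper name `build_logql_count` and the label values are illustrative assumptions; LogCopilot itself generates queries with an LLM, which this sketch does not attempt to reproduce.

```python
def build_logql_count(labels: dict[str, str], contains: str, window: str) -> str:
    """Build a LogQL metric query that counts lines containing a substring.

    labels   -- stream selector labels, e.g. {"app": "checkout"}
    contains -- substring used in the |= line filter
    window   -- range duration such as "5m"
    """
    selector = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    # Escape backslashes and double quotes so the filter stays a valid string literal.
    escaped = contains.replace("\\", "\\\\").replace('"', '\\"')
    return f'sum(count_over_time({{{selector}}} |= "{escaped}" [{window}]))'


query = build_logql_count({"app": "checkout", "env": "prod"}, "timeout", "5m")
print(query)
# prints: sum(count_over_time({app="checkout",env="prod"} |= "timeout" [5m]))
```

Even this simple query requires knowing the label names, the filter operator, and the range syntax, which is the knowledge burden the abstract describes.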

2026-06-13T15:47:53Z Senyu Xie Chenxi Zhang Tong Zhou Jiacheng Liu Xiaoyu Hong Qingshan Li Xin Peng http://arxiv.org/abs/2605.15245v2 Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle 2026-06-13T14:49:13Z

Agentic AI in software product development is increasingly adopted by organizations, yet the field lacks a consolidated synthesis of where adoption is mature, which architectural patterns dominate, and what limitations and coping mechanisms exist in industrial deployments. This systematic literature review addresses these gaps by establishing a body of knowledge as a starting point. Following Kitchenham guidelines, we queried four major research databases, obtaining over 1600 candidate publications. To handle this volume, we developed and validated a domain-agnostic multi-agent screening pipeline that extends prior LLM-assisted review tools by combining automatic metadata curation, inter-agent iterative dialogue, and conflict-resolution defaults that minimize false negatives. From the 92 manually verified primary studies, our thematic synthesis reveals that output verifiability is the primary enabler of agentic adoption: later SDLC phases, whose outputs are objectively evaluable through executable feedback, demonstrate the highest maturity and industrial presence, while earlier phases remain almost exclusively academic proofs-of-concept. We identify the Planner-Executor-Reviewer role specialization as the dominant architectural pattern, with the Reviewer agent implementing verifiability through executable feedback loops. Across all challenge categories, industrial mitigation strategies converge on confining agent actions to verifiable, bounded spaces. This study contributes a comprehensive characterization of the current literature on agentic systems in software product development, and a methodological contribution in the form of an AI-assisted tool to automate the screening phase in high-volume SLR domains.

2026-05-14T10:46:51Z 17 pages Spyridon Alvanakis Apostolou Jan Bosch Helena Holmström Olsson http://arxiv.org/abs/2606.15283v1 AI-driven Software Development: A Pragmatic Path to Agentic Development Processes 2026-06-13T12:43:38Z

Generative AI is transforming software development from localized tool support into development work that is embedded in processes, tools, and organizational structures. Its use now extends beyond code completion to requirements, architecture, implementation, testing, review, operations, and maintenance. Existing research shows a differentiated picture. Productivity gains are possible, but depend on task type, codebase characteristics, and developers' experience. At the same time, AI-generated artifacts require additional control and governance. Building on these observations, this paper develops a pragmatic organizing framework for the transition toward AI-driven Software Development. It describes a progression from informal and assistive AI use through integrated AI workflows toward controlled agentic development processes. The focus is not on individual tools or models, but on the technical, organizational, and quality-assurance mechanisms needed to embed AI across central software engineering activities. Particular importance is assigned to a harness that connects project context, tool access, verification, permissions, logging, and human approval. The paper draws on current research, practice-oriented sources, established software engineering practices, and project experience. A mid-sized software company is used as an exploratory case study to assess the plausibility of the framework and to illustrate how prerequisites, governance requirements, design practices, and transformation paths can be shaped in a concrete organizational context. The paper provides a conceptual basis for further scholarly discussion and empirical investigation of AI-driven Software Development.

2026-06-13T12:43:38Z Peter Mandl Paul Mandl http://arxiv.org/abs/2606.15122v1 The Hitchhiker's Guide to Program Analysis, Part III: Mostly Harmless LLMs 2026-06-13T05:26:08Z

LLMs are increasingly used in bug analysis to reason about code and judge whether a potential bug can be triggered in realistic execution contexts, with recent work showing promising empirical results. However, empirical effectiveness does not make a plausible model-generated rationale sufficient for discharging warnings. This distinction is especially important for no-bug decisions: dismissing a report or warning requires establishing that the reported error state is unreachable in the program context being analyzed, not merely offering a plausible explanation for why it may not occur. We argue that program-behavior reasoning should be grounded in formal analysis, rather than performed directly by LLMs. We present Evident, a bug analysis system that separates LLM assistance from program-behavior reasoning, delegating the latter to backend analysis. Given a warning specifying the reported location and data flow, Evident uses an LLM only to construct a warning-specific analysis harness. Evident then validates the harness before invoking the backend. The backend performs the harness-relative check: whether the reported error state is unreachable under the constructed harness and its assumptions. We evaluate Evident on 200 real Android kernel driver warnings from two existing static detectors. Evident correctly classifies 151 cases (76%), including discharging 111 false alarms, without discharging any confirmed bug in the dataset; the remaining cases are either unresolved or conservatively retained as potential bugs. Evident also rediscovers a confirmed vulnerability overlooked by both prior LLM-based filtering and manual triage.

2026-06-13T05:26:08Z Haonan Li Tianyang Zhou Manu Sridharan Hang Zhang Zhiyun Qian http://arxiv.org/abs/2603.27249v3 "An Endless Stream of AI Slop": How Developers Discuss the Burden of AI-Assisted Software Development 2026-06-13T05:01:26Z

"AI slop", that is, low-quality AI-generated content, is increasingly affecting software development, from generated code and pull requests to documentation and bug reports. However, there is limited empirical research on how developers perceive and respond to this phenomenon. We qualitatively analyzed how developers discuss AI slop in 1,154 Reddit and Hacker News posts, developing a codebook of 15 codes organized into three thematic clusters: Review Friction (how AI slop burdens reviewers, erodes trust, and prompts countermeasures), Quality Degradation (damage to codebases, knowledge resources, and developer competence), and Forces and Consequences (systemic incentives, mandated adoption, craft erosion, and workforce disruption). Our findings frame AI slop as a tragedy of the commons, where individual productivity gains externalize costs onto reviewers, maintainers, and the broader community. We report the concerns developers raise and the mitigation strategies they propose, with implications for tool developers, team leads, and educators.

2026-03-28T11:50:53Z 7 pages, 2 figures, 1 table Sebastian Baltes Marc Cheong Christoph Treude http://arxiv.org/abs/2606.15084v1 Specifications for Humans, Agents, and Tooling 2026-06-13T03:32:26Z

Specifications are the central mechanism for communicating intents, requirements, and constraints in software development. When they are explicit, clear, and reliable, they are an effective means for collaboration and cooperation. They allow stakeholders to specify what they want, developers (or AI agents) to understand and implement the needed functionality, clients to effectively use the system, and automated tooling to validate the correctness of each of these steps. This tool paper outlines the Bosque API (BAPI) ecosystem, a software ecosystem designed to support modern spec-centered development. The BAPI specification language works in a fully polyglot ecosystem and provides a suite of features, including unparalleled expressivity, test generation, validation, and sandboxing to support the complete application development lifecycle. These are critical to supporting emerging security and coding (both API implementation & usage) challenges presented by agentic AI systems.

2026-06-13T03:32:26Z Mark Marron http://arxiv.org/abs/2603.24624v2 ReSyn: A Generalized Recursive Regular Expression Synthesis Framework 2026-06-13T02:41:50Z

Existing Programming-By-Example (PBE) systems often rely on simplified benchmarks that fail to capture the high structural complexity of real-world regexes, such as deeper nesting and frequent use of union operations. To overcome the resulting performance drop, we propose ReSyn, a synthesizer-agnostic divide-and-conquer framework that decomposes a complex synthesis problem into manageable sub-problems. We also introduce Set2Regex, a parameter-efficient synthesizer capturing the permutation invariance of examples. Experimental results demonstrate that ReSyn significantly boosts accuracy across various synthesizers, and its combination with Set2Regex establishes a new state-of-the-art on a challenging real-world benchmark. The complete source code, datasets, and pre-trained model checkpoints are publicly available at https://github.com/mrseongminkim/ReSyn.

2026-03-25T01:29:42Z Accepted at IJCAI 2026 Seongmin Kim Hyunjoon Cheon Su-Hyeon Kim Yo-Sub Han Sang-Ki Ko http://arxiv.org/abs/2605.30208v2 Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency 2026-06-12T22:21:34Z

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
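The effect of relaxing a percentile-based risk threshold can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the convention that a lower score means lower risk, and the synthetic scores. It is not Meta's Diff Risk Score or the RADAR funnel.

```python
from statistics import quantiles


def risk_threshold(scores: list[float], percentile: int) -> float:
    """Return the score at the given percentile (1-99) of historical risk scores."""
    cut_points = quantiles(scores, n=100, method="inclusive")  # 99 cut points
    return cut_points[percentile - 1]


def eligible_share(scores: list[float], percentile: int) -> float:
    """Fraction of diffs whose score is at or below the percentile threshold.

    Assumes lower scores mean lower risk (an illustrative convention).
    """
    threshold = risk_threshold(scores, percentile)
    return sum(score <= threshold for score in scores) / len(scores)


# Synthetic scores 0.0 .. 100.0, used only to make the arithmetic easy to follow.
history = [float(i) for i in range(101)]
for pct in (25, 50):
    print(pct, risk_threshold(history, pct), round(eligible_share(history, pct), 3))
```

Moving the cut from the 25th to the 50th percentile roughly doubles the share of diffs that pass this one gate, which is the yield side of the yield-versus-safety trade-off the abstract's second question asks about.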

2026-05-28T16:44:07Z Chris Adams Arjun Singh Banga Parveen Bansal Souvik Bhattacharya Payal Bhuptani Rujin Cao Pedro Canahuati Nate Cook Brian Ellis Prabhakar Goyal Gurinder Grewal Tianyu He Matt Labunka Alex Manners David Molnar Ging Cee Ng Vishal Parekh Jiefu Pei Frederic Sagnes James Saindon Will Shackleton Sid Sidhu Gursharan Singh Karthik Chengayan Sridhar Matt Steiner Pratibha Udmalpet Sean Xia Stacey Yan Audris Mockus Peter Rigby Nachiappan Nagappan http://arxiv.org/abs/2606.14948v1 Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment 2026-06-12T20:46:04Z

LLMs have substantially improved software engineering, yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. We propose an agentic judging pipeline using a strong LLM as a scalable proxy for expert architectural evaluation, comprising two judges: the Architecture Complexity Judge (ACJ), which estimates the codebase-specific architectural understanding a task demands, and the Architecture Quality Judge (AQJ), which evaluates patch conformance to repository-specific architectural conventions via source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B on 3,360 curated instances achieves resolved rates of up to 27.2% on SWE-bench Verified, an improvement of up to 540% over the base model and 256% over unfiltered fine-tuning. Meanwhile, the trained models achieve strong cross-language generalization and consistent improvements in architectural patch quality.

2026-06-12T20:46:04Z Kirill Vasilevski Justina Ximing Dong Justina Benjamin Rombaut Justina Ruochen Deng Justina Jiahuei Lin Justina Arthur Leung Dayi Lin Boyuan Chen Shaowei Wang Ahmed E. Hassan http://arxiv.org/abs/2605.03505v2 LATS-RCA: Language Agent Tree Search for Root Cause Analysis in Microservices 2026-06-12T20:12:05Z

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice systems (MSS). However, existing approaches typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose the Language Agent Tree Search for RCA (LATS-RCA) in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search over root-cause hypotheses, where multiple agents iteratively analyze logs and metrics to collect evidence, and reflection scores guide the search toward the most likely root causes for abnormal services. We evaluate LATS-RCA on the open benchmark (LO2), achieving 91.3% diagnostic accuracy and analyzing the associated computational cost. Variation among the frontier-tier LLMs (Claude Sonnet 4.5, GPT-5, and Gemini 3 Pro) is small, between 89.7% and 91.3%, demonstrating our approach is model-agnostic. We also conduct an exploratory study by evaluating LATS-RCA on real-world incidents from a web-hosting company's (Zoner Oy) production MSS that serves over 300,000 websites across Europe. We find that LATS-RCA correctly diagnoses 65.1% of the production incidents on average over multiple runs. This reveals key challenges of real-world RCA, including multi-factor root causes, large-scale system complexity, and incomplete observability, which are absent from open benchmarks. Future work should develop more realistic open datasets for RCA and validate LATS-RCA with additional datasets. Our replication package is available at https://github.com/kottinov/lats-rca.

2026-05-05T08:39:42Z Accepted at the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2026) Alexander Naakka Yuqing Wang Mika V Mäntylä http://arxiv.org/abs/2606.14924v1 Pandas for Reproducible Data Analysis: From Spreadsheets to Research-Grade Python Workflows 2026-06-12T20:00:40Z

Spreadsheet-heavy analytical work remains common in business analytics, operations reporting, and applied research, yet workbooks that grow through formulas, manual edits, and copy-paste refresh are difficult to audit, reproduce, and govern at scale. When tabular work requires repeatability, validation, version control, automated refresh, or integration with statistics and machine learning, analysts need a transformation layer that preserves familiar table concepts while making assumptions explicit. This paper treats the Python pandas library as that layer: a practical bridge between spreadsheet practice and research-grade workflows, not a wholesale replacement for Excel. The paper contributes an Excel-to-pandas migration mapping, a taxonomy of nine workflow categories, seven end-to-end examples drawn from business analytics and applied research, a failure-mode catalog, and reusable code recipes for governed tabular work. pandas is most useful when tabular analysis must be repeatable, auditable, and defensible, while Excel can remain a familiar input and output interface for stakeholders who need workbooks.

2026-06-12T20:00:40Z 39 pages, 8 figures Sidney Shapiro Daniel Pearson Emiliano Sebastian Gonzalez Venegas