https://arxiv.org/api/QBA+qD2Ser8E0Q5Mwdo+5d5AJU42026-06-21T20:37:59Z2735919515http://arxiv.org/abs/2606.12986v1The Rise of AI-Native Software Engineering: Implications for Practice, Education, and the Future Workforce2026-06-11T07:23:49ZGenerative Artificial Intelligence (GenAI), Large Language Models (LLMs), and emerging Agentic AI constitute the most disruptive transformation in the history of software engineering (SE), reshaping development processes, required competencies, professional roles, and the educational outcomes that universities must deliver. This paper presents a systematic review of 48 verified, influential peer-reviewed publications (2016--2026) drawn from leading venues in software engineering, machine learning, computing education, human--AI collaboration, and software productivity. Studies were discovered, screened, and analyzed through a four-agent research workflow (Literature Discovery, Scientometric Analysis, Curriculum Transformation, and Workforce Impact) and were verified against primary sources. We synthesize the evidence along nine themes and three trajectories -- practice, education, and workforce -- and report a scientometric inflection in which annual LLM-for-SE output grew roughly five-fold after late 2022. From this synthesis we contribute: (i) a conceptual framework for AI-native software engineering organized around \emph{intent}, \emph{collaboration}, and \emph{verification}; (ii) a nine-dimension competency model spanning specification, critical evaluation, agent orchestration, and metacognition; (iii) a four-phase university curriculum roadmap with AI-resilient assessment; (iv) faculty-development and workforce-transformation strategies; and (v) a prioritized agenda of eleven research gaps. The evidence base is internally contradictory on the magnitude and direction of productivity effects, underscoring that benefits are strongly context-dependent and that educating engineers for judgment, verification, and orchestration -- rather than code production alone -- is the central challenge of the AI-native era.2026-06-11T07:23:49ZMamdouh Alenezihttp://arxiv.org/abs/2512.21781v2The State of the SBOM Tool Ecosystems: A Comparative Analysis of SPDX and CycloneDX2026-06-11T04:43:18ZSoftware Bills of Materials (SBOMs) improve software release transparency by documenting components and dependencies, but their practical value depends on the tools that generate, analyze, and manage them. This paper compares the tool ecosystems of the two dominant SBOM formats: SPDX and CycloneDX. We analyze 108 open-source and 62 proprietary SBOM tools, compare ecosystem-level health metrics across 470 SPDX and 171 CycloneDX tools, examine 36,990 issue reports from open-source tools, and study the top 250 open-source projects using each format. Our results show that CycloneDX-using projects often exhibit stronger developer engagement and selected project health indicators, while SPDX benefits from a larger, more mature tool ecosystem and broader industry adoption. These findings highlight the complementary strengths of both ecosystems and identify opportunities for improving SBOM tooling across formats.2025-12-25T20:50:51Zhttps://sailresearch.github.io/26-ali-sbom-emseZhimin ZhaoAbdul Ali BangashTongxu GeArshdeep SinghZitao WangBram Adamshttp://arxiv.org/abs/2606.12864v1Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming2026-06-11T03:48:18ZDespite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.2026-06-11T03:48:18ZTingqiang XuHangrui ZhouTianle CaiAlex GuKaifeng Lyuhttp://arxiv.org/abs/2603.16013v3Safety Case Patterns for VLA-based driving systems: Insights from SimLingo2026-06-11T03:02:37ZVision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving as well as understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.2026-03-16T23:43:38ZGerhard YuFuyuki IshikawaOluwafemi OduAlvine Boaye Bellehttp://arxiv.org/abs/2601.19072v3HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation2026-06-11T01:26:21ZLarge Language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations -- where the generated review comments are ungrounded in the actual code -- poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for a hallucination detection in LLM-generated code review comments without the reference. In this work, we design HalluJudge that aims to assess the grounding of generated review comments based on the context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference of the actual LLM-generated code review comments in the real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference of the actual LLM-generated review comments in the online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.2026-01-27T01:16:45ZAccepted at FSE'26: Industry Track, Full-Length, Peer-ReviewedKla TantithamthavornHong Yi LinPatanamon ThongtanunamWachiraphan CharoenwetMinwoo JeongMing Wu10.1145/3803437.3805236http://arxiv.org/abs/2605.26144v2VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents2026-06-11T00:11:54ZWe present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.2026-05-22T20:29:12ZProject page: https://kaboider.github.io/VIS_APP/; Code: https://github.com/kaboider/VIS_APP_Code; Dataset: https://huggingface.co/datasets/JunJiaGuo/VIS-APP-BenchJunJia GuoJoeYuhang YaoJoe JiaweiJoe ZhouJingdi Chenhttp://arxiv.org/abs/2602.13379v2Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents2026-06-10T21:28:47ZLLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.2026-02-13T18:38:18ZXu LiSimon YuMinzhou PanYiyou SunBo LiDawn SongXue LinWeiyan Shihttp://arxiv.org/abs/2510.20340v4Classport: Designing Runtime Dependency Introspection for Java2026-06-10T20:50:37ZRuntime introspection of dependencies, i.e., the ability to observe which dependencies are currently used during program execution, is fundamental for Software Supply Chain security. Yet, Java has no support for it. We solve this problem with Classport, a blueprint and system that embeds dependency information into Java class files, enabling the retrieval of dependency information at runtime. We evaluate Classport on six real-world projects, demonstrating the feasibility in identifying dependencies at runtime.2025-10-23T08:39:30ZSerena CofanoDaniel WilliamsAman SharmaMartin Monperrushttp://arxiv.org/abs/2606.12620v1HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection2026-06-10T19:21:19ZThanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are, increasingly, a hybrid of AI- and human-authored code. For risk management and productivity analysis purposes, it is crucial to enable fine-grained location detection of AI-generated code. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume a code snippet is either completely human-authored or completely AI-authored, which is not reflective of the diverse intents and styles of industry codebases utilizing AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic utilization of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open sourced repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line- and chunk-level. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark with a top-scoring algorithm, AIGCode Detector, obtaining a highest F1 score of 0.48 and 0.56 on chunk-level and line-level code detection tasks, respectively.2026-06-10T19:21:19ZAccepted to LREC 2026LREC 2026 proceedings (pp. 1520-1532)Luke PattersonLi WangAdam Faulkner10.63317/4edsbxrqe8nahttp://arxiv.org/abs/2511.13271v2Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming2026-06-10T18:44:12ZThe rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.2025-11-17T11:42:24Z9 pages, 4 figures, published at AIWARE 2025Rufeng ChenShuaishuai JiangJiyun ShenAJung MoonLili Wei10.1109/AIware69974.2025.00016http://arxiv.org/abs/2606.12592v1Characterizing Tests in IoT Software: Practices, Challenges and Opportunities2026-06-10T18:39:13ZThe Internet of Things (IoT) is experiencing rapid growth. Smart devices are emerging in smart homes and industrial applications, performing mission-critical tasks. Bugs in IoT software can lead to severe consequences. For example, a buggy smart lock can allow unauthorized access to a private property. Testing is a primary practice to expose software bugs and ensure software quality. However, little is known about how IoT software is tested. To bridge this gap, we conducted the first empirical study on test cases in open-source IoT software. Specifically, we evaluated the effectiveness of test cases in IoT software, explored the challenges inherent in testing IoT software, and analyzed the usage of mock objects. Our results indicate that while IoT software often contains a considerable number of tests, their effectiveness remains limited. We identified the primary challenges in testing IoT software as managing complex interactions with various external dependencies, such as other network-reliant IoT components, file systems, operating systems, and databases. We also observed that the use of mock objects in IoT software closely aligns with our identified testing challenges. This alignment demonstrates the potential of mocking as a solution to enhance test coverage and address the complexities of IoT software testing.2026-06-10T18:39:13Z15 pages, 4 figuresIEEE Transactions on Software Engineering, 2026Rufeng ChenHengcheng ZhuWuqi ZhangZixu ZhouLili Wei10.1109/TSE.2026.3700841http://arxiv.org/abs/2604.25363v2Commit-Aware Learning-Based Test Case Prioritization for Continuous Integration2026-06-10T16:56:28ZRegression testing in Continuous Integration (CI) pipelines is increasingly costly due to the growing size and execution frequency of test suites. Test Case Prioritization (TCP) mitigates this problem by reordering tests to expose faults earlier. However, most existing techniques rely primarily on historical execution data and coverage metrics, neglecting the rich structural information contained in code changes. This paper proposes a commit-aware, learning-based TCP method that combines structural properties of version-control diffs, test coverage relations, and historical execution behavior into a unified predictive model. Given a new commit, the method estimates the probability that each test suite will reveal at least one failure and prioritizes test execution accordingly. We evaluate our method on five Defects4J projects using a leave-one-project-out cross-project validation setting. Results show that the commit-aware TCP significantly outperform non-commit-aware-baselines in both classification and prioritization effectiveness. Our findings show that including commit structural semantics substantially enhances regression fault detection and enables robust, generalizable learning-based TCP in CI environments.2026-04-28T08:30:25ZLorenzo AbbondanteGerardo Canforahttp://arxiv.org/abs/2606.12320v1A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents2026-06-10T16:54:47ZEnterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise's behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains.
We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes -- network, identity, endpoint, data -- that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate's tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.2026-06-10T16:54:47Z65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future workKrti Tallamhttp://arxiv.org/abs/2606.12231v1Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study2026-06-10T15:39:47ZThe adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced "Rules" as a novel software artifact, allowing developers to persistently inject project-specific constraints and architectural guidelines into the context of Large Language Models (LLMs). Despite their role in aligning AI behavior with developer intent, the taxonomy, evolution, and practical impact of these rules remain largely unexplored. To bridge this gap, we conducted a mixed-methods empirical study on AI IDE rules. By mining 83 open-source projects and extracting 7,310 rules, we established a comprehensive taxonomy comprising 5 primary and 25 secondary categories. We then triangulated these artifacts with survey responses from 99 practitioners. Our analysis identified a contrast between developer priorities and actual configurations: while practitioners rate architectural constraints as highly important, rule files in repositories primarily consist of low-level workflow and code formatting constraints. Furthermore, our analysis of 1,540 rule evolution events revealed that rules are updated frequently. Repository data further indicate that rule evolution is primarily driven by constructive context expansions (29.17%) and enrichments (26.59%). In contrast, surveyed developers reported modifying rules primarily to correct AI errors (77.78%), typically by adding new negative constraints rather than editing existing ones. Finally, an artifact compliance assessment of 160 rule evolution events revealed that updating rules significantly improves the adherence of software artifacts, with the average artifact compliance rate increasing by 22.99% (from 49.14% to 72.13%) following an update. Our study provides empirical insights that can help developers optimize prompting strategies and guide tool builders in designing automated conflict-detection and context-management mechanisms for AI IDEs.2026-06-10T15:39:47Z52 pages, 21 images, 8 tables, Manuscript submitted to a Journal (2026)Guangzong CaiRuiyin LiPeng LiangZengyang LiMojtaba Shahinhttp://arxiv.org/abs/2606.12212v1Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps2026-06-10T15:29:05ZThe rapid integration of large language models (LLMs) into mobile applications has introduced a new class of credential security risk: leaked credentials that grant unauthorized access to LLM inference services, causing financial damage to developers. Prior work on credential leakage has focused primarily on Android apps; to date, no empirical study has systematically investigated LLM API key leakage in iOS applications.
We present the first in-depth empirical study of API key leakage in LLM-integrated apps. We construct a high-quality dataset of 444 iOS applications, filtered from 1092 candidates through a standardized process, and develop LLMKeyLens, a dynamic analysis framework that detects LLM API key leakage via traffic interception, provider-specific key extraction, and active validity confirmation, requiring neither source code access nor binary decryption. Our analysis reveals that 282 applications expose exploitable LLM API credentials in network traffic, spanning at least ten providers. We identify three leakage patterns: JWT-based token leakage (48%), unauthenticated backend proxy access (33%), and plaintext API key transmission (19%). To assess remediation, we re-analyzed the same 282 vulnerable applications three months after responsible disclosure; only 28% had remediated the reported vulnerability, while 72% remained exploitable, with persistent issues stemming from unauthenticated backends and broken JWT implementations.
Our findings show that LLM API key leakage is both prevalent and persistent in the iOS ecosystem, exposing a systemic gap between developer practice and secure integration principles, and suggest that secure LLM integration requires not only developer awareness but also explicit security guidance from providers and platform-level enforcement.2026-06-10T15:29:05Z12 pages, 4 figures, 4 tablesPinran GaoLingxiang WangYing ZhangFan Yang