https://arxiv.org/api/VvnuQfiFBO9/5h2VSsW63DiA2Dk 2026-06-21T09:06:45Z 27359 45 15 http://arxiv.org/abs/2606.19042v1 Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration 2026-06-17T13:10:24Z In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis on 10 vibe coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., at compile and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built, free of dead code binary for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code. 2026-06-17T13:10:24Z VARIABILITY 2026 Xhevahire Tërnava http://arxiv.org/abs/2606.15828v3 Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents 2026-06-17T12:12:11Z Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.‌md or CLAUDE.‌md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.‌md or a CLAUDE.‌md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions. 2026-06-14T14:07:09Z Helio Victor F. dos Santos Vitor Costa Joao Eduardo Montandon Luciana Lourdes Silva Marco Tulio Valente http://arxiv.org/abs/2606.18976v1 CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System 2026-06-17T12:00:21Z Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions. 2026-06-17T12:00:21Z Accepted for publication at the 38th International Conference on Software Engineering Education and Training Marco Becattini Niccolò Caselli Matteo Minin Roberto Verdecchia Enrico Vicario http://arxiv.org/abs/2606.18855v1 Toward Semantically-Seeded, Graph-Propagated Impact Analysis Across Software Artifacts: A Vision 2026-06-17T09:34:56Z When a single software artifact changes - a requirement, a configuration value, or a function - engineers must determine what else is impacted. Existing change-impact-analysis (CIA) tooling tends to rely on one of two signals in isolation: semantic similarity recovered from text (information-retrieval traceability, code search, embeddings), or structural dependency following (call graphs, IDE "find usages", test-impact selection). Each has a characteristic blind spot. A semantically driven tool misses an impacted artifact whose text shares no vocabulary with the change; a structurally driven tool misses artifacts related in meaning but not joined by an edge, and most operate only over code rather than the Requirement-Config-Service-Test chain. We argue for a training-free and interpretable analyzer that fuses both signals over the same embeddings. We model the system as a heterogeneous artifact graph with typed edges recovered by static analysis, compute a semantic prior by cosine similarity to the changed artifact, propagate impact multi-hop with decay over a row-normalized propagation matrix, and blend the two with a single tunable weight lambda. A small but complete proof-of-concept on a payment subsystem (5 labelled change scenarios) shows the mechanism we care about: artifacts with zero textual overlap with the change are still recovered through propagation, and helper functions that propagation alone cannot reach are recovered through the semantic layer. The fusion is the only configuration that covers both blind spots, and lambda acts as an explicit precision/recall control. Drawing on four publicly documented production failures, we argue that the same formulation extends to operational artifacts (images, metrics, dashboards, data schemas) that code-only analysis cannot reach. 2026-06-17T09:34:56Z Momil Seedat http://arxiv.org/abs/2606.18733v1 SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents 2026-06-17T06:22:28Z Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1\% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay. 2026-06-17T06:22:28Z Qiao Zhao JianYing Qu Jun Zhang Yehua Yang Hanwen Du Zhongkai Sun http://arxiv.org/abs/2606.19395v1 DevOps and General Developers: Insights from Stack Overflow's 2023 Survey 2026-06-17T05:54:15Z Purpose: To investigate the distinct roles of DevOps specialists and general software developers, examining their varying use of tools, technologies, methodologies, and demographics in the current software development environment. In addition, to differentiate these two professional groups regarding their unique contributions and challenges in the field. Design/Methodology/Approach: The research uses a quantitative approach to analyze data from the Stack Overflow 2023 Developer Survey. It focuses on a comparative analysis of technological preferences, demographic information, and professional experiences between DevOps specialists and general developers, highlighting key trends and differences. The data analysis was conducted using Python's Pandas library for data analysis. Findings: The research indicates no significant difference in the tool and technology preferences between DevOps specialists and general software developers, highlighting their complementary roles. DevOps specialists and general software developers use tools like Docker and Kubernetes, emphasizing efficiency and automation. While general developers employ diverse tools for various role demands, demographic trends show younger general developers and mid-career DevOps professionals. This age range reflects growing experience in DevOps, and both groups are adapting to remote and hybrid work models in the evolving tech industry. Practical Implications: This research offers perspectives on the dynamic roles within software development, emphasizing the growing importance of DevOps. It is a valuable resource for academic and industry professionals to understand the evolving dynamics in software development roles. Originality/Value: This research fills a significant gap in the existing literature regarding the evolving dynamics of software development roles. 2026-06-17T05:54:15Z 17 pages, 11 tables, research paper based on the 2023 Stack Overflow Developer Survey data analysis Hasan Abdulla Fatema AlJazeeri Fawzi AlBalooshi Jaflah Al-Ammary http://arxiv.org/abs/2606.17510v2 OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem 2026-06-17T04:49:05Z Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments. 2026-06-16T04:40:58Z This manuscript is a full version of a paper accepted in shortened form by IEEE International Conference on Joint Cloud Computing I-Ling Yen Akeem Mohammed Farokh Bastani San-Yih Hwang http://arxiv.org/abs/2412.15557v4 MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems 2026-06-17T03:05:03Z With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150\% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget. 2024-12-20T04:31:03Z Accepted for publication in IEEE Transactions on Software Engineering (TSE) Aaron Guoxiang Guo Aldeida Aleti Neelofar Neelofar Chakkrit Tantithamthavorn Yuanyuan Qi Tsong Yueh Chen 10.1109/TSE.2026.3701230 http://arxiv.org/abs/2511.00802v2 GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents 2026-06-17T02:55:02Z With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive. 2025-11-02T04:47:17Z Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026 Jie JW Wu Ayanda Patrick Herlihy Ahmad Saleem Mirza Ali Afoud Fatemeh Fard 10.1145/3815588 http://arxiv.org/abs/2606.18619v1 Code-Augur: Agentic Vulnerability Detection via Specification Inference 2026-06-17T02:32:45Z The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek. 2026-06-17T02:32:45Z Zhengxiong Luo Mehtab Zafar Dylan Wolff Abhik Roychoudhury http://arxiv.org/abs/2602.02690v3 Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All 2026-06-17T01:08:37Z Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash-resolution, their evaluation benchmarks are usually static and thus, do not capture the evolving nature of the Linux kernel, and suffer from potential data contamination due to LLM knowledge cutoffs. To address the above problem, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavy-weight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. To this end, we curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however only ~20% of generated patches closely match developer fixes. Additionally, exposing crash resolution feedback improves crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time and attribute sensitive; complete with a public dashboard to track agent progress on Linux kernel bugs. 2026-02-02T19:06:15Z Chenxi Huang Alex Mathai Feiyang Yu Aleksandr Nogikh Petros Maniatis Franjo Ivančić Eugene Wu Kostis Kaffes Junfeng Yang Baishakhi Ray http://arxiv.org/abs/2606.18543v1 CEO-Bench: Can Agents Play the Long Game? 2026-06-16T23:37:52Z Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time. 2026-06-16T23:37:52Z Haozhe Chen Karthik Narasimhan Zhuang Liu http://arxiv.org/abs/2606.18536v1 Analytics for Quality Assurance for Item Pools (AQuAP): Monitoring and Maintaining Item Bank Health in AI-Driven Assessment Systems 2026-06-16T23:07:33Z The large-scale digitization of educational assessment has made the continuous oversight of item banks both essential and complex. This paper presents Analytics for Quality Assurance for Item Pools (AQuAP), a dashboard environment for monitoring item quality and item bank health. AQuAP supports the operational implementation of the large scale item generation procedures for high-stakes tests as included in the Item Factory, a framework for automated and human-supported test development. The paper describes AQuAP in relationship with the process of item development, outlines the broader metric framework for item-pool quality assurance, and highlights the Effective Bank Size (EBS) as one central indicator of pool vitality. EBS quantifies how many independent test sessions can be constructed before content repetition occurs and, when coupled with exposure and usage metrics, provides insight into item bank security, diversity, and efficiency. We further introduce bank-health metrics, such as maximum exposure, maximum conditional exposure, adjusted effective bank size, and the rarely-administered fraction, all of which extend this picture of item utilization. AQuAP illustrates how operational analytics can translate psychometric concepts into quality assurance tools for high-volume, AI-enabled testing programs. This work is illustrated with the Duolingo English Test (DET) processes. 2026-06-16T23:07:33Z 11 pages, 4 figures Alina A. von Davier Xiaowan Zhang Yigal Attali Yena Park Jacqueline Church Andrew Runge Geoff T. LaFlair Alexander Tsigler http://arxiv.org/abs/2606.18532v1 AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework 2026-06-16T22:57:24Z AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance. 2026-06-16T22:57:24Z 50 pages, 8 figures, 10 tables Inderjeet Singh Haitham Mahmoud Andrés Murillo http://arxiv.org/abs/2606.18481v1 Designing L5: A Permacomputing Approach to Creative Coding 2026-06-16T20:44:41Z Creative coding libraries provide high-level tools that make computational and algorithmic art accessible to artists and learners. Processing/p5 is one such family of libraries, known for its beginner-friendly approach and wide reach across artistic and technical communities. L5 is a new member of this family, implemented in Lua using the LOVE framework. It applies permacomputing principles, a movement addressing sustainability in computing inspired by permaculture, bringing these values to a community of practice not historically centered on them. This paper explores L5's design decisions and tensions between sustainability and usability through five case studies: 1. balancing perceived simplicity versus exposing the seams, 2. designing for lower resource consumption, 3. ensuring long-term stability, 4. constraining functionality, and 5. designing documentation for resource-constrained access. Rather than optimizing for a single metric, sustainable creative tools require navigating competing values transparently. 2026-06-16T20:44:41Z 10 pages, 1 figure, In LIMITS 26: Workshop on Computing within Limits, June 23 - 25, 2026 Lee Tusman Kit Kuksenok