https://arxiv.org/api/8EmHigroEqrchcsbG7B1cG778zQ 2026-06-09T20:34:05Z 27161 0 15 http://arxiv.org/abs/2606.09800v1 FASE: Fast Adaptive Semantic Entropy for Code Quality 2026-06-08T17:53:05Z Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows. 2026-06-08T17:53:05Z Shizhe Lin Ladan Tahvildari http://arxiv.org/abs/2606.09645v1 Modeling Components and Connections in Cyber-Physical Systems 2026-06-08T15:39:43Z Text based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects. 2026-06-08T15:39:43Z Kate Sanborn Tanuj Kenchannavar Vakul Nath Jonathan Sprinkle http://arxiv.org/abs/2606.09637v1 Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation 2026-06-08T15:34:29Z Personas are widely used in software engineering to support requirements elicitation, design, and validation, but their manual creation is costly, time-consuming, and hard to scale. Recent LLM-based approaches automate persona generation from textual data; however, they typically rely on single-shot generation and subjective evaluations, limiting practical reliability. We present PerGent, an industry-grade method for persona generation built around an iterative critique-refinement loop. Specifically, PerGent uses a generator and a critic LLM agent, coordinated by an orchestrator, to iteratively refine personas using external resources such as interviews, surveys, and job postings through a critique-refinement loop with a user-defined maximum number of rounds. We deploy and evaluate PerGent in an industrial setting at Kinaxis, comparing it with three baselines, including one-shot methods. In an expert in-situ evaluation, PerGent achieved the highest expert approval rate (96.9%), exceeding all baselines. We further compare PerGent-generated personas with best-practice personas manually created by domain experts prior to the adoption of LLMs. Compared to baselines, PerGent reproduces a larger proportion of expert content while also contributing substantial new content beyond the pre-LLM personas. We conclude with lessons learned from deploying and evaluating PerGent at Kinaxis. 2026-06-08T15:34:29Z Accepted in the Industry Track of the Requirements Engineering (RE) 2026 Conference Mohammad Hossein Amini David Dewar Shiva Nejati Mehrdad Sabetzadeh http://arxiv.org/abs/2507.00322v2 Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones 2026-06-08T15:24:04Z Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms''), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms''). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from $0$% to around $100$% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around $20$%. 2025-06-30T23:35:19Z 23 pages, 10 figures, accepted for NeurIPS 2025 Daking Rai Samuel Miller Kevin Moran Ziyu Yao http://arxiv.org/abs/2606.09577v1 Code Is More Than Text: Uncertainty Estimation for Code Generation 2026-06-08T14:52:43Z Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports. 2026-06-08T14:52:43Z Yuling Shi Caiqi Zhang Yuexian Li Haopeng Wang Yeheng Chen Nigel Collier Xiaodong Gu http://arxiv.org/abs/2603.04177v2 CodeTaste: Can LLMs Generate Human-Level Code Refactorings? 2026-06-08T14:51:08Z LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. We investigate whether agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. To this end, we construct CodeTaste, a benchmark mined from large multi-file open-source refactorings. To score solutions, we combine repository test suites that measure functional correctness with tailored static checks that verify removal of undesired and introduction of desired code patterns using dataflow reasoning. Our results show a clear gap: agents perform well at implementing refactorings that are specified in detail, but often fail to discover the human refactoring choices when given a focus area for changes. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases. We release the benchmark, leaderboard, and code. 2026-03-04T15:34:18Z Alex Thillen Niels Mündler Veselin Raychev Martin Vechev http://arxiv.org/abs/2606.09528v1 Relocate and Emulate: Re-Hosting Android's Application Layer 2026-06-08T14:14:38Z Dynamic analysis of Android's application layer typically relies on physical devices, limiting scalability and reproducibility. To compensate, we introduce a systematic re-hosting method that relocates the Android framework and pre-installed software from real device firmware into a fully emulated environment. Our approach integrates vendor-specific components into the Android Open Source Project (AOSP) build system using tailored extraction and injection strategies, producing vendor-flavoured emulator images that preserve system integrity and runtime compatibility. This enables dynamic execution of real-world framework and application-layer components, including proprietary binaries and pre-installed apps, across multiple SDK versions. We evaluate our method on 184 firmware samples from SDK 31-33. It achieves high build and boot success rates, with residual failures primarily occurring during core-service initialization due to baseline strategy limitations, missing dependencies, device-protection checks, or emulator constraints. However, the modular design allows injection strategies to be extended for specific firmware, supporting broader compatibility and future research on automated, adaptive re-hosting. Though we identified potential for optimization through engineering vendor-specific solutions, our research demonstrates the feasibility of vendor-flavoured emulators for scalable, reproducible dynamic analysis. 2026-06-08T14:14:38Z Conference: IEEE International Conference on Software Analysis, Evolution and Reengineering Thomas Sutter Timo Kehrer Bernhard Tellenbach Marc Rennhard 10.1109/SANER67736.2026.00053 http://arxiv.org/abs/2602.05458v2 Emergence-as-Code as a Foundation for Self-Governing Reliable Systems 2026-06-08T12:59:42Z Service-level objective (SLO)-as-code tools make per-service reliability declarative, but users experience journeys: end-to-end executions whose availability and tail latency emerge from topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. Journey objectives are therefore often maintained outside code and drift away from the effective runtime graph. We propose Emergence-as-Code (EmaC), a declarative contract that compiles journey-level SLI bounds and governance artifacts for declared SLOs from intent and evidence. An EmaC specification defines a typed journey expression, leaf bindings to atomic SLOs and telemetry, failure-domain assumptions, and guarded actions. Model Discovery proposes evidence-backed deltas for edges, branch probabilities, redundancy groups, and failure-domain hypotheses; each delta carries provenance and confidence. The compiler derives optimistic and pessimistic journey bounds and emits reviewable governance artifacts. An executable checkout replay shows that local SLOs can remain green while evidence-backed discovery changes the failure-domain model, collapses the pessimistic payment-race bound, and changes the rollout decision from pass to fail or review. 2026-02-05T09:04:33Z Anatoly A. Krasnovsky http://arxiv.org/abs/2606.09416v1 Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer 2026-06-08T12:29:54Z Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh. 2026-06-08T12:29:54Z 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026) Sanghoon Lee Jiyeong Chae Kyung-Joon Park http://arxiv.org/abs/2510.02404v2 Dynamic Function Configuration and its Management in Serverless Computing: A Taxonomy and Future Directions 2026-06-08T12:22:32Z The serverless cloud computing model offers a framework where the service provider abstracts the underlying infrastructure management from developers. In this serverless model, FaaS provides an event-driven, function-oriented computing service characterised by fine-grained, usage-based pricing that eliminates cost for idle resources. Platforms like AWS Lambda, Azure Functions, and Cloud Run Functions require developers to configure their function(s) with minimum operational resources for its successful execution. This resource allocation influences both the operational expense and the performance quality of these functions. However, a noticeable lack of platform transparency forces developers to rely on expert knowledge or experience-based ad-hoc decisions to request desired function resources. This makes optimal resource configuration a non-trivial task while adhering to performance constraints. Furthermore, while commercial platforms often scale resources like CPU and network bandwidth proportional to memory, open-source frameworks permit independent configuration of function resources, introducing additional complexity for developers aiming to optimise their functions. These complexities have directed researchers to resolve developer challenges and advance towards an efficient server-less execution model. In this article, we identify different aspects of resource configuration techniques in FaaS settings and propose a taxonomy of factors that influence function design, configuration, run-time cost, and performance guarantees. We conduct an analysis of existing literature on resource configuration to present a comprehensive review of current studies on function configuration. We also identify existing research gaps and suggest future research directions to enhance function configuration and strengthen the capabilities of serverless computing environments to drive its broader adoption. 2025-10-02T03:09:54Z 36 pages, 2 figures, 2 tables, journal Siddharth Agarwal Maria A. Rodriguez Rajkumar Buyya http://arxiv.org/abs/2606.09395v1 Empirical Study for Structured Output Control in LLMs for Software Engineering 2026-06-08T12:13:58Z LLM-generated outputs in software engineering rarely exist in isolation. They must plug into toolchains, APIs, and data pipelines that impose strict, often organization-specific structural contracts. A semantically correct output that violates the expected format is, from the consuming system's perspective, indistinguishable from a wrong answer, making structural fidelity an operational prerequisite for deploying LLMs in practice. Yet current models routinely produce syntactically invalid or structurally non-compliant outputs. Unlike encoders, autoregressive decoders generate text token-by-token with a local rather than global focus, amplifying structural fragility whenever the target format deviates from familiar training distributions. We present a systematic evaluation of structural reliability across four representative SE tasks, categorizing failures into syntax, structural, and semantic errors. We benchmark ways of mitigation targeting the decoder: grammar-constrained decoding, regex-based validation, and a strict template-driven control (Template Token Match Generation, TTMG) to isolate the sources of these failures. TTMG nearly eliminates syntax errors, yet substantial structural and semantic errors persist, demonstrating that the core bottleneck lies beyond syntax formatting. A detailed case study further illustrates how residual errors cascade in downstream workflows. Our findings show that current structure-enforcing tools are necessary but insufficient, and highlight the need for approaches that jointly ensure structural fidelity and semantic correctness in LLM-driven workflows. 2026-06-08T12:13:58Z 35 pages Yewei Song Prateek Rajput Tiezhu Sun Saad Ezzini Tegawendé F. Bissyandé Jacques Klein http://arxiv.org/abs/2605.28510v2 Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets 2026-06-08T10:22:34Z Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors based on fingerprinting, such as Winnowing, remain highly effective, yet the inspection requires comparing fragments of code to the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline HYBRIDSOURCETRACKER (HST). HST first narrows down a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. Then, starting from windows >= 60 tokens, it consistently over-performs by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs. 2026-05-27T14:12:17Z Andrea Gurioli Davide D'Ascenzo Federico Pennino Maurizio Gabbrielli Stefano Zacchiroli http://arxiv.org/abs/2606.09182v1 Understanding How Enterprises Adopt the Model Context Protocol for LLM-Driven Software Engineering 2026-06-08T08:20:50Z Large Language Models (LLMs) are increasingly used in AI-based software engineering, but their limitations in complex task execution and multi-tool coordination have driven growing interest in the Model Context Protocol (MCP). Existing research has mainly focused on MCP's technical design, with limited empirical evidence on how it is adopted and used in enterprise practice, particularly with regard to deployment challenges, operational risks, and practitioner expectations. To address this gap, we conducted semi-structured interviews with 20 practitioners from eight companies in the Internet and financial sectors. The findings show that MCP is valued for supporting cross-system collaboration, task decoupling, and knowledge reuse in LLM-based workflows, but its adoption remains constrained by ecosystem fragmentation, cross-component coordination difficulties, and unresolved problems in distributed state management and fault diagnosis. Participants also expressed strong demand for better standardization, lower adoption barriers through low-code or plugin-based approaches, and more systematic operational support. These results provide early empirical evidence on enterprise MCP practice and offer practical implications for improving MCP's standardization, usability, and deployment readiness in real-world software engineering environments. 2026-06-08T08:20:50Z 12pages, preliminary version accepted at the 26th International Conference on Quality, Reliability, and Security (QRS 2026) Kehui Chen Yicheng Sun Jacky Keung Zhenyu Mao Xiaoxue Ma http://arxiv.org/abs/2512.15422v2 A Survey on the Applications of Generative Artificial Intelligence in Automated Driving Systems Test Scenario Generation Methods 2026-06-08T07:34:10Z Ensuring the safety and reliability of Automated Driving Systems (ADS) remains a critical challenge, as traditional verification methods such as large-scale on-road testing are prohibitively costly and time-consuming.To address this,scenario-based testing has emerged as a scalable and efficient alternative,yet existing surveys provide only partial coverage of recent methodological and technological advances.This review systematically analyzes 31 primary studies,and 10 surveys identified through a comprehensive search spanning 2015~2025;however,the in-depth methodological synthesis and comparative evaluation focus primarily on recent frameworks(2023~2025),reflecting the surge of Artificial Intelligent(AI)-assisted and multimodal approaches in this period.Traditional approaches rely on expert knowledge,ontologies,and naturalistic driving or accident data,while recent developments leverage generative models,including large language models,generative adversarial networks,diffusion models,and reinforcement learning frameworks,to synthesize diverse and safety-critical scenarios.Our synthesis identifies three persistent gaps:the absence of standardized evaluation metrics,limited integration of ethical and human factors,and insufficient coverage of multimodal and Operational Design Domain (ODD)-specific scenarios.To address these challenges,this review contributes a refined taxonomy that incorporates multimodal extensions,an ethical and safety checklist for responsible scenario design,and an ODD coverage map with a scenario-difficulty schema to enable transparent benchmarking.Collectively,these contributions provide methodological clarity for researchers and practical guidance for industry,supporting reproducible evaluation and accelerating the safe deployment of higher-level ADS. 2025-12-17T13:14:15Z 40 pages,7 figures Ji Zhou Institute of Automotive Engineering, Graz University of Technology, Graz, Austria Yongqi Zhao Institute of Automotive Engineering, Graz University of Technology, Graz, Austria Yixian Hu Institute of Automotive Engineering, Graz University of Technology, Graz, Austria Hexuan Li Institute of Automotive Engineering, Graz University of Technology, Graz, Austria Zhengguo Gu Institute of Automotive Engineering, Graz University of Technology, Graz, Austria Nan Xu National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin university Arno Eichberger Institute of Automotive Engineering, Graz University of Technology, Graz, Austria http://arxiv.org/abs/2606.09122v1 Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations 2026-06-08T07:15:53Z Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale. 2026-06-08T07:15:53Z 7 pages, 6 figures Arun Malik