https://arxiv.org/api/ehwHb2J12udlcVhafjGPxgspUog 2026-06-21T11:36:09Z 27359 75 15 http://arxiv.org/abs/2606.17799v1 Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering 2026-06-16T11:21:01Z Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on. 2026-06-16T11:21:01Z Maria I. Gorinova Macey Baker Amy Heineike Maksim Shaposhnikov Rob Willoughby Dru Knox http://arxiv.org/abs/2606.04993v2 Code Lifespan Survival Analysis (CLSA): Predicting the Survival of Source Code Lines Using AST-Aware Mining 2026-06-16T09:25:12Z Context: Predicting which source lines will be deleted - and when - matters for maintenance, technical debt, and review prioritization. Existing MSR approaches work at file or method granularity, masking individual-statement risk. Objective: We introduce Code Lifespan Survival Analysis (CLSA), the first framework to model code survival at individual-line granularity. CLSA treats each line as a right-censored subject and estimates deletion risk from structural, contextual, and temporal covariates; its strongest predictors are computable statically from one file (AST structure plus line entropy), without version history or bug data. Method: We mine 32.5 million line birth events from 120 open-source TypeScript repositories. A 5-stage bipartite matching pipeline separates true deletions from refactoring noise (migrations and rewrites), preventing 8.3 million false deaths. We fit a Cox Proportional Hazards model with 15 covariates and check robustness via Weibull/Log-Logistic AFT, gamma frailty, and time-stratified landmark models. Results: More than half of all lines are never deleted (Kaplan-Meier median not reached); among deleted lines the median lifespan is 95.7 days. Covariate effects are strongly time-varying, forming three regimes. Line Shannon entropy is moderately protective for new code (HR=0.84, 0-90 days) and strongly protective for mature code (HR=0.36, 365+ days), explaining its proportional-hazards violation. Lines in conditional branches reverse: protective at birth (HR=0.97), a risk factor after 90 days (HR=1.21). Repository identity is the largest factor: a gamma frailty model (variance theta=1.449) raises concordance from 0.586 to 0.666, outweighing every structural covariate. Conclusion: Line-level survival modeling is tractable, yielding interpretable, mostly static risk signals and a calibration recipe for time-conditional risk scoring in IDEs and code review. 2026-06-03T15:13:31Z Pavel Gurov 10.5281/zenodo.20714794 http://arxiv.org/abs/2606.17666v1 FacProcessTwin: An LLM-Based System for Process Twin Development 2026-06-16T08:27:43Z Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, developing a process twin is costly. It requires accurately modelling the entire production process: its process steps, the equipment and product-specific settings each step uses, and its process variations. The resulting model must then be bound to live operational data. We present FacProcessTwin, a system that leverages a large language model (LLM) to reduce this development time, building a process twin from a plant's process documentation and natural-language input from an operator. FacProcessTwin generates this complete process model and then automatically binds its process steps to live operational data. The generated model and its data bindings are rendered as an interactive process diagram through which manufacturing personnel can monitor and correct the system's autonomous decisions, such as resolving uncertainty at safety-critical binding steps. We evaluate FacProcessTwin through a real-world case study of an Australian food manufacturer, covering 16 production process flows that span chilled, frozen, and aseptic shelf-stable product categories and include process variations within the same product. The results show that FacProcessTwin generates these process models accurately (a mean F1 of 95.2% against ground truth) and builds each twin in roughly a sixth of the manual time. Its human-in-the-loop governance then keeps the safety-critical bindings correct: at ambiguous tags where a single-pass baseline silently mis-binds 75.0% of the time, FacProcessTwin defers to the operator and mis-binds none. 2026-06-16T08:27:43Z Yash Pulse Yong-Bin Kang Abhik Banerjee Prem Prakash Jayaraman http://arxiv.org/abs/2606.17612v1 PracRepair: LLM-Empowered Automated Program Repair Inspired by Human-Like Debugging Practices 2026-06-16T07:18:37Z As software systems grow in scale and complexity, debugging and repair remain costly and time-consuming. Large language models (LLMs) have advanced automated program repair (APR), but existing LLM-based APR approaches still largely rely on static or retrieved context, error messages, and coarse-grained validation outcomes. As a result, they underutilize dynamic information for failure understanding and repair, including failure-execution dynamics and patch-validation dynamics. Effectively leveraging such information, however, is challenging: failure-execution traces are large and noisy, raw static-dynamic context is not self-explanatory, and patch-validation dynamics are often reduced to coarse feedback. To address these challenges, we propose \textsc{PracRepair}, a fully automated LLM-based APR framework inspired by human-like debugging practices. \textsc{PracRepair} constructs an on-demand static-dynamic context from buggy programs and failure executions, performs question-driven failure diagnosis to formulate explicit repair hypotheses, and iteratively refines candidate patches using validation diagnostics and trace-level behavioral changes. Experimental results on Defects4J V1.2 and V2.0 show that \textsc{PracRepair} consistently outperforms state-of-the-art baselines. Specifically, under GPT-3.5, \textsc{PracRepair} correctly fixes 139/136 bugs on Defects4J V1.2/V2.0, while under GPT-4o it further improves to 162/171. Moreover, \textsc{PracRepair} generalizes effectively to RWB (Real-World Bugs), achieving the best performance across multiple foundation models. 2026-06-16T07:18:37Z Yu Cheng Zhongxin Liu Zhenchang Xing Chao Ni Qing Huang Xiaoxue Ren http://arxiv.org/abs/2606.17593v1 Why Model Credibility Isn't Enough: -Rethinking Trust in Simulation Architectures 2026-06-16T06:56:30Z Credibility of a simulation model is an important topic. Several approaches try to quantify the credibility of simulation. However, models are mostly assembled within a simulation architecture. Can the credibility of a simulation architecture be assessed based on the credibility of the models that comprise it? This paper aims to address this issue by providing an overview of the current state of the art in the field of assembly credibility. It will compare sensitivity analysis techniques, qualitative analysis by experts, explainability in AI, and networks. Finally, an assessment of the proposed approaches, based on criteria such as rigor, generalization, and resource requirements, will reveal the strengths and weaknesses of each approach. 2026-06-16T06:56:30Z Annual Congress of Japan Society of Automotive Engineers (JSAE), May 2026, Yokohama, Japan Romain Barbedienne Adeline Lanugue Rim Kaddah Julien Silande Anthony Levillain Cedric Leclerc Maxime Hayet Boussaad Soualmi Cristian Maxim http://arxiv.org/abs/2606.17588v1 Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations 2026-06-16T06:51:04Z Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened independently by human experts and LLMs in zero-shot mode, resulting in Kappa values ranging from 0.52 to 0.77. Qualitative analysis suggests that human-LLM disagreement results from recurring, identifiable causes, such as boundary ambiguity in key terms, keyword overemphasization, and incorrect topic inference. Based on these findings, we propose recommendations such as validating semantic understanding before deployment, running multiple LLMs, and focusing validation efforts on borderline cases. Future studies are needed to validate the impact of our recommendations, and community efforts are needed to develop normative guidelines on LLM usage in SRs. 2026-06-16T06:51:04Z 14 pages + references. Accepted for publication in the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2026) Mika Mäntylä Patricia Matsubara Katia Romero Felizardo Miikka Kuutila Marco Gerosa Savio de Sousa Sampaio Tayana Conte Igor Steinmacher http://arxiv.org/abs/2401.01571v2 Principles and Practices of Large-Scale Code Analysis at Ant Group: A Data- and Logic-Oriented Approach 2026-06-16T06:38:52Z Large-scale software development requires dynamic and multifaceted static code analysis that extends beyond the capabilities of traditional tools. Existing tools like CodeQL lack cross-language analysis capabilities and can be time-consuming and resource-intensive. We present CodeFuse-Query, a data system tailored for large-scale code analysis. First, CodeFuse-Query adopts a Logic-Oriented Computation Design, employing Datalog with a two-tiered schema, COREF, to convert source code into data facts, and Godel to express complex analysis tasks in logical terms. Furthermore, CodeFuse-Query adopts a Domain-Optimized System Design. This approach optimizes resource utilization, prioritizes data reusability, applies incremental code extraction, and introduces task-type characteristics specifically for code changes, underscoring its domain-optimized design. We present empirical results demonstrating CodeFuse-Query's robustness, scalability, and efficiency in large-scale real-world scenarios at Ant Group, where it serves as a core static analysis infrastructure. Deployed in production environments, CodeFuse-Query processes up to 10 billion lines of code daily across more than 300,000 distinct analysis tasks. CodeFuse-Query has been open-sourced. 2024-01-03T06:56:39Z Xiaoheng Xie Gang Fan Xiaojun Lin Ang Zhou Shijie Li Xunjin Zheng Yinan Liang Yu Zhang Na Yu Haokun Li Xinyu Chen Yingzhuang Chen Yi Zhen Dejun Dong Xianjin Fu Jinzhou Su Fuxiong Pan Pengshuai Luo Youzheng Feng Ruoxiang Hu Hanyang Guo Jing Fan Xiao Xiao Peng Di 10.1145/3786583.3786907 http://arxiv.org/abs/2606.17514v1 Unlocking LLM Code Correction with Iterative Feedback Loops 2026-06-16T04:47:42Z Large Language Models have shown remarkable capabilities in code generation. However, most existing evaluations focus only on single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study presents a systematic investigation of LLMs' ability to rectify their own code through execution feedback. Using real-world programming problems across four models and two major programming languages, this study evaluates performance using iterative refinement framework where LLMs receive compiler error messages and testcase feedback after each attempt. This study introduces metrics to evaluate code failures, analyze rectification patterns, and compare the effectiveness of reasoning and non-reasoning models, offering actionable insights into both the understanding and practical application of feedback loops in LLM-driven code generation systems. Results show that reasoning models consistently improve over iterations, substantially outperforming non-reasoning models in leveraging feedback, while syntactic and runtime errors are far more tractable than logical or algorithmic failures. 2026-06-16T04:47:42Z 22 pages, 14th Computing Conference 2026 Le Zhang Suresh Kothari http://arxiv.org/abs/2606.17508v1 When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs 2026-06-16T04:40:04Z Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic. A model trained against a single label is learning to guess one outcome of a random process. We turn this around and use the nondeterminism as a training signal. We run each program many times, aggregate the observed next events into an empirical distribution, and fine-tune a 7B model to match that distribution with a KL objective. On 798 held-out predictions drawn from real production Go bugs (CockroachDB, Kubernetes, gRPC, etcd), fine-tuning on fewer than a thousand traces reaches 36.2% accuracy, ahead of Gemini 3.5 Flash used zero-shot (34.8%) and the same model without fine-tuning (28.6%). Distribution training matches cross-entropy on accuracy (35.8% vs. 36.2%) while reducing Expected Calibration Error from 0.205 to 0.169. We also derive a formal goroutine-leak signature for a class of select-blocked goroutines where P(GoUnblock)=0 holds by scheduler semantics, not by learning. We release the dataset, trained adapters, and all tooling. 2026-06-16T04:40:04Z 10 pages, 2 figures Kaviru Hapuarachchi http://arxiv.org/abs/2606.17507v1 LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline 2026-06-16T04:33:20Z Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides. 2026-06-16T04:33:20Z Xiwei Xu Chen Wang Jacky Jiang Phil Yang Qian Fu Mohan Dhall Wenjie Zhang Liming Zhu http://arxiv.org/abs/2606.10728v2 DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch 2026-06-16T03:49:03Z As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. In this paper, we introduce \textbf{DeNovoSWE}, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, where each instance requires generating a complete repository from documentation. Our dataset is automatically constructed through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. DeNovoSWE is constructed with "divide and conquer" and critic-repair philosophy. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%. 2026-06-09T11:37:15Z Jiale Zhao Guoxin Chen Fanzhe Meng Wayne Xin Zhao Ruihua Song Ji-Rong Wen Kai Jia http://arxiv.org/abs/2606.15817v2 ScratchLens: Lens-Parametric Behavioral Equivalence for Scratch Programs 2026-06-16T02:56:17Z Two Scratch programs can be syntactically far apart-renamed variables, split scripts, extracted custom blocks, or reordered initialization-and still behave identically; a one-block edit, such as replacing a blocking broadcast with an asynchronous one, can create divergences visible only under specific schedules. Deciding behavioral equivalence is central to automated feedback, grading support, and repair validation, yet tree differencing is too strict and single-run dynamic comparison is unsound for concurrent, random, and timing-dependent behavior. We observe that equivalence for block-based programs is lens-parametric: final state, frame traces, monitors, event causality, and debug traces induce different observation relations. ScratchLens makes this explicit through a taxonomy of causal divergence phenomena and observation lenses. It compiles Scratch projects into a causal IR of typed resources and semantic transactions, canonicalizes renamings, guards, and procedure bodies, quotients same-trigger concurrency with Mazurkiewicz trace normal forms, separates program order from races, and handles residual frontiers through SMT obligations and VM-backed counterexample-guided refinement. Conclusive verdicts carry evidence: equivalence by bijection and trace quotient, difference by a typed witness, and unresolved cases remain unknown. On a VM-witnessed mutation corpus from real Scratch projects, ScratchLens decides all 444 validated pairs and makes 0/158 false-equivalence claims on witnessed-different pairs under strict scoring. Structural, dynamic-only, and LLM baselines fail on the classes predicted by the taxonomy; ablations quantify the contribution of partial-order reduction and lens parametricity; and targeted scenarios expose ambiguous-mutant divergences missed by random testing. 2026-06-14T13:38:35Z Yuan Si Jialu Zhang http://arxiv.org/abs/2606.19388v1 Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen? 2026-06-16T02:36:22Z Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8\% and 51.9\%, outperforming every reproducible GUI baseline (69.3/68.1/57.8\% on AndroidWorld; 43.2/26.3/13.3\% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8\% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3\% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the \textbf{CLI-Advantage Task Suite}, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs.\ 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure. 2026-06-16T02:36:22Z Li Gu Zihuan Jiang Linqiang Guo Zhixiang Chi Ziqiang Wang Huan Liu Yuanhao Yu Tse-Hsun Chen Yang Wang http://arxiv.org/abs/2606.19387v1 Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement 2026-06-16T01:28:22Z Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework. 2026-06-16T01:28:22Z You Li Samuel Mandell David Z. Pan http://arxiv.org/abs/2606.17398v1 SoK: AI-Augmented Binary Reversing 2026-06-16T01:14:49Z Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language models (LLMs), and agentic AI systems have accelerated the adoption of AI-augmented binary reversing. Yet, the resulting body of work has become increasingly fragmented across reversing domains, artifact representations, learning approaches, and evaluation practices. This paper presents the first comprehensive systematization of knowledge on AI-augmented binary reversing. We analyze 144 research papers published since 2015, and organize them into 22 binary reversing domains according to the inference tasks. We further introduce a unified taxonomy spanning conventional and AI-augmented reversing pipelines. Our taxonomy connects traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and downstream inference tasks, while clarifying the emerging roles of LLMs and agentic AI systems. By establishing a common vocabulary and structured framework, we provide a holistic view of the field's evolution over the past decade. Our study reveals common structures underlying seemingly disparate approaches, highlights persistent technical challenges and evaluation gaps, and identifies promising opportunities for future research. Collectively, these insights clarify the current state of the field and provide a foundation for the next generation of reliable and scalable AI-augmented binary reversing systems. 2026-06-16T01:14:49Z 20 pages, 7 tables, 3 figures Yujeong Kwon Yiyue Zhang Shakhzod Yuldoshkhujaev Kexin Pei Dokyung Song Hyungjoon Koo