https://arxiv.org/api/xaM9uF/l9q+UwDBGlarRE0TFf/s 2026-03-22T13:31:02Z 25302 60 15 http://arxiv.org/abs/2603.17104v1 When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents 2026-03-17T19:53:35Z

Current coding-agent benchmarks usually pro- vide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through in- teraction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulne Ss Loss U nder eM ergent s Pecification (SLUMP), defined as the reduc- tion in final implementation faithfulness un- der emergent specification relative to a single- shot specification control. The benchmark con- tains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable compo- nents, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure audit to verify that scored components are recoverable from the visible interaction. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively. Structural integration degrades under emergent specification on both platforms, while seman- tic faithfulness loss is substantial on Claude Code and small on Codex. As a mitigation case study, we introduce ProjectGuard, an exter- nal project-state layer for specification tracking. On Claude Code, ProjectGuard recovers 90% of the faithfulness gap, increases fully faith- ful components from 118 to 181, and reduces severe failures from 72 to 49. These results identify specification tracking as a distinct eval- uation target for long-horizon coding agents.

2026-03-17T19:53:35Z Lu Yan Xuan Chen Xiangyu Zhang http://arxiv.org/abs/2603.16866v1 ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K 2026-03-17T17:59:49Z

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.

2026-03-17T17:59:49Z Website: https://manitwin.github.io/ Kaixuan Wang Tianxing Chen Jiawei Liu Honghao Su Shaolong Zhu Minxuan Wang Zixuan Li Yue Chen Huan-ang Gao Yusen Qin Jiawei Wang Qixuan Zhang Lan Xu Jingyi Yu Yao Mu Ping Luo http://arxiv.org/abs/2603.16791v1 Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers 2026-03-17T17:01:44Z

Novice programmers often struggle to comprehend code due to vague naming, deep nesting, and poor structural organization. While explanations may offer partial support, they typically do not restructure the code itself. We propose code refactoring as cognitive scaffolding, where cognitively guided refactoring automatically restructures code to improve clarity. We operationalize this in CDDRefactorER, an automated approach grounded in Cognitive-Driven Development that constrains transformations to reduce control-flow complexity while preserving behavior and structural similarity. We evaluate CDDRefactorER using two benchmark datasets (MBPP and APPS) against two models (gpt-5-nano and kimi-k2), and a controlled human-subject study with novice programmers. Across datasets and models, CDDRefactorER reduces refactoring failures by 54-71% and substantially lowers the likelihood of increased Cyclomatic and Cognitive complexity during refactoring, compared to unconstrained prompting. Results from the human study show consistent improvements in novice code comprehension, with function identification increasing by 31.3% and structural readability by 22.0%. The findings suggest that cognitively guided refactoring offers a practical and effective mechanism for enhancing novice code comprehension.

2026-03-17T17:01:44Z International Conference on Evaluation and Assessment in Software Engineering (EASE), 2026 Subarna Saha Alif Al Hasan Fariha Tanjim Shifat Mia Mohammad Imran http://arxiv.org/abs/2603.16790v1 InCoder-32B: Code Foundation Model for Industrial Scenarios 2026-03-17T17:01:35Z

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.

2026-03-17T17:01:35Z Jian Yang Wei Zhang Jiajun Wu Junhang Cheng Shawn Guo Haowen Wang Weicheng Gu Yaxin Du Joseph Li Fanglin Xu Yizhi Li Lin Jing Yuanbo Wang Yuhan Gao Ruihao Gong Chuan Hao Ran Tao Aishan Liu Tuney Zheng Ganqu Cui Zhoujun Li Mingjie Tang Chenghua Lin Wayne Xin Zhao Xianglong Liu Ming Zhou Bryan Dai Weifeng Lv http://arxiv.org/abs/2603.16733v1 IQuest-Coder-V1 Technical Report 2026-03-17T16:15:31Z

In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.

2026-03-17T16:15:31Z Jian Yang Wei Zhang Shawn Guo Zhengmao Ye Lin Jing Shark Liu Yizhi Li Jiajun Wu Cening Liu X. Ma Yuyang Song Siwei Wu Yuwen Li L. Liao T. Zheng Ziling Huang Zelong Huang Che Liu Yan Xing Renyuan Li Qingsong Cai Hanxu Yan Siyue Wang Shikai Li Jason Klein Liu An Huang Yongsheng Kang Jinxing Zhang Chuan Hao Haowen Wang Weicheng Gu Ran Tao Mingjie Tang Peihao Wu Jianzhou Wang Xianglong Liu Weifeng Lv Bryan Dai http://arxiv.org/abs/2406.07714v4 LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing 2026-03-17T15:37:48Z

Greybox fuzzing has achieved success in revealing bugs and vulnerabilities in programs. However, randomized mutation strategies have limited the fuzzer's performance on structured data. Specialized fuzzers can handle complex structured data, but require additional efforts in grammar and suffer from low throughput. In this paper, we explore the potential of utilizing the Large Language Model to enhance greybox fuzzing for structured data. We utilize the pre-trained knowledge of LLM about data conversion and format to generate new valid inputs. We further fine-tuned it with paired mutation seeds to learn structured format and mutation strategies effectively. Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLM to understand and mutate structured data to fuzzing. We conduct experiments on the standard bug-based benchmark Magma and a wide variety of real-world programs. LLAMAFUZZ outperforms our top competitor by 41 bugs on average. We also identified 47 unique bugs across all trials. Moreover, LLAMAFUZZ demonstrated consistent performance on both bug trigger and bug reached. Compared to AFL++, LLAMAFUZZ achieved 27.19% more branches in real-world program sets on average. We also demonstrate a case study to explain how LLMs enhance the fuzzing process in terms of code coverage.

2024-06-11T20:48:28Z The 7th ACM/IEEE International Conference on Automation of Software Test (AST 2026) Hongxiang Zhang Yuyang Rong Yifeng He Hao Chen http://arxiv.org/abs/2601.12186v2 Aletheia: What Makes RLVR For Code Verifiers Tick? 2026-03-17T15:26:31Z

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary drivers of RLVR performance and cost: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifiers across disparate model sizes and covariate shifts. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas thinking traces become the most vital factor for larger sizes. Furthermore, we show that negative samples stabilize training at large sizes, and scaling inference-time compute cannot compensate for any core RLVR component. These findings provide a compute-optimal roadmap for practitioners, offering concrete strategies to simplify verifier training based on model size. Consequently, our work establishes a foundation for scalable supervision, enabling efficiently trained code verifiers to reliably supervise much larger code generation policies.

2026-01-17T22:30:45Z 21 pages, 6 figures Vatsal Venkatkrishna Indraneil Paul Iryna Gurevych http://arxiv.org/abs/2509.25117v2 Towards Reliable Generation of Executable Workflows by Foundation Models 2026-03-17T15:13:28Z

Recent advancements in Foundation Models (FMs) have demonstrated significant progress in processing complex natural language to perform intricate tasks. Successfully executing these tasks often requires orchestrating calls to FMs alongside other software components. However, manually decomposing a task into a coherent sequence of smaller, logically aggregated steps, commonly referred to as workflows, demands considerable effort and specialized domain knowledge. While FMs can assist in generating such workflows specified in domain-specific languages (DSLs), achieving accuracy and reliability in this process remains a challenge. We introduce a framework that leverages static analysis feedback to enable FMs to detect and repair defects in the DSL-based workflows they generate. We begin by presenting an initial taxonomy of defect occurrences in FM-generated DSL workflows, categorizing them into 20 distinct types. Furthermore, we observe a high prevalence of defects across FM-generated DSL workflows, with 89.23% of the studied instances containing at least one defect. This high prevalence underscores the magnitude of the problem and the necessity for mitigation strategies. Following this, we demonstrate that nine types of these defects can be effectively identified through static analysis of the workflows. For this purpose, we develop Timon, the first-of-its-kind static analyzer specifically designed for FM-generated DSL workflows. Finally, we show that by incorporating feedback from Timon, we can guide Pumbaa, an FM-based tool, to repair the detected defect incidences. By systematically detecting and repairing defects, our work takes a crucial step towards the reliable and automated generation of executable workflows from natural-language requirements.

2025-09-29T17:42:45Z Sogol Masoumzadeh Keheliya Gallaba Dayi Lin Ahmed E. Hassan http://arxiv.org/abs/2601.14455v2 Unpacking Security Scanners for GitHub Actions Workflows 2026-03-17T14:40:25Z

GitHub Actions is a widely used platform to automate the build and deployment of software projects through configurable workflows. As the platform's popularity grows, it also becomes a target of choice for software supply chain attacks. These attacks exploit excessive permissions, ambiguous versions or the absence of artifact integrity checks to compromise the workflows. In response to these attacks, several security scanners have emerged to help developers harden their workflows. In this paper, we perform the first systematic comparison of 9 GitHub Actions Workflows security scanners. We compare them regarding scope (which security weaknesses they target), detection capabilities (how many weaknesses they detect), and performance (how long they take to scan a workflow). In order to compare the scanners on a common ground, we first establish a classification of 10 common security weaknesses that can be found in GitHub Actions Workflows. Then, we run the scanners against a curated set of 2722 workflows. Our study reveals that the landscape of GitHub Actions Workflows security scanners is very diverse, with both general purpose and focused scanners. More importantly, we provide evidence that these scanners implement fundamentally different analysis strategies, leading to major gaps regarding the nature and the number of reported security weaknesses. Based on these empirical evidence we make actionable recommendations for developers to harden their GitHub Actions Workflows.

2026-01-20T20:25:11Z 16 pages, 3 figures, 5 tables Madjda Fares Yogya Gamage Benoit Baudry http://arxiv.org/abs/2603.16577v1 Reasoning About Variability Models Through Network Analysis 2026-03-17T14:29:03Z

Feature models are widely used to capture the configuration space of software systems. Although automated reasoning has been studied for detecting problematic features and supporting configuration tasks, significantly less attention has been given to the systematic study of the structural properties of feature models at scale. The approach fills this gap by examining the models' structure through a network analysis perspective. We focus on three Research Questions concerning (i) the structural patterns exhibited by these graphs, (ii) the extent to which such patterns vary across domains and model sources, and (iii) the usefulness of network-based indicators for understanding, maintaining, and evolving variability models. To answer these questions, we analyze a dataset of 5,709 models from 20 repositories, spanning multiple application domains and varying sizes (ranging from 99 to 35,907 variables on their Boolean translation). To do so, graphs of transitive dependencies and conflicts between features are computed. Our results reveal consistent structural traits (e.g., the predominance of dependency relations, the presence of highly central features, or characteristic node degree distributions) as well as notable domain-specific deviations. These findings ease the identification of maintenance-relevant features, opportunities for modular decomposition, and indicators of structural fragility. This approach provides a scalable, graph-based foundation for the empirical analysis of variability models and contributes quantitative evidence to support future research on their structure and evolution.

2026-03-17T14:29:03Z Jose Manuel Sanchez Miguel Angel Olivero Ruben Heradio Luis Cambelo David Fernandez-Amoros http://arxiv.org/abs/2509.18808v2 SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement 2026-03-17T14:02:48Z

Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single-turn problem, overlooking the stepwise requirement changes and iterative workflows in real-world software development. This mismatch limits the understanding of how well LLMs can support real-world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn-specific test cases. To bridge this gap, we present SR-Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise requirements Refinement. SR-Eval spans both function-level and repository-level tasks in Python and Java, enabling fine-grained and progressive evaluation across evolving requirements. The construction of SR-Eval follows a carefully designed pipeline that first leverages a multi-agent-based requirement generation method to simulate the development process and recover the multi-round interaction process from final requirements, then employs a semantic-aware discriminative test case generation component to ensure discriminative and consistent evaluation at each turn. SR-Eval comprises 443 multi-turn tasks and 1,857 questions at both function and repository levels. Using SR-Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. Results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best-performing model achieves only 22.67% completion rate on function-level tasks and 20.00% on repository-level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for the development of advanced methods.

2025-09-23T08:59:05Z Zexun Zhan Shuzheng Gao Ruida Hu Cuiyun Gao http://arxiv.org/abs/2602.22579v2 Metamorphic Testing of Vision-Language Action-Enabled Robots 2026-03-17T13:56:42Z

Vision-Language-Action (VLA) models are multimodal robotic task controllers that, given an instruction and visual inputs, produce a sequence of low-level control actions (or motor commands) enabling a robot to execute the requested task in the physical environment. These systems face the test oracle problem from multiple perspectives. On the one hand, a test oracle must be defined for each instruction prompt, which is a complex and non-generalizable approach. On the other hand, current state-of-the-art oracles typically capture symbolic representations of the world (e.g., robot and object states), enabling the correctness evaluation of a task, but fail to assess other critical aspects, such as the quality with which VLA-enabled robots perform a task. In this paper, we explore whether Metamorphic Testing (MT) can alleviate the test oracle problem in this context. To do so, we propose two metamorphic relation patterns and five metamorphic relations to assess whether changes to the test inputs impact the original trajectory of the VLA-enabled robots. An empirical study involving five VLA models, two simulated robots, and four robotic tasks shows that MT can effectively alleviate the test oracle problem by automatically detecting diverse types of failures, including, but not limited to, uncompleted tasks. More importantly, the proposed MRs are generalizable, making the proposed approach applicable across different VLA models, robots, and tasks, even in the absence of test oracles.

2026-02-26T03:32:43Z Pablo Valle Sergio Segura Shaukat Ali Aitor Arrieta http://arxiv.org/abs/2603.16975v1 The State of Generative AI in Software Development: Insights from Literature and a Developer Survey 2026-03-17T12:59:59Z

Generative Artificial Intelligence (GenAI) rapidly transforms software engineering, yet existing research remains fragmented across individual tasks in the Software Development Lifecycle. This study integrates a systematic literature review with a survey of 65 software developers. The results show that GenAI exerts its highest impact in design, implementation, testing, and documentation, where over 70 % of developers report at least halving the time for boilerplate and documentation tasks. 79 % of survey respondents use GenAI daily, preferring browser-based Large Language Models over alternatives integrated directly in their development environment. Governance is maturing, with two-thirds of organizations maintaining formal or informal guidelines. In contrast, early SDLC phases such as planning and requirements analysis show markedly lower reported benefits. In a nutshell, GenAI shifts value creation from routine coding toward specification quality, architectural reasoning, and oversight, while risks such as uncritical adoption, skill erosion, and technical debt require robust governance and human-in-the-loop mechanisms.

2026-03-17T12:59:59Z Vincent Gurgul Robin Gubela Stefan Lessmann http://arxiv.org/abs/2603.15298v2 The Impact of AI-Assisted Development on Software Security: A Study of Gemini and Developer Experience 2026-03-17T10:48:08Z

The ongoing shortage of skilled developers, particularly in security-critical software development, has led organizations to increasingly adopt AI-powered development tools to boost productivity and reduce reliance on limited human expertise. These tools, often based on large language models, aim to automate routine tasks and make secure software development more accessible and efficient. However, it remains unclear how developers' general programming and security-specific experience, and the type of AI tool used (free vs. paid) affect the security of the resulting software. Therefore, we conducted a quantitative programming study with software developers (n=159) exploring the impact of Google's AI tool Gemini on code security. Participants were assigned a security-related programming task using either no AI tools, the free version, or the paid version of Gemini. While we did not observe significant differences between using Gemini in terms of secure software development, programming experience significantly improved code security and cannot be fully substituted by Gemini.

2026-03-16T13:59:06Z Nadine Jost Benjamin Berens Manuel Karl Stefan Albert Horstmann Martin Johns Alena Naiakshina http://arxiv.org/abs/2603.16357v1 Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs 2026-03-17T10:40:35Z

In this paper, we investigate the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. In contrast to existing work, which primarily evaluates proprietary LLMs, we focus on non-proprietary models, making our approach suitable for universities where transparency and cost are critical. Additionally, existing studies assess performance over complete diagrams rather than individual criteria, offering limited insight into how automated grading aligns with human evaluation. To address these gaps, we propose a grading pipeline in which student-generated UML class diagrams are independently evaluated by both teaching assistants (TAs) and LLMs. Grades are then compared at the level of individual criteria. We evaluate this pipeline through a quantitative study of 92 UML class diagrams from a software design course, comparing TA grades against assessments produced by six popular open-source LLMs. Performance is measured across individual criterion, highlighting areas where LLMs diverge from human graders. Our results show per-criterion accuracy of up to 88.56% and a Pearson correlation coefficient of up to 0.78, representing a substantial improvement over previous work while using only open-source models. We also explore the concept of an optimal model that combines the best-performing LLM per criterion. This optimal model achieves performance close to that of a TA, suggesting a possible path toward a mixed-initiative grading system. Our findings demonstrate that open-source LLMs can effectively support UML class diagram grading by explicitly identifying grading alignment. The proposed pipeline provides a practical approach to manage increasing assessment workloads with growing student counts.

2026-03-17T10:40:35Z 7 pages, 3 figures Matthijs Jansen op de Haar Nacir Bouali Faizan Ahmed