https://arxiv.org/api/8EmHigroEqrchcsbG7B1cG778zQ 2026-03-20T09:01:34Z 25302 0 15 http://arxiv.org/abs/2603.17973v2 TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis 2026-03-19T17:12:15Z

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions -- breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool that performs pre-change impact analysis for AI coding agents. TDAD builds a dependency map between source code and tests so that before committing a patch, the agent knows which tests to verify and can self-correct. The map is delivered as a lightweight agent skill -- a static text file the agent queries at runtime. Evaluated on SWE-bench Verified with two open-weight models running on consumer hardware (Qwen3-Coder 30B, 100 instances; Qwen3.5-35B-A3B, 25 instances), TDAD reduced regressions by 70% (6.08% to 1.82%) compared to a vanilla baseline. In contrast, adding TDD procedural instructions without targeted test context increased regressions to 9.94% -- worse than no intervention at all. When deployed as an agent skill with a different model and framework, TDAD improved issue-resolution rate from 24% to 32%, confirming that surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.

2026-03-18T17:38:22Z Toolpaper, 7 pages, 7 tables, 3 figures, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track) Pepe Alonso Sergio Yovine Victor A. Braberman http://arxiv.org/abs/2501.17026v4 Mitigating Omitted Variable Bias in Empirical Software Engineering 2026-03-19T17:01:11Z

Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. This results in the model attributing the missing variables' effect to some of the included variables -- hence over- or under-estimating the latter's true effect. Omitted variable bias presents a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering. This paper illustrates the impact of omitted variable bias on two illustrative examples in the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables. This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design, and to a significant reduction in its threats to validity.

2025-01-28T15:43:46Z Carlo A. Furia Richard Torkar http://arxiv.org/abs/2603.19138v1 Implicit Patterns in LLM-Based Binary Analysis 2026-03-19T16:56:56Z

Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploration over hundreds of reasoning steps remains poorly understood, due to limited context windows and implicit token-level behaviors. We present the first large-scale, trace-level study showing that multi-pass LLM reasoning gives rise to structured, token-level implicit patterns. Analyzing 521 binaries with 99,563 reasoning steps, we identify four dominant patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization that emerge implicitly from reasoning traces. These token-level implicit patterns serve as an abstraction of LLM reasoning: instead of explicit control-flow or predefined heuristics, exploration is organized through implicit decisions regulating path selection, commitment, and revision. Our analysis shows these patterns form a stable, structured system with distinct temporal roles and measurable characteristics. Our results provide the first systematic characterization of LLM-driven binary analysis and a foundation for more reliable analysis systems.

2026-03-19T16:56:56Z 18 pages Qiang Li XiangRui Zhang Haining Wang http://arxiv.org/abs/2511.02434v2 Who's Who? LLM-assisted Software Traceability with Architecture Entity Recognition 2026-03-19T13:48:31Z

Identifying architecturally relevant entities in textual artifacts is crucial for Traceability Link Recovery (TLR) between Software Architecture Documentation (SAD) and source code. While Software Architecture Models (SAMs) can bridge the semantic gap between these artifacts, their manual creation is time-consuming. LLMs offer new capabilities for extracting architectural entities from SAD and source code to construct SAMs automatically or establish direct trace links. This paper extends our ICSA 2025 paper [19], which introduced Extracting Architecture (ExArch) for LLM-based architecture component name extraction. The extension contributes the novel Architecture Traceability with Entity Matching via Semantic inference (ArTEMiS) approach, an extended evaluation with additional LLMs, configurations, a revised benchmark, and a combined evaluation of both approaches. Specifically, this paper presents the following approaches: ExArch extracts component names as simple SAMs from SAD and source code to eliminate the need for manual SAM creation, while ArTEMiS identifies architectural entities in documentation and matches them with (manually or automatically generated) SAM entities. Our evaluation compares against state-of-the-art approaches SWATTR, TransArC and ArDoCode. TransArC achieves strong performance (F1: 0.87) but requires manually created SAMs; ExArch achieves comparable results (F1: 0.86) using only SAD and code. ArTEMiS is on par with the traditional heuristic-based SWATTR (F1: 0.81) and can successfully replace it when integrated with TransArC. The combination of ArTEMiS and ExArch outperforms ArDoCode, the best baseline without manual SAMs. Our results demonstrate that LLMs can effectively identify architectural entities in textual artifacts, enabling automated SAM generation and TLR, making architecture-code traceability more practical and accessible.

2025-11-04T10:06:53Z Dominik Fuchß Haoyu Liu Sophie Corallo Tobias Hey Jan Keim Johannes von Geisau Anne Koziolek http://arxiv.org/abs/2603.14255v2 ITKIT: Feasible CT Image Analysis based on SimpleITK and MMEngine 2026-03-19T13:47:59Z

CT images are widely used in clinical diagnosis and treatment, and their data have formed a de facto standard - DICOM. It is clear and easy to use, and can be efficiently utilized by data-driven analysis methods such as deep learning. In the past decade, many program frameworks for medical image analysis have emerged in the open-source community. ITKIT analyzed the characteristics of these frameworks and hopes to provide a better choice in terms of ease of use and configurability. ITKIT offers a complete pipeline from DICOM to 3D segmentation inference. Its basic practice only includes some essential steps, enabling users with relatively weak computing capabilities to quickly get started using the CLI according to the documentation. For advanced users, the OneDL-MMEngine framework provides a flexible model configuration and deployment entry. This paper conducted 12 typical experiments to verify that ITKIT can meet the needs of most basic scenarios.

2026-03-15T07:25:06Z Yiqin Zhang Meiling Chen http://arxiv.org/abs/2601.19146v2 The Promise and Reality of Continuous Integration Caching: An Empirical Study of Travis CI Builds 2026-03-19T12:40:54Z

Continuous Integration (CI) provides early feedback by automatically building software, but long build durations can hinder developer productivity. CI services use caching to speed up builds by reusing infrequently changing artifacts, yet little is known about how caching is adopted in practice and what challenges it entails. In this paper, we conduct a large-scale empirical study of CI caching in Travis CI, analyzing 513,384 builds from 1,279 GitHub projects. We find that only 30% of projects adopt CI caching, and early adopters are typically more mature, with more dependencies, commits, and longer CI lifespans. To understand non-adoption, we submit pull requests enabling caching in non-adopting projects, and nearly half are accepted or merged. Developer feedback indicates that non- or late adoption mainly results from limited awareness of CI caching support. We further study cache maintenance and identify five common activities, performed by 24% of cache-enabled projects. While one-third of projects see substantial build-time reductions, cache uploads occur in 97% of builds, and 27% of projects contain stale cached artifacts. An analysis of reported caching issues shows developers mainly struggle with corrupted or outdated caches and request broader caching features. Overall, CI caching does not benefit all projects, requires ongoing maintenance, and is more complex in practice than many developers expect.

2026-01-27T03:23:19Z Accepted at the 30th International Conference on Evaluation and Assessment in Software Engineering (EASE '26) Taher A. Ghaleb Daniel Alencar da Costa Ying Zou http://arxiv.org/abs/2603.11103v2 Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining 2026-03-19T11:48:59Z

While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction. We hypothesize that reverse-engineering the latent agentic trajectories -- the planning, reasoning, and debugging steps -- behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code. Empirical results demonstrate that continuous pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.

2026-03-11T09:23:20Z Zhiyuan Zeng Yichi Zhang Yong Shan Kai Hua Siyuan Fang Zhaiyu Liu Jiaheng Liu Haozhe Wang Yining Zheng Ming Ding Ke Shen Ge Zhang Wenhao Huang Xipeng Qiu http://arxiv.org/abs/2511.08462v3 QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities 2026-03-19T10:53:05Z

Static analysis tools provide a powerful means to detect security vulnerabilities by specifying queries that encode vulnerable code patterns. However, writing such queries is challenging and requires diverse expertise in security and program analysis. To address this challenge, we present QLCoder - an agentic framework that automatically synthesizes queries in CodeQL, a powerful static analysis engine, directly from a given CVE metadata. QLCode embeds an LLM in a synthesis loop with execution feedback, while constraining its reasoning using a custom MCP interface that allows structured interaction with a Language Server Protocol (for syntax guidance) and a RAG database (for semantic retrieval of queries and documentation). This approach allows QLCoder to generate syntactically and semantically valid security queries. We evaluate QLCode on 176 existing CVEs across 111 Java projects. Building upon the Claude Code agent framework, QLCoder synthesizes correct queries that detect the CVE in the vulnerable but not in the patched versions for 53.4% of CVEs. In comparison, using only Claude Code synthesizes 10% correct queries.

2025-11-11T17:06:04Z Claire Wang Ziyang Li Saikat Dutta Mayur Naik http://arxiv.org/abs/2603.18741v1 Beyond the Code: A Multi-Modal Assessment Strategy for Fostering Professional Competencies via Introductory Programming Projects 2026-03-19T10:42:34Z

As the landscape of software engineering evolves, introductory programming courses must go beyond teaching syntax to foster comprehensive technical competencies and professional soft skills. This paper reports on a pedagogical experience in a "Fundamentals of Programming" course that used a Project-Based Learning (PBL) framework to develop a 2D "Maze Runner"-style game. While game development serves as a high-engagement vehicle for mastering core concepts, such as multidimensional arrays, control structures, and logic, the core of this study focuses on implementing a rigorous, multifaceted assessment model structured across four distinct dimensions: (1) an in-situ technical demonstration, evaluating real-time code execution and algorithmic robustness; (2) a technical screencast, requiring students to articulate their work in a concise audiovisual format; (3) a formal presentation to instructors, defending their project's design patterns and problem-solving strategies; and (4) a structured peer-review process, where students evaluated their colleagues' projects. Our findings suggest that this multi-dimensional approach not only improves student retention of programming fundamentals but also significantly enhances communication skills and critical thinking. By integrating peer evaluation and multimedia documentation, the course successfully bridges the gap between basic coding and the collaborative requirements of modern software engineering. This paper details the curriculum design, the challenges of implementing diverse assessment pillars, and the measurable impact on student performance and engagement, providing a scalable roadmap for educators looking to modernize introductory computing curricula.

2026-03-19T10:42:34Z Article submitted to IEEE Santiago Berrezueta-Guzman Vanesa Metaj Stefan Wagner http://arxiv.org/abs/2603.18740v1 Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review 2026-03-19T10:40:27Z

Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs. Study 2 evaluates exploitability in practice mimicking adversarial pull requests that reintroduce known vulnerabilities while framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Our results show that confirmation bias poses a weakness in LLM-based code review, with implications on how AI-assisted development tools are deployed.

2026-03-19T10:40:27Z Dimitris Mitropoulos Nikolaos Alexopoulos Georgios Alexopoulos Diomidis Spinellis http://arxiv.org/abs/2603.18735v1 SpaceTime Programming: Live and Omniscient Exploration of Code and Execution 2026-03-19T10:35:55Z

Programming environments typically separate the world of static code from the dynamic execution of programs. Developers must switch between writing code and observing its execution, often with limited tools to understand the relationship between code changes and runtime behavior. Several paradigms and approaches exist to bridge this gap, including exploratory programming for comparing code variants, live programming for immediate feedback, and omniscient debugging for exploring execution history. However, existing solutions tend to focus on specific aspects and one specific paradigm rather than providing a fully integrated environment with multiple capabilities. This paper introduces \spacetime Programming, a novel approach that unifies these paradigms to create a programming model for exploring both code modifications and execution flow. At the core of our approach is a trace mechanism that captures not only execution state but also the corresponding code changes, enabling developers to explore programs in both space (code variants) and time (execution flow). As a proof of concept, we implemented a Python library supporting SpaceTime Programming and applied it in two contexts: a live omniscient debugger and a Pygame game development tool, showcased through a Flappy Bird-like game. We further evaluated SpaceTimePy on five real-world Python projects, finding performance overhead ranging from 35% to 150% on test suites.

2026-03-19T10:35:55Z Jean-Baptiste Döderlein Djamel Eddine Khelladi Mathieu Acher Benoit Combemale http://arxiv.org/abs/2603.18734v1 Green Architectural Tactics in ML-enabled Systems: An LLM-based Repository Mining Study 2026-03-19T10:34:53Z

Context: The increasing adoption of machine learning (ML) and artificial intelligence (AI) technologies raises growing concerns about their environmental sustainability. Developing and deploying ML-enabled systems is computationally intensive, particularly during training and inference. Green AI has emerged to address these issues by promoting efficiency without sacrificing accuracy. While prior research has proposed catalogs of sustainable practices (i.e., green tactics), there remains limited understanding of their adoption in practice and whether additional, undocumented tactics exist. Objective: This study aims to investigate the extent to which existing sustainable practices are implemented in real-world ML-enabled systems and to identify previously undocumented practices that support environmental sustainability. Method: We conduct a mining software repository study on 205 open-source ML projects on GitHub. To support our analysis, we design a novel mechanism based on large language models (LLMs) capable of identifying both known and new sustainable practices from code repositories. Results: Our findings confirm that green tactics reported in the literature are used in practice, although adoption rates vary. Furthermore, our LLM-based approach reveals nine previously undocumented sustainable practices. Each tactic is supported with code examples to aid adoption and integration. Conclusions: We finally provide insights for practitioners seeking to reduce the environmental impact of ML-enabled systems and offer a foundation for future research in automating the detection and adoption of sustainable practices.

2026-03-19T10:34:53Z Vincenzo De Martino Silverio Martínez-Fernández Fabio Palomba http://arxiv.org/abs/2603.18693v1 Cross-Ecosystem Vulnerability Analysis for Python Applications 2026-03-19T09:52:29Z

Python applications depend on native libraries that may be vendored within package distributions or installed on the host system. When vulnerabilities are discovered in these libraries, determining which Python packages are affected requires cross-ecosystem analysis spanning Python dependency graphs and OS package versions. Current vulnerability scanners produce false negatives by missing vendored vulnerabilities and false positives by ignoring security patches backported by OS distributions. We present a provenance-aware vulnerability analysis approach that resolves vendored libraries to specific OS package versions or upstream releases. Our approach queries vendored libraries against a database of historical OS package artifacts using content-based hashing, and applies library-specific dynamic analyses to extract version information from binaries built from upstream source. We then construct cross-ecosystem call graphs by stitching together Python and binary call graphs across dependency boundaries, enabling reachability analysis of vulnerable functions. Evaluating on 100,000 Python packages and 10 known CVEs associated with third-party native dependencies, we identify 39 directly vulnerable packages (47M+ monthly downloads) and 312 indirectly vulnerable client packages affected through dependency chains. Our analysis achieves up to 97% false positive reduction compared to upstream version matching.

2026-03-19T09:52:29Z Georgios Alexopoulos Nikolaos Alexopoulos Thodoris Sotiropoulos Charalambos Mitropoulos Zhendong Su Dimitris Mitropoulos http://arxiv.org/abs/2603.18606v1 SQL-Commenter: Aligning Large Language Models for SQL Comment Generation with Direct Preference Optimization 2026-03-19T08:23:40Z

SQL query comprehension is a significant challenge due to complex syntax, diverse join types, and deep nesting. Many queries lack adequate comments, severely hindering code readability, maintainability, and knowledge transfer. Automated SQL comment generation faces two main challenges: limited datasets that inadequately represent complex real-world queries, and Large Language Models' (LLMs) insufficient understanding of SQL-specific semantics. Our empirical analysis shows that even after continual pre-training and supervised fine-tuning, LLMs struggle with complex SQL semantics, yielding inaccurate comments. To address this, we propose SQL-Commenter, an advanced method based on LLaMA-3.1-8B. We first construct a comprehensive dataset of complex SQL queries with expert-verified comments. Next, we perform continual pre-training on a large SQL corpus to enhance the LLM's syntax and semantic understanding, followed by supervised fine-tuning. Finally, we introduce Direct Preference Optimization (DPO) using human feedback. SQL-Commenter utilizes a preference-based loss function to favor preferred outputs, enhancing fine-grained semantic learning and context-dependent quality assessment. Evaluated on the Spider and Bird benchmarks, SQL-Commenter significantly outperforms state-of-the-art baselines. On average, it surpasses the strongest baseline (Qwen3-14B) by 9.29, 4.99, and 13.23 percentage points on BLEU-4, METEOR, and ROUGE-L, respectively. Moreover, human evaluation demonstrates the superior quality of comments generated by SQL-Commenter in terms of correctness, completeness, and naturalness.

2026-03-19T08:23:40Z Accepted to ICPC 2026 Lei Yu Peng Wang Jingyuan Zhang Xin Wang Jia Xu Li Yang Changzhi Deng Jiajia Ma Fengjun Zhang http://arxiv.org/abs/2512.19980v2 Neuron-Guided Interpretation of Code LLMs: Where, Why, and How? 2026-03-19T08:11:51Z

Code language models excel on code intelligence tasks, yet their internal interpretability is underexplored. Existing neuron interpretability techniques from NLP are suboptimal for source code due to programming languages formal, hierarchical, and executable nature. We empirically investigate code LLMs at the neuron level, localizing language-specific neurons (selectively responsive to one language) and concept layers (feed-forward layers encoding language-agnostic code representations). We analyze Llama-3.1-8B and Qwen2.5-Coder-32B on multilingual inputs in C++, Java, Python, Go, and JavaScript, measuring neuron selectivity and layerwise contributions during generation. We find (1) neurons specialized for individual languages alongside a universal subset supporting general-purpose generation; and (2) lower layers mainly encode language-specific syntax, while middle layers capture semantic abstractions shared across languages, emerging as concept layers. We demonstrate utility on three tasks: neuron-guided fine-tuning for code generation, clone detection via concept-layer embeddings, and concept-layer-guided transfer for code summarization, each yielding consistent gains in multilingual settings.

2025-12-23T02:04:13Z Accepted by FSE2026 Zhe Yin Xiaodong Gu Beijun Shen