Feed: https://arxiv.org/api/8EmHigroEqrchcsbG7B1cG778zQ
Updated: 2026-03-20T09:01:34Z
Results: 25302 total; start index 0; 15 per page

http://arxiv.org/abs/2603.17973v2
Title: TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis
Updated: 2026-03-19T17:12:15Z
Abstract: AI coding agents can resolve real-world software issues, yet they frequently introduce regressions -- breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool that performs pre-change impact analysis for AI coding agents. TDAD builds a dependency map between source code and tests so that before committing a patch, the agent knows which tests to verify and can self-correct. The map is delivered as a lightweight agent skill -- a static text file the agent queries at runtime. Evaluated on SWE-bench Verified with two open-weight models running on consumer hardware (Qwen3-Coder 30B, 100 instances; Qwen3.5-35B-A3B, 25 instances), TDAD reduced regressions by 70% (6.08% to 1.82%) compared to a vanilla baseline. In contrast, adding TDD procedural instructions without targeted test context increased regressions to 9.94% -- worse than no intervention at all. When deployed as an agent skill with a different model and framework, TDAD improved issue-resolution rate from 24% to 32%, confirming that surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
Published: 2026-03-18T17:38:22Z
Comment: Tool paper, 7 pages, 7 tables, 3 figures, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track)
Authors: Pepe Alonso, Sergio Yovine, Victor A. Braberman

http://arxiv.org/abs/2501.17026v4
Title: Mitigating Omitted Variable Bias in Empirical Software Engineering
Updated: 2026-03-19T17:01:11Z
Abstract: Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. This results in the model attributing the missing variables' effect to some of the included variables -- hence over- or under-estimating the latter's true effect. Omitted variable bias presents a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering.
This paper illustrates the impact of omitted variable bias with two examples from the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables.
This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design, and to a significant reduction in its threats to validity.
Published: 2025-01-28T15:43:46Z
Authors: Carlo A. Furia, Richard Torkar

http://arxiv.org/abs/2603.19138v1
Title: Implicit Patterns in LLM-Based Binary Analysis
Updated: 2026-03-19T16:56:56Z
Abstract: Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploration over hundreds of reasoning steps remains poorly understood, due to limited context windows and implicit token-level behaviors. We present the first large-scale, trace-level study showing that multi-pass LLM reasoning gives rise to structured, token-level implicit patterns. Analyzing 521 binaries with 99,563 reasoning steps, we identify four dominant patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization that emerge implicitly from reasoning traces. These token-level implicit patterns serve as an abstraction of LLM reasoning: instead of explicit control-flow or predefined heuristics, exploration is organized through implicit decisions regulating path selection, commitment, and revision. Our analysis shows these patterns form a stable, structured system with distinct temporal roles and measurable characteristics. Our results provide the first systematic characterization of LLM-driven binary analysis and a foundation for more reliable analysis systems.
Published: 2026-03-19T16:56:56Z
Comment: 18 pages
Authors: Qiang Li, XiangRui Zhang, Haining Wang

http://arxiv.org/abs/2511.02434v2
Title: Who's Who? LLM-assisted Software Traceability with Architecture Entity Recognition
Updated: 2026-03-19T13:48:31Z
Abstract: Identifying architecturally relevant entities in textual artifacts is crucial for Traceability Link Recovery (TLR) between Software Architecture Documentation (SAD) and source code. While Software Architecture Models (SAMs) can bridge the semantic gap between these artifacts, their manual creation is time-consuming. LLMs offer new capabilities for extracting architectural entities from SAD and source code to construct SAMs automatically or establish direct trace links. This paper extends our ICSA 2025 paper [19], which introduced Extracting Architecture (ExArch) for LLM-based architecture component name extraction. The extension contributes the novel Architecture Traceability with Entity Matching via Semantic inference (ArTEMiS) approach, an extended evaluation with additional LLMs, configurations, a revised benchmark, and a combined evaluation of both approaches. Specifically, this paper presents the following approaches: ExArch extracts component names as simple SAMs from SAD and source code to eliminate the need for manual SAM creation, while ArTEMiS identifies architectural entities in documentation and matches them with (manually or automatically generated) SAM entities. Our evaluation compares against state-of-the-art approaches SWATTR, TransArC and ArDoCode. TransArC achieves strong performance (F1: 0.87) but requires manually created SAMs; ExArch achieves comparable results (F1: 0.86) using only SAD and code. ArTEMiS is on par with the traditional heuristic-based SWATTR (F1: 0.81) and can successfully replace it when integrated with TransArC. The combination of ArTEMiS and ExArch outperforms ArDoCode, the best baseline without manual SAMs.
Our results demonstrate that LLMs can effectively identify architectural entities in textual artifacts, enabling automated SAM generation and TLR, making architecture-code traceability more practical and accessible.
Published: 2025-11-04T10:06:53Z
Authors: Dominik Fuchß, Haoyu Liu, Sophie Corallo, Tobias Hey, Jan Keim, Johannes von Geisau, Anne Koziolek

http://arxiv.org/abs/2603.14255v2
Title: ITKIT: Feasible CT Image Analysis based on SimpleITK and MMEngine
Updated: 2026-03-19T13:47:59Z
Abstract: CT images are widely used in clinical diagnosis and treatment, and their data have formed a de facto standard: DICOM. It is clear and easy to use, and can be efficiently utilized by data-driven analysis methods such as deep learning. In the past decade, many program frameworks for medical image analysis have emerged in the open-source community. ITKIT analyzed the characteristics of these frameworks and hopes to provide a better choice in terms of ease of use and configurability. ITKIT offers a complete pipeline from DICOM to 3D segmentation inference. Its basic practice only includes some essential steps, enabling users with relatively weak computing capabilities to quickly get started using the CLI according to the documentation. For advanced users, the OneDL-MMEngine framework provides a flexible model configuration and deployment entry. This paper conducted 12 typical experiments to verify that ITKIT can meet the needs of most basic scenarios.
Published: 2026-03-15T07:25:06Z
Authors: Yiqin Zhang, Meiling Chen

http://arxiv.org/abs/2601.19146v2
Title: The Promise and Reality of Continuous Integration Caching: An Empirical Study of Travis CI Builds
Updated: 2026-03-19T12:40:54Z
Abstract: Continuous Integration (CI) provides early feedback by automatically building software, but long build durations can hinder developer productivity. CI services use caching to speed up builds by reusing infrequently changing artifacts, yet little is known about how caching is adopted in practice and what challenges it entails.
In this paper, we conduct a large-scale empirical study of CI caching in Travis CI, analyzing 513,384 builds from 1,279 GitHub projects. We find that only 30% of projects adopt CI caching, and early adopters are typically more mature, with more dependencies, commits, and longer CI lifespans. To understand non-adoption, we submit pull requests enabling caching in non-adopting projects, and nearly half are accepted or merged. Developer feedback indicates that non- or late adoption mainly results from limited awareness of CI caching support. We further study cache maintenance and identify five common activities, performed by 24% of cache-enabled projects. While one-third of projects see substantial build-time reductions, cache uploads occur in 97% of builds, and 27% of projects contain stale cached artifacts. An analysis of reported caching issues shows developers mainly struggle with corrupted or outdated caches and request broader caching features. Overall, CI caching does not benefit all projects, requires ongoing maintenance, and is more complex in practice than many developers expect.
Published: 2026-01-27T03:23:19Z
Comment: Accepted at the 30th International Conference on Evaluation and Assessment in Software Engineering (EASE '26)
Authors: Taher A. Ghaleb, Daniel Alencar da Costa, Ying Zou

http://arxiv.org/abs/2603.11103v2
Title: Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining
Updated: 2026-03-19T11:48:59Z
Abstract: While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction.
We hypothesize that reverse-engineering the latent agentic trajectories -- the planning, reasoning, and debugging steps -- behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code. Empirical results demonstrate that continuous pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.
Published: 2026-03-11T09:23:20Z
Authors: Zhiyuan Zeng, Yichi Zhang, Yong Shan, Kai Hua, Siyuan Fang, Zhaiyu Liu, Jiaheng Liu, Haozhe Wang, Yining Zheng, Ming Ding, Ke Shen, Ge Zhang, Wenhao Huang, Xipeng Qiu

http://arxiv.org/abs/2511.08462v3
Title: QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities
Updated: 2026-03-19T10:53:05Z
Abstract: Static analysis tools provide a powerful means to detect security vulnerabilities by specifying queries that encode vulnerable code patterns. However, writing such queries is challenging and requires diverse expertise in security and program analysis. To address this challenge, we present QLCoder, an agentic framework that automatically synthesizes queries in CodeQL, a powerful static analysis engine, directly from given CVE metadata. QLCoder embeds an LLM in a synthesis loop with execution feedback, while constraining its reasoning using a custom MCP interface that allows structured interaction with a Language Server Protocol (for syntax guidance) and a RAG database (for semantic retrieval of queries and documentation).
This approach allows QLCoder to generate syntactically and semantically valid security queries. We evaluate QLCoder on 176 existing CVEs across 111 Java projects. Building upon the Claude Code agent framework, QLCoder synthesizes correct queries that detect the CVE in the vulnerable but not in the patched versions for 53.4% of CVEs. In comparison, Claude Code alone synthesizes correct queries for 10% of CVEs.
Published: 2025-11-11T17:06:04Z
Authors: Claire Wang, Ziyang Li, Saikat Dutta, Mayur Naik

http://arxiv.org/abs/2603.18741v1
Title: Beyond the Code: A Multi-Modal Assessment Strategy for Fostering Professional Competencies via Introductory Programming Projects
Updated: 2026-03-19T10:42:34Z
Abstract: As the landscape of software engineering evolves, introductory programming courses must go beyond teaching syntax to foster comprehensive technical competencies and professional soft skills. This paper reports on a pedagogical experience in a "Fundamentals of Programming" course that used a Project-Based Learning (PBL) framework to develop a 2D "Maze Runner"-style game. While game development serves as a high-engagement vehicle for mastering core concepts, such as multidimensional arrays, control structures, and logic, the core of this study focuses on implementing a rigorous, multifaceted assessment model structured across four distinct dimensions: (1) an in-situ technical demonstration, evaluating real-time code execution and algorithmic robustness; (2) a technical screencast, requiring students to articulate their work in a concise audiovisual format; (3) a formal presentation to instructors, defending their project's design patterns and problem-solving strategies; and (4) a structured peer-review process, where students evaluated their colleagues' projects.
Our findings suggest that this multi-dimensional approach not only improves student retention of programming fundamentals but also significantly enhances communication skills and critical thinking. By integrating peer evaluation and multimedia documentation, the course successfully bridges the gap between basic coding and the collaborative requirements of modern software engineering. This paper details the curriculum design, the challenges of implementing diverse assessment pillars, and the measurable impact on student performance and engagement, providing a scalable roadmap for educators looking to modernize introductory computing curricula.
Published: 2026-03-19T10:42:34Z
Comment: Article submitted to IEEE
Authors: Santiago Berrezueta-Guzman, Vanesa Metaj, Stefan Wagner

http://arxiv.org/abs/2603.18740v1
Title: Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
Updated: 2026-03-19T10:40:27Z
Abstract: Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies.
Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs.
Study 2 evaluates exploitability in practice by mimicking adversarial pull requests that reintroduce known vulnerabilities while framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Our results show that confirmation bias poses a weakness in LLM-based code review, with implications for how AI-assisted development tools are deployed.
Published: 2026-03-19T10:40:27Z
Authors: Dimitris Mitropoulos, Nikolaos Alexopoulos, Georgios Alexopoulos, Diomidis Spinellis

http://arxiv.org/abs/2603.18735v1
Title: SpaceTime Programming: Live and Omniscient Exploration of Code and Execution
Updated: 2026-03-19T10:35:55Z
Abstract: Programming environments typically separate the world of static code from the dynamic execution of programs. Developers must switch between writing code and observing its execution, often with limited tools to understand the relationship between code changes and runtime behavior. Several paradigms and approaches exist to bridge this gap, including exploratory programming for comparing code variants, live programming for immediate feedback, and omniscient debugging for exploring execution history. However, existing solutions tend to focus on specific aspects and one specific paradigm rather than providing a fully integrated environment with multiple capabilities. This paper introduces SpaceTime Programming, a novel approach that unifies these paradigms to create a programming model for exploring both code modifications and execution flow.
At the core of our approach is a trace mechanism that captures not only execution state but also the corresponding code changes, enabling developers to explore programs in both space (code variants) and time (execution flow). As a proof of concept, we implemented a Python library supporting SpaceTime Programming and applied it in two contexts: a live omniscient debugger and a Pygame game development tool, showcased through a Flappy Bird-like game. We further evaluated SpaceTimePy on five real-world Python projects, finding performance overhead ranging from 35% to 150% on test suites.
Published: 2026-03-19T10:35:55Z
Authors: Jean-Baptiste Döderlein, Djamel Eddine Khelladi, Mathieu Acher, Benoit Combemale

http://arxiv.org/abs/2603.18734v1
Title: Green Architectural Tactics in ML-enabled Systems: An LLM-based Repository Mining Study
Updated: 2026-03-19T10:34:53Z
Abstract: Context: The increasing adoption of machine learning (ML) and artificial intelligence (AI) technologies raises growing concerns about their environmental sustainability. Developing and deploying ML-enabled systems is computationally intensive, particularly during training and inference. Green AI has emerged to address these issues by promoting efficiency without sacrificing accuracy. While prior research has proposed catalogs of sustainable practices (i.e., green tactics), there remains limited understanding of their adoption in practice and whether additional, undocumented tactics exist. Objective: This study aims to investigate the extent to which existing sustainable practices are implemented in real-world ML-enabled systems and to identify previously undocumented practices that support environmental sustainability. Method: We conduct a mining software repository study on 205 open-source ML projects on GitHub. To support our analysis, we design a novel mechanism based on large language models (LLMs) capable of identifying both known and new sustainable practices from code repositories.
Results: Our findings confirm that green tactics reported in the literature are used in practice, although adoption rates vary. Furthermore, our LLM-based approach reveals nine previously undocumented sustainable practices. Each tactic is supported with code examples to aid adoption and integration. Conclusions: We finally provide insights for practitioners seeking to reduce the environmental impact of ML-enabled systems and offer a foundation for future research in automating the detection and adoption of sustainable practices.
Published: 2026-03-19T10:34:53Z
Authors: Vincenzo De Martino, Silverio Martínez-Fernández, Fabio Palomba

http://arxiv.org/abs/2603.18693v1
Title: Cross-Ecosystem Vulnerability Analysis for Python Applications
Updated: 2026-03-19T09:52:29Z
Abstract: Python applications depend on native libraries that may be vendored within package distributions or installed on the host system. When vulnerabilities are discovered in these libraries, determining which Python packages are affected requires cross-ecosystem analysis spanning Python dependency graphs and OS package versions. Current vulnerability scanners produce false negatives by missing vendored vulnerabilities and false positives by ignoring security patches backported by OS distributions.
We present a provenance-aware vulnerability analysis approach that resolves vendored libraries to specific OS package versions or upstream releases. Our approach queries vendored libraries against a database of historical OS package artifacts using content-based hashing, and applies library-specific dynamic analyses to extract version information from binaries built from upstream source. We then construct cross-ecosystem call graphs by stitching together Python and binary call graphs across dependency boundaries, enabling reachability analysis of vulnerable functions. Evaluating on 100,000 Python packages and 10 known CVEs associated with third-party native dependencies, we identify 39 directly vulnerable packages (47M+ monthly downloads) and 312 indirectly vulnerable client packages affected through dependency chains. Our analysis achieves up to 97% false positive reduction compared to upstream version matching.
Published: 2026-03-19T09:52:29Z
Authors: Georgios Alexopoulos, Nikolaos Alexopoulos, Thodoris Sotiropoulos, Charalambos Mitropoulos, Zhendong Su, Dimitris Mitropoulos

http://arxiv.org/abs/2603.18606v1
Title: SQL-Commenter: Aligning Large Language Models for SQL Comment Generation with Direct Preference Optimization
Updated: 2026-03-19T08:23:40Z
Abstract: SQL query comprehension is a significant challenge due to complex syntax, diverse join types, and deep nesting. Many queries lack adequate comments, severely hindering code readability, maintainability, and knowledge transfer. Automated SQL comment generation faces two main challenges: limited datasets that inadequately represent complex real-world queries, and Large Language Models' (LLMs) insufficient understanding of SQL-specific semantics. Our empirical analysis shows that even after continual pre-training and supervised fine-tuning, LLMs struggle with complex SQL semantics, yielding inaccurate comments. To address this, we propose SQL-Commenter, an advanced method based on LLaMA-3.1-8B.
We first construct a comprehensive dataset of complex SQL queries with expert-verified comments. Next, we perform continual pre-training on a large SQL corpus to enhance the LLM's syntax and semantic understanding, followed by supervised fine-tuning. Finally, we introduce Direct Preference Optimization (DPO) using human feedback. SQL-Commenter utilizes a preference-based loss function to favor preferred outputs, enhancing fine-grained semantic learning and context-dependent quality assessment. Evaluated on the Spider and Bird benchmarks, SQL-Commenter significantly outperforms state-of-the-art baselines. On average, it surpasses the strongest baseline (Qwen3-14B) by 9.29, 4.99, and 13.23 percentage points on BLEU-4, METEOR, and ROUGE-L, respectively. Moreover, human evaluation demonstrates the superior quality of comments generated by SQL-Commenter in terms of correctness, completeness, and naturalness.
Published: 2026-03-19T08:23:40Z
Comment: Accepted to ICPC 2026
Authors: Lei Yu, Peng Wang, Jingyuan Zhang, Xin Wang, Jia Xu, Li Yang, Changzhi Deng, Jiajia Ma, Fengjun Zhang

http://arxiv.org/abs/2512.19980v2
Title: Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
Updated: 2026-03-19T08:11:51Z
Abstract: Code language models excel on code intelligence tasks, yet their internal interpretability is underexplored. Existing neuron interpretability techniques from NLP are suboptimal for source code due to programming languages' formal, hierarchical, and executable nature. We empirically investigate code LLMs at the neuron level, localizing language-specific neurons (selectively responsive to one language) and concept layers (feed-forward layers encoding language-agnostic code representations). We analyze Llama-3.1-8B and Qwen2.5-Coder-32B on multilingual inputs in C++, Java, Python, Go, and JavaScript, measuring neuron selectivity and layerwise contributions during generation.
We find (1) neurons specialized for individual languages alongside a universal subset supporting general-purpose generation; and (2) lower layers mainly encode language-specific syntax, while middle layers capture semantic abstractions shared across languages, emerging as concept layers. We demonstrate utility on three tasks: neuron-guided fine-tuning for code generation, clone detection via concept-layer embeddings, and concept-layer-guided transfer for code summarization, each yielding consistent gains in multilingual settings.
Published: 2025-12-23T02:04:13Z
Comment: Accepted by FSE2026
Authors: Zhe Yin, Xiaodong Gu, Beijun Shen