http://arxiv.org/abs/2511.12576v2 Can Small GenAI Language Models Rival Large Language Models in Understanding Application Behavior? 2026-06-11T15:50:40Z

Generative AI (GenAI) models, particularly large language models (LLMs), have transformed multiple domains, including natural language processing, software analysis, and code understanding. Their ability to analyze and generate code has enabled applications such as source code summarization, behavior analysis, and malware detection. In this study, we systematically evaluate the capabilities of both small and large GenAI language models in understanding application behavior, with a particular focus on malware detection as a representative task. While larger models generally achieve higher overall accuracy, our experiments show that small GenAI models maintain competitive precision and recall, offering substantial advantages in computational efficiency, faster inference, and deployment in resource-constrained environments. We provide a detailed comparison across metrics such as accuracy, precision, recall, and F1-score, highlighting each model's strengths, limitations, and operational feasibility. Our findings demonstrate that small GenAI models can effectively complement large ones, providing a practical balance between performance and resource efficiency in real-world application behavior analysis.
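
For reference, the four metrics named in the abstract can be derived from a binary confusion matrix as in the minimal sketch below. The counts are hypothetical and are not taken from the paper; they only illustrate how a malware-versus-benign classifier would be scored.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion-matrix counts (malware is the positive class)."""
    tp: int  # malware flagged as malware
    fp: int  # benign flagged as malware
    tn: int  # benign flagged as benign
    fn: int  # malware flagged as benign


def classification_metrics(c: Confusion) -> dict[str, float]:
    """Return accuracy, precision, recall, and F1, guarding against empty denominators."""
    total = c.tp + c.fp + c.tn + c.fn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
example = Confusion(tp=80, fp=10, tn=95, fn=15)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Two models can have similar precision and recall while differing in accuracy when the class balance is skewed, which is why the abstract reports all four.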

2025-11-16T12:38:28Z Mohammad Meymani Hamed Jelodar Parisa Hamedi Roozbeh Razavi-Far Ali A. Ghorbani http://arxiv.org/abs/2603.27002v2 Etna: An Evaluation Platform for Property-Based Testing 2026-06-11T15:45:32Z

Property-based testing is a mainstay of functional programming, boasting a rich literature, an enthusiastic user community, and an abundance of tools -- so many, indeed, that new users may have difficulty choosing. Moreover, any given framework may support a variety of strategies for generating test inputs; even experienced users may wonder which are better in any given situation. Sadly, the PBT literature, though long on creativity, is short on rigorous comparisons to help answer such questions. We present ETNA, a platform for empirical evaluation and comparison of PBT techniques. ETNA incorporates a number of popular PBT frameworks and testing workloads from the literature, and its extensible architecture makes adding new ones easy, while handling the technical drudgery of performance measurement. To illustrate its benefits, we use ETNA to carry out several experiments with popular PBT approaches in Rocq, Haskell, OCaml, Racket, and Rust, allowing users to more clearly understand best practices and tradeoffs.

2026-03-27T21:24:28Z Alperen Keles Jessica Shi Nikhil Kamath Tin Nam Liu Ceren Mert Harrison Goldstein Benjamin C. Pierce Leonidas Lampropoulos http://arxiv.org/abs/2602.07698v2 On Sequence-to-Sequence Models for Automated Log Parsing 2026-06-11T15:32:14Z

Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. Objectives: This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. Methods: We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Results: Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Conclusion: Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
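
The evaluation metric can be sketched as follows. The abstract does not say how the edit distance is normalised, so dividing by the length of the longer string is an assumption here. The last helper shows that the 23.4% figure is consistent with a relative reduction measured against Mamba's mean (0.145 versus 0.111); that reading is inferred from the reported means and is not stated in the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance divided by the longer length (assumed normalisation)."""
    longest = max(len(predicted), len(reference))
    return levenshtein(predicted, reference) / longest if longest else 0.0


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error of `baseline`."""
    return (baseline - improved) / baseline


print(levenshtein("kitten", "sitting"))
print(f"{relative_reduction(0.145, 0.111):.1%}")
```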

2026-02-07T20:47:45Z Added a comparison with large language models Adam Sorrenti Andriy Miranskyy http://arxiv.org/abs/2606.13468v1 Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset 2026-06-11T15:19:36Z

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources, since human reviews, verifications, test runs, and validations are spent on fixes that are ultimately discarded. Our goal in this paper is to understand the failure modes of AI agents, an understanding that is crucial for better integrating AI agents as efficient teammates. We conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by these agents, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons for rejecting AI-agent fixes, divided into four high-level categories. We observe that developers reject fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results also suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

2026-06-11T15:19:36Z 5 pages, 2 figures, MSR '26: Proceedings of the 23rd International Conference on Mining Software Repositories, April 2026, Rio de Janeiro, Brazil Mahmoud Abujadallah Ali Arabat Mohammed Sayagh 10.1145/3793302.3793592 http://arxiv.org/abs/2606.13449v1 Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests 2026-06-11T15:09:32Z

AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (aka Instructions-as-Code).

2026-06-11T15:09:32Z 5 pages, 8 figures, 23rd International Conference on Mining Software Repositories, April 13--14, 2026 Ali Arabat Mohammed Sayagh 10.1145/3793302.3793601 http://arxiv.org/abs/2606.14796v1 Faster Code, Deeper Debt? A Multivocal Literature Review on Technical Debt and Its Early Signs in LLM-Assisted Software Development 2026-06-11T14:44:30Z

With the rapid adoption of LLM-assisted coding, the need to manage the technical debt these systems introduce has become urgent. In this paper, we conduct a multivocal literature review of 104 sources (31 formal, 73 grey) to examine how LLM-assisted development contributes to technical debt and what strategies, metrics, and benchmarks exist to mitigate it. We find that LLMs often amplify traditional forms of technical debt, particularly code, design, and documentation debts, while also introducing new LLM-specific debts. Notably, we identify fast-integration debt, where rapidly generated code prioritizes speed over quality, triggering a domino effect that leads to governance debt and increased long-term maintenance costs. Additional emerging categories include prompt, ethical, data, and provenance debt, reflecting new challenges unique to LLM adoption. To address these, strategies suggested in the literature include human-in-the-loop frameworks, prompt engineering, and data quality alignment. In practice, tools such as SonarQube are commonly used to detect technical debt indicators, while research prototypes such as CodeSmellEval are emerging to assess how LLMs contribute to debts. However, no standardized benchmarks or LLM-specific metrics yet exist, leaving an important gap. Based on these findings, we outline insights and future directions to ensure reliable integration of LLMs into software engineering workflows.

2026-06-11T14:44:30Z Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026 Ramtin Ehsani Shriya Rawal Yuanfang Cai Preetha Chatterjee 10.1145/3820165 http://arxiv.org/abs/2606.13298v1 Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories 2026-06-11T12:50:36Z

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.
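
The denominator effect described above can be illustrated with a small worked example. The counts below are hypothetical, and defining density as smells per 1,000 lines of code is an assumption. The naive ratio computed here (about -10%) is not expected to reproduce the paper's 6.7% estimate, which comes from a difference-in-differences estimator over a repository panel; the sketch only shows why a density can fall while the raw count barely moves.

```python
def density(smells: int, lines_of_code: int) -> float:
    """Smells per 1,000 lines of code (assumed definition of density)."""
    return smells / (lines_of_code / 1000)


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`."""
    return (after - before) / before


# Hypothetical repository: smell count nearly flat, code base grows.
smells_before, loc_before = 1000, 100_000
smells_after, loc_after = 1011, 112_800   # +1.1% smells, +12.8% LOC

count_change = relative_change(smells_before, smells_after)
density_change = relative_change(density(smells_before, loc_before),
                                 density(smells_after, loc_after))

print(f"raw smell count change: {count_change:+.1%}")
print(f"smell density change:   {density_change:+.1%}")
```

Reporting the numerator and denominator separately, as the abstract recommends, makes it visible that the density moved because the code base grew.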

2026-06-11T12:50:36Z 16 pages. Accepted for presentation at the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026, Krakow, Poland, 2-4 September 2026, and for publication in the Springer LNCS proceedings. This is the author's accepted manuscript Oliver Aleksander Larsen Mahyar T. Moghaddam http://arxiv.org/abs/2511.02430v3 Efficient Solvers for SLOPE in R, Python, Julia, and C++ 2026-06-11T12:15:34Z

We present a suite of packages in R, Python, Julia, and C++ that efficiently solve the Sorted L-One Penalized Estimation (SLOPE) problem. The packages feature a highly efficient hybrid coordinate descent algorithm that fits generalized linear models (GLMs) and supports a variety of loss functions, including Gaussian, binomial, Poisson, and multinomial logistic regression. Our implementation is designed to be fast, memory-efficient, and flexible. The packages support a variety of data structures (dense, sparse, and out-of-memory matrices) and are designed to efficiently fit the full SLOPE path as well as handle cross-validation of SLOPE models, including the relaxed SLOPE. We present examples of how to use the packages and benchmarks that demonstrate the performance of the packages on both real and simulated data and show that our packages outperform existing implementations of SLOPE in terms of speed.
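
For readers unfamiliar with SLOPE, the penalty it adds to the loss is the sorted L1 norm: the coefficient magnitudes are sorted in decreasing order and paired with a non-increasing sequence of weights. The sketch below computes only that penalty in plain Python; it does not use or reproduce the API of the packages described in the abstract, and the function name is illustrative.

```python
from typing import Sequence


def sorted_l1_norm(beta: Sequence[float], lambdas: Sequence[float]) -> float:
    """Sorted L1 norm: sum_i lambdas[i] * |beta|_(i).

    |beta|_(1) >= |beta|_(2) >= ... are the magnitudes in decreasing order,
    and lambdas must be non-negative and non-increasing.
    """
    if len(beta) != len(lambdas):
        raise ValueError("beta and lambdas must have the same length")
    if any(lam < 0 for lam in lambdas):
        raise ValueError("lambdas must be non-negative")
    if any(lambdas[i] < lambdas[i + 1] for i in range(len(lambdas) - 1)):
        raise ValueError("lambdas must be non-increasing")
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(lam * mag for lam, mag in zip(lambdas, magnitudes))


beta = [0.5, -2.0, 1.0]
print(sorted_l1_norm(beta, [3.0, 2.0, 1.0]))  # largest magnitude gets the largest weight
print(sorted_l1_norm(beta, [1.0, 1.0, 1.0]))  # constant weights give the ordinary L1 norm
```

With constant weights the penalty reduces to the lasso penalty, which is one way to sanity-check an implementation.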

2025-11-04T10:03:15Z 30 pages, 8 figures Johan Larsson Malgorzata Bogdan Krystyna Grzesiak Mathurin Massias Jonas Wallin http://arxiv.org/abs/2605.22092v2 Astragalus: Automatic Configuration Repair for Production Networks 2026-06-11T12:14:26Z

Network configurations are prone to errors, which can lead to catastrophic service outages. A tool that can achieve automatic configuration repair (ACR) is highly desired by operators. Existing tools for ACR follow a semantics-driven approach: they model network semantics as a set of SMT constraints, and solve them for a location or fix of the error. Due to the complex semantics of networks, constructing and solving these constraints can be prohibitively expensive, making these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, i.e., a syntax-driven approach, which generates and validates syntactically valid candidate updates without modeling program semantics, often drawing on existing code in the same repository. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It uses multiple iterations of a "localize-fix-validate" pipeline to search for repairs, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in multiple sizes of a synthesized network, and 97.5% of the incidents on a real network, both with 15 types of errors injected, within an average time of 6.93 seconds. It has also provided valid repairs in under 6 minutes for 7 recent network incidents or undesired changes, in a real production network with O(1,000) to O(10,000) devices.

2026-05-21T07:32:22Z 13 pages body, 14 pages total Zhenrong Gu Peng Zhang Xing Feng Xu Liu http://arxiv.org/abs/2606.13239v1 ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm 2026-06-11T11:53:32Z

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to an external CAD benchmark.

2026-06-11T11:53:32Z Jiaxin Ai Tao Hu Xuemeng Yang Shu Zou Hairong Zhang Daocheng Fu Yu Yang Hongbin Zhou Nianchen Deng Pinlong Cai Zhongyuan Wang Botian Shi Kaipeng Zhang Licheng Wen http://arxiv.org/abs/2605.17062v2 The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort 2026-06-11T11:43:49Z

Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting -- the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) -- an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically; following coordinated disclosure with PyPI Security and Socket.dev, 53 of these (41 on PyPI, 12 on npm) remain registrable by an attacker after each registry's existing defenses, constituting a model-agnostic supply-chain attack surface that no single-model study can reveal. We further document a Python-over-JavaScript hallucination asymmetry that inverts Spracklen's 2024 finding, identify a Haiku-below-Sonnet inversion within the Anthropic family, and observe a Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) suggestive of shared training-data origins.
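
Two of the set computations mentioned above are simple to state in code: the Jaccard similarity between two models' hallucinated-name sets, and the names invented by every model. The sketch below uses placeholder model and package names that are purely hypothetical; it is not the replication package's code.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_by_all(per_model: dict[str, set[str]]) -> set[str]:
    """Names that every model in the mapping hallucinated."""
    if not per_model:
        return set()
    return set.intersection(*per_model.values())


# Placeholder data: model and package names are invented for illustration.
hallucinated = {
    "model_a": {"pkg-alpha", "pkg-beta", "pkg-gamma"},
    "model_b": {"pkg-beta", "pkg-gamma", "pkg-delta"},
    "model_c": {"pkg-gamma", "pkg-epsilon"},
}

print(jaccard(hallucinated["model_a"], hallucinated["model_b"]))
print(shared_by_all(hallucinated))
```

The intersection across all models is what a single-model study cannot observe, which is the point the abstract makes about the model-agnostic attack surface.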

2026-05-16T16:08:52Z 13 pages, 3 figures, 4 tables. v2: incorporates coordinated-disclosure feedback from PyPI Security and Socket.dev; registrable attack surface refined to 53 names (41 PyPI, 12 npm). Headline rates unchanged. Replication of Spracklen et al. (USENIX Security 2025). Data and code: https://github.com/churik5/slopsquatting-replication-2026 and https://doi.org/10.5281/zenodo.19859120 Aleksandr Churilov Independent Researcher http://arxiv.org/abs/2606.13175v1 The End of Code Review: Coding Agents Supersede Human Inspection 2026-06-11T10:43:48Z

Code review has been the primary quality gate in software development since Fagan formalised code inspection in 1976. For five decades, having a human examine and comment on a colleague's changes before merge has been a cornerstone practice at organisations of every size. Coding agents are large language model (LLM)-based autonomous systems capable of reading, writing, testing, and repairing software. We argue that coding agents have crossed a threshold of capability at which traditional human code review is no longer a necessary component of a software quality pipeline. Our argument rests on two claims: every stated goal of code review can be served by agents at lower cost and higher throughput; the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

2026-06-11T10:43:48Z Martin Monperrus http://arxiv.org/abs/2512.06242v2 Reasoning about concurrent loops and recursion with rely-guarantee rules 2026-06-11T08:36:54Z

The objective of this paper is to present general, mechanically verified, refinement rules for reasoning about recursive programs and while loops in the context of concurrency. We make use of the rely-guarantee approach to concurrency that facilitates reasoning about interference from concurrent threads in a compositional manner. Recursive programs can be defined as fixed points over a lattice of commands and hence we develop laws for reasoning about fixed points. Loops can be defined in terms of fixed points and hence the laws for recursion can be applied to develop laws for loops. Unlike many approaches to concurrency, we do not assume that expression evaluation is atomic.

2025-12-06T01:57:42Z 24 pages, 1 figure Ian J. Hayes Larissa A. Meinicke Cliff B. Jones http://arxiv.org/abs/2605.14568v2 Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines 2026-06-11T08:34:48Z

Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
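
The window-mining step in the Method can be sketched as below. Each scenario is assumed to be already reduced to a list of cluster identifiers (one per step), so paraphrases of the same step share an integer. Counting a slice at most once per scenario and treating "recurring" as appearing in at least two scenarios are assumptions, and the three counting scopes from the paper are not modelled.

```python
from collections import Counter
from typing import Iterator, Sequence


def windows(steps: Sequence[int], min_len: int = 2,
            max_len: int = 18) -> Iterator[tuple[int, ...]]:
    """Yield every contiguous window of min_len..max_len step identifiers."""
    for length in range(min_len, min(max_len, len(steps)) + 1):
        for start in range(len(steps) - length + 1):
            yield tuple(steps[start:start + length])


def recurring_slices(scenarios: Sequence[Sequence[int]], min_len: int = 2,
                     max_len: int = 18) -> dict[tuple[int, ...], int]:
    """Map each slice to the number of scenarios containing it, keeping only repeats."""
    counts: Counter[tuple[int, ...]] = Counter()
    for steps in scenarios:
        # set() ensures a slice is counted once per scenario (an assumption).
        counts.update(set(windows(steps, min_len, max_len)))
    return {slice_: n for slice_, n in counts.items() if n >= 2}


# Hypothetical cluster identifiers for three scenarios.
scenarios = [[1, 2, 3, 4], [1, 2, 3, 9], [7, 1, 2]]
recurring = recurring_slices(scenarios)
for slice_, n in sorted(recurring.items()):
    print(slice_, n)
```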

2026-05-14T08:38:04Z 31 pages, 10 figures, 6 tables, 56 references. v2: retitled; reference list fully corrected and verified; decision-threshold sensitivity analysis and imbalance-robust baseline metrics added; figures restyled. Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359 Ali Hassaan Mughal Noor Fatima Muhammad Bilal http://arxiv.org/abs/2606.13037v1 DIG: Oracle-Guided Directed Input Generation for One-Day Vulnerabilities 2026-06-11T08:18:21Z

One-day vulnerabilities pose significant risks due to delayed or incomplete patch adoption. Generating proof-of-concept (PoC) inputs is therefore essential for assessing real-world impact. The key challenge is identifying necessary constraints for triggering the vulnerability and solving them effectively. Existing directed fuzzing approaches prioritize inputs toward target locations, but neither explicitly identify necessary constraints nor solve them effectively, relying instead on target-distance feedback and random mutation. Agentic approaches show strong potential through code reasoning and structured input generation, but goal drift in long-horizon reasoning limits their effectiveness. DIG addresses this challenge by exploiting a key property of one-day vulnerabilities: patches often reveal necessary preconditions for triggering. DIG uses an LLM to analyze the patch and synthesize an oracle making these conditions explicit. The oracle supports effective PoC generation at two levels. At the high level, DIG performs oracle-guided generator evolution, where an agent infers and solves constraints to satisfy the oracle. At the low level, DIG instruments the oracle into the target program and uses branch-distance feedback to guide random mutation in directed fuzzing. Evaluation shows DIG outperforms 2 state-of-the-art agents and 10 fuzzers across 138 real-world CVEs. DIG triggers 80 vulnerabilities, surpassing prior results and outperforming the best baseline by 40% (57 vs. 80 CVEs). Notably, DIG exclusively triggers 9 vulnerabilities no existing technique can trigger. Compared to the average of other tools, DIG triggers vulnerabilities faster in 92.9% of cases, achieving over 100x speedup in 48.8% of cases, with a maximum speedup of 3,664x. Beyond one-day PoC generation, DIG uncovers 6 previously unknown vulnerabilities in widely deployed libraries, enabling zero-day discovery.
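
The low-level idea of using branch-distance feedback to guide random mutation can be illustrated with a toy example. The sketch assumes a hypothetical oracle condition of the form `value == target` and uses the standard branch distance for equality, `|value - target|`; the abstract does not describe DIG's actual instrumentation or mutation operators, so this is not its implementation.

```python
import random
from typing import Optional


def branch_distance_eq(left: int, right: int) -> int:
    """Branch distance for `left == right`: zero exactly when the branch is taken."""
    return abs(left - right)


def guided_mutation(target: int, start: int = 0, seed: int = 0,
                    max_steps: int = 10_000) -> Optional[int]:
    """Randomly mutate an integer input, keeping only mutations that shrink the distance."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    current = start
    best = branch_distance_eq(current, target)
    for _ in range(max_steps):
        if best == 0:
            return current             # oracle condition satisfied
        candidate = current + rng.choice((-16, -4, -1, 1, 4, 16))
        distance = branch_distance_eq(candidate, target)
        if distance < best:            # feedback: accept only improvements
            current, best = candidate, distance
    return current if best == 0 else None


# Hypothetical oracle: the vulnerable branch needs an input equal to 1234.
print(guided_mutation(target=1234))
```

Without the distance signal, a mutation that lands on 1233 and one that lands on 0 look equally unsuccessful; the feedback is what turns random mutation into a directed search.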

2026-06-11T08:18:21Z Andrew Bao University of Minnesota, Twin Cities Haochen Zeng University of California, Riverside Peng Chen Independent Researcher Stephen McCamant University of Minnesota, Twin Cities Pen-Chung Yew University of Minnesota, Twin Cities