https://arxiv.org/api/YQnzckcY8UffrH8BNMsHPEgdwfo 2026-06-21T13:45:57Z 27359 105 15 http://arxiv.org/abs/2606.16839v1 Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection 2026-06-15T15:17:52Z In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and whether they work out of the box. In this paper, we propose a pipeline to identify relevant studies with LLM screening, extract the tools presented in them, and run them with an LLM-based coding agent. To evaluate the feasibility of our approach, we focus on software log anomaly detection tools. We begin the study by designing a broad search string that yields 3233 hits from Scopus. We request two LLMs to provide an inclusion probability for each title-abstract pair according to the inclusion and exclusion criteria. From the 3233 exported abstracts, this screening reduced the number of included papers to 569, out of which we could download 470. These papers included 206 unique links, and after manual evaluation we determined 83 to be tools. Finally, we ran the LLM-based coding agent on these 83 links and got 24 successfully running tools. As replicating our approach would require roughly only 4 hours of human effort, of which 3 hours were manual PDF downloading, and 12 hours of LLM running time, this demonstrates promising efficiency when utilizing LLMs in rapid reviews. Because practitioner-built tools often lack academic papers, in the future we aim to expand our analysis to tool-hosting platforms such as GitHub and PyPI. In the future, we plan to formalize our workflow as LLM Agent Skills to make our approach easier to adopt. 
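The funnel reported in this abstract can be read as stage-to-stage retention rates. The sketch below is only a worked arithmetic aid built from the counts stated above; the variable and function names are illustrative, and the split into a paper funnel and a link funnel is an assumption made because the unit changes from papers to links partway through.

```python
# Counts are taken from the abstract; the names and the two-funnel split
# are illustrative assumptions, not part of the authors' pipeline.
paper_stages = [
    ("Scopus hits", 3233),
    ("included after LLM screening", 569),
    ("PDFs downloaded", 470),
]
link_stages = [
    ("unique links in papers", 206),
    ("links judged to be tools", 83),
    ("tools run successfully by the agent", 24),
]


def retention(stages):
    """Return (label, count, share of the previous stage) for every stage after the first."""
    rows = []
    for (_, prev_count), (label, count) in zip(stages, stages[1:]):
        rows.append((label, count, count / prev_count))
    return rows


paper_rows = retention(paper_stages)
link_rows = retention(link_stages)

for label, count, share in paper_rows + link_rows:
    print(f"{label}: {count} ({share:.1%} of previous stage)")
```

The two lists are kept separate so that no ratio is computed across the papers-to-links boundary, where a share of the previous stage would not be meaningful.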
2026-06-15T15:17:52Z 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026 Jesse Nyyssölä Hamza Bin Mazhar Alexander Bakhtin Matteo Esposito Nana Reinikainen Yuqing Wang Ying Song Davide Taibi Mika Mäntylä http://arxiv.org/abs/2606.16827v1 No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages 2026-06-15T15:08:55Z Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language based on a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages unsupported by commercial tools like GitHub Copilot. This results in the need for companies to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several solutions to teach LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning exploiting the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. 
To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction model. Such an approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without dealing with the computational cost of instruction fine-tuning. 2026-06-15T15:08:55Z Accepted for publication at IEEE Transactions on Software Engineering Alessandro Giagnorio Alberto Martin-Lopez Gabriele Bavota http://arxiv.org/abs/2511.20709v2 DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents 2026-06-15T14:19:33Z Large language models (LLMs) and LLM-based coding agents are now used to generate code from natural-language specifications, yet ensuring such code is both functionally correct and secure remains a challenge. We present DualGauge, the first fully automated framework for jointly evaluating correctness and security of specification-only code generation, supported by DualGauge-Bench, a language-agnostic benchmark of 307 coding tasks each paired with functional and security tests derived from the same specification. Evaluating 10 representative LLMs across Python, C++, and JavaScript, we find that functional correctness substantially overestimates reliable code generation: even the strongest model remains below 15% joint security-functionality success in every language. Common model-side factors--scale, extended thinking, quantization, instruction tuning, and code specialization--do not reliably improve joint performance, suggesting secure-and-correct code generation does not simply emerge from stronger coding capability. Evaluation of 3 leading agentic coding systems (Codex, OpenHands, and Claude Code) shows that iterative scaffolding provides no advantage over direct (LLM-based) generation on specification-only tasks. 
A qualitative audit reveals failures concentrate at the output contract boundary and in guards that exist but are insufficient--patterns that only joint benchmarking reliably exposes. 2025-11-24T22:26:14Z Rupam Patir Keyan Guo Suvadra Barua Abhijeet Pathak Dinesh Gudimetla Jiawei Guo Hongxin Hu Haipeng Cai http://arxiv.org/abs/2603.13584v2 An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process 2026-06-15T13:28:01Z Deep learning has achieved recognition for its impact within natural sciences, yet the prohibitive financial and technical costs of training models from scratch inhibit adoption. Following software engineering community guidance, natural scientists are reusing pre-trained deep learning models (PTMs) to amortize these costs. While prior works recommend PTM reuse patterns, we present the first empirical study of PTM reuse patterns in the natural sciences, quantifying the utilization and impact of PTM reuse within the scientific process across 17,718 peer-reviewed, open access papers. Our results show that "Biochemistry, Genetics and Molecular Biology" has outpaced other natural scientific fields in PTM reuse, "adaptation" reuse is the most prevalent PTM reuse pattern identified across all natural science fields, and the "testing" stage of the scientific process has been most impacted by PTM integration. 2026-03-13T20:49:02Z 22 pages, 7 figures, 4 tables Nicholas M. Synovic Karolina Ryzka Alessandra V. Vellucci Solari Kenny Lyons James C. Davis George K. Thiruvathukal http://arxiv.org/abs/2606.16692v1 Reference Architecture for Metadata-driven Services to Promote Reusability in Software Systems 2026-06-15T13:27:29Z Service-based Architectures place reusability among their central design goals, yet structural heterogeneity across clients often drives the creation of services with similar functionalities, undermining system evolution and maintainability. 
In this work, we address this issue by focusing on validated architectural artifacts that limit the number of replicated services. We do so by proposing and validating a reference architecture (RA) that employs metadata as the core mechanism to promote service reusability, embracing heterogeneous data. The proposed RA is designed based on a pattern language with the same purpose, and it is evaluated by combining two well-established methods for RA evaluation: scenario-based evaluation and case studies with real-world systems. The triangulation of these methods' results demonstrated that, during the system's evolution, the most common change types the RA incurs are either no change or less impactful ones, like configuration changes or the addition of a pluggable class. 2026-06-15T13:27:29Z João F. L. Daniel Bruno P. Romano Xiaofeng Wang Andrea Janes Eduardo M. Guerra http://arxiv.org/abs/2606.16670v1 Trust by design -- in praise of modularization: a case study 2026-06-15T13:05:15Z Ensuring that collective adaptive systems remain safe, reliable, and trustworthy requires measures that transcend the formal methods established so far, and in particular established verification techniques. In this contribution, we suggest three such measures: (1) conceptual means: runs with locally confined cause and effect of events, (2) temporal-logic-like verification techniques that respect and exploit such runs, (3) composing system properties from properties of components. This contribution presents a case study which particularly focuses on the benefits of modularization for achieving trust by design. Further work will develop a full-fledged theory for the presented ideas. 
2026-06-15T13:05:15Z 11 pages, 11 figures, submitted to ISoLA 2026 Peter Fettke Wolfgang Reisig http://arxiv.org/abs/2606.16650v1 Understanding Automated Web GUI Testing: An Empirical Study Across Exploration Strategies and State Abstractions 2026-06-15T12:40:12Z Automated web GUI testing (AWGT) relies on exploration strategies that exercise web applications through GUI actions to maximize code coverage, spanning traditional model-based, reinforcement learning (RL)-based, and emerging large language model (LLM)-based approaches. State abstraction, which detects pages with the same functionality to avoid repeated testing, has long been recognized as critical to guiding exploration. However, how exploration strategies and state abstractions jointly affect testing effectiveness remains underexplored. We present an empirical study analyzing both factors from the perspectives of code coverage and failure revelation. We compare representative model-based, RL-based, and LLM-based approaches; investigate how six state abstractions influence model-based and RL-based approaches; examine LLM-based approaches under different history representations, which act as a form of state abstraction; and compare the failures exposed by different approaches. Our results show that no single strategy excels across all dimensions; instead, categories exhibit complementary strengths in code coverage, state coverage, and failure discovery. State abstraction is a key factor: strict, fine-grained abstractions favor model-based strategies, while compact ones better support RL-based strategies. History representation substantially affects LLM-based strategies, where concise, functionality-level context performs best. We also find that code coverage is weakly correlated with failure-revealing ability, underscoring the need for multi-dimensional evaluation. These findings offer practical guidance for selecting exploration strategies and designing effective state abstractions for AWGT. 
2026-06-15T12:40:12Z Chenxu Liu Wei Yang Ying Zhang Tao Xie http://arxiv.org/abs/2512.22827v2 FasterPy: An LLM-based Code Execution Efficiency Optimization Framework 2026-06-15T12:03:15Z Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the rise of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we propose FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code. FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at https://github.com/WuYue22/fasterpy. 
2025-12-28T07:43:08Z 38 pages, 5 images, 14 tables, Manuscript revision submitted to a Journal (2026) Yue Wu Minghao Han Ruiyin Li Peng Liang Amjed Tahir Zengyang Li Qiong Feng Mojtaba Shahin http://arxiv.org/abs/2606.14061v2 LLM Agents Can See Code Repositories 2026-06-15T09:45:16Z Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, which differs from how human developers use visual structure such as folder hierarchies and dependency relationships to orient themselves in large codebases. With multimodal large language models (MLLMs), it is an open question whether agents can effectively benefit from visual representations of repositories. This paper presents the first systematic empirical study of visual repository representations for LLM-based agents on repository-level issue resolution. We evaluate four recent multimodal models. Our results show that a strictly vision-only setup degrades accuracy and increases token cost, because agents lack sufficient symbolic detail and compensate with repeated visual queries. In contrast, integrating visual graphs of repository structure as a supplementary modality alongside standard text interfaces helps agents understand structure more efficiently: input token consumption decreases by up to 26% while issue-resolution accuracy is maintained or improved. Visualization is most useful during fault localization and when the agent autonomously controls exploration depth. These findings point to a practical hybrid text-and-vision design for next-generation coding agents. 
2026-06-12T03:14:40Z The paper is not yet completed Dongjian Ma Silin Chen Yufei Yang Yulin Shi Yanfu yan Xiaodong Gu http://arxiv.org/abs/2508.19610v3 The Influence of Code Comments on the Perceived Helpfulness of Stack Overflow Posts 2026-06-15T09:09:09Z Question-and-answer platforms such as Stack Overflow are an important way for software developers to share and retrieve knowledge. However, reusing poorly understood code can lead to serious problems, such as bugs or security vulnerabilities. To better understand how code comments affect the perceived helpfulness of Stack Overflow answers, we conducted an online experiment simulating a Stack Overflow environment (n=91). The results indicate that both block and inline comments are perceived as significantly more helpful than uncommented source code. Moreover, novices rated code snippets with block comments as more helpful than those with inline comments. Interestingly, other surface features, such as the position of an answer and its answer score, were considered less important. Moreover, the content of Stack Overflow has been a major source for training large language models. AI-based coding assistants such as GitHub Copilot, which are based on these models, are changing the way Stack Overflow is used. However, our findings have implications beyond Stack Overflow. First, they may also help to improve the relevance of other community-driven platforms, which provide human advice and explanations of code solutions, complementing AI-based support for software developers. Second, since chat-based AI tools can be prompted to generate code in different ways, knowing which properties influence perceived helpfulness can lead to more targeted prompting strategies to generate readable code snippets. 
2025-08-27T06:45:00Z 32 pages, 7 figures, 2 tables, accepted in Empirical Software Engineering Kathrin Figl Maria Kirchner Sebastian Baltes Michael Felderer http://arxiv.org/abs/2505.13553v3 Towards Functional Correctness of Large Code Models with Selective Generation 2026-06-15T09:04:45Z The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the executable nature of code. Accordingly, we propose a selective code generator that abstains from uncertain generations -- based on the functional correctness evaluated by generated unit tests -- to theoretically control the correctness among non-abstained answers, i.e., the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm FuzzEval. We demonstrate the efficacy of our method along with the controllability of code hallucination and reasonable selection efficiency. 2025-05-19T06:29:16Z ICML 2026 Jaewoo Jeong Taesoo Kim Sangdon Park http://arxiv.org/abs/2606.16364v1 Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents 2026-06-15T07:58:56Z LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside -- the model's attention to labeled tool-definition segments. On real BFCL failures, by per-candidate attention argmax the model attends most to the correct tool 80% of the time (vs. 21% chance), and the gold is the under-attended segment on only 10%: it looks at the right tool and still picks wrong. 
This directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation: the failure is at the decision readout, not the harness, and we pin it there three ways. (1) Input vs. readout: repairing the prompt (reordering or duplicating the gold tool) recovers <=23% of failures, while readout-side interventions recover 59-91%. (2) Representation-invariance: two gold-pointed interventions in different representations -- an additive attention-logit bias and a residual-stream steering vector -- recover largely the same failures (per-task Jaccard 0.865 pooled, 0.79-0.91 per model), so the bottleneck is localized to the readout independent of which representation is poked. (3) A training-free, gold-free selector: per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts pooled function-name selection vs. +17.9-pt oracle headroom) and adds +14.9 pts on Seal-Tools; every model positive (exact McNemar p<=8e-4 each). Scopes differ: the causal attention-bias dose-response is bidirectional and monotonic on 10 mask-honoring models (3-32B), the full 0.5-32B span carrying only the correlational diagnostic; the deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop. 2026-06-15T07:58:56Z 13 pages, 1 figure, 15 tables Shiyang Chen http://arxiv.org/abs/2606.16292v1 AI Supply Chain Galaxy: 3D Visual Analytics for License Compliance 2026-06-15T06:54:33Z The rapid proliferation of machine learning model reuse has transformed the AI ecosystem into a highly interconnected supply chain. Traditional compliance tools and static reports struggle to navigate these massive, multi-hop dependency networks. To address this, we present AI Supply Chain Galaxy (AISCG), an interactive 3D visual analytics system for model provenance and compliance auditing. AISCG maps models into a 3D spatial layout, integrating explicit structural dependencies with a rule-based compliance engine. 
It supports multi-scale exploration, from global community detection to localized, path-aware lineage tracing. We demonstrate its efficacy through an ecosystem-scale empirical analysis of 908,449 models from Hugging Face. Our findings reveal a concerning landscape: 55.46% of models exhibit compliance risks or metadata conflicts/omissions. We also identified distinct risk patterns, including a 56.67% license omission rate in adapter derivations and an 8.05% "license drift" rate in fine-tuning. Through a case study on the complex Llama model family, we show how AISCG empowers analysts to intuitively trace inherited restrictive terms and identify root causes across deep topological networks, significantly reducing the cognitive load of compliance auditing. 2026-06-15T06:54:33Z 15 pages, 6 figures Weiru Han Xuetao Shi Wenyi He Wei Wang Rui Zhao Moming Duan http://arxiv.org/abs/2601.19697v2 AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion 2026-06-15T06:39:19Z Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code in the retrieval process, and the inability of existing retrieval methods to effectively utilize the inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. 
Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely-used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score compared to baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and exhibits high generalizability across various code LLMs and programming languages. 2026-01-27T15:23:14Z To appear at ASE'25 Tianyue Jiang Yanli Wang Yanlin Wang Daya Guo Ensheng Shi Yuchi Ma Jiachi Chen Zibin Zheng http://arxiv.org/abs/2606.16262v1 UXBench: Measuring the Actionability of LLM-Generated UX Critiques 2026-06-15T06:08:39Z Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. 
Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories. 2026-06-15T06:08:39Z 30 pages Wenjie Wang Yue Huang Zipeng Ling Han Bao Hang hua Xiaonan Luo Yu Jiang Shiyi Du Yuexing Hao Xiaomin Li Yuchen Ma Dianzhuo Wang Yanfang Ye Xiangliang Zhang