https://arxiv.org/api/YQnzckcY8UffrH8BNMsHPEgdwfo
2026-06-21T13:45:57Z

http://arxiv.org/abs/2606.16839v1
Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection
2026-06-15T15:17:52Z
In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and whether they work out of the box. In this paper, we propose a pipeline to identify relevant studies with LLM screening, extract the tools presented in them, and run them with an LLM-based coding agent. To evaluate the feasibility of our approach, we focus on software log anomaly detection tools. We begin the study by designing a broad search string that yields 3233 hits from Scopus. We request two LLMs to provide an inclusion probability for each title-abstract pair according to the inclusion and exclusion criteria. From the 3233 exported abstracts, this screening reduced the number of included papers to 569, out of which we could download 470. These papers included 206 unique links, and after manual evaluation we determined 83 to be tools. Finally, we ran the LLM-based coding agent on these 83 links and got 24 successfully running tools. As replicating our approach would require only roughly 4 hours of human effort, of which 3 hours were manual PDF downloading, and 12 hours of LLM running time, this demonstrates promising efficiency when utilizing LLMs in rapid reviews. Because practitioner-built tools often lack academic papers, in the future we aim to expand our analysis to tool-hosting platforms such as GitHub and PyPI.
In the future, we plan to formalize our workflow as LLM Agent Skills to make our approach easier to adopt.
2026-06-15T15:17:52Z
52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026
Jesse Nyyssölä, Hamza Bin Mazhar, Alexander Bakhtin, Matteo Esposito, Nana Reinikainen, Yuqing Wang, Ying Song, Davide Taibi, Mika Mäntylä

http://arxiv.org/abs/2606.16827v1
No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages
2026-06-15T15:08:55Z
Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language based on a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages unsupported by commercial tools like GitHub Copilot. This results in the need for companies to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several solutions to teach LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning exploiting the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions.
To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction model. Such an approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without dealing with the computational cost of instruction fine-tuning.
2026-06-15T15:08:55Z
Accepted for publication at IEEE Transactions on Software Engineering
Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota

http://arxiv.org/abs/2511.20709v2
DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents
2026-06-15T14:19:33Z
Large language models (LLMs) and LLM-based coding agents are now used to generate code from natural-language specifications, yet ensuring such code is both functionally correct and secure remains a challenge. We present DualGauge, the first fully automated framework for jointly evaluating correctness and security of specification-only code generation, supported by DualGauge-Bench, a language-agnostic benchmark of 307 coding tasks each paired with functional and security tests derived from the same specification. Evaluating 10 representative LLMs across Python, C++, and JavaScript, we find that functional correctness substantially overestimates reliable code generation: even the strongest model remains below 15% joint security-functionality success in every language. Common model-side factors--scale, extended thinking, quantization, instruction tuning, and code specialization--do not reliably improve joint performance, suggesting secure-and-correct code generation does not simply emerge from stronger coding capability. Evaluation of 3 leading agentic coding systems (Codex, OpenHands, and Claude Code) shows that iterative scaffolding provides no advantage over direct (LLM-based) generation on specification-only tasks.
A qualitative audit reveals failures concentrate at the output contract boundary and in guards that exist but are insufficient--patterns that only joint benchmarking reliably exposes.
2025-11-24T22:26:14Z
Rupam Patir, Keyan Guo, Suvadra Barua, Abhijeet Pathak, Dinesh Gudimetla, Jiawei Guo, Hongxin Hu, Haipeng Cai

http://arxiv.org/abs/2603.13584v2
An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process
2026-06-15T13:28:01Z
Deep learning has achieved recognition for its impact within natural sciences, yet the prohibitive financial and technical cost of training models from scratch inhibits adoption. Following software engineering community guidance, natural scientists are reusing pre-trained deep learning models (PTMs) to amortize these costs. While prior works recommend PTM reuse patterns, we present the first empirical study of PTM reuse patterns in the natural sciences, quantifying the utilization and impact of PTM reuse within the scientific process across 17,718 peer-reviewed, open access papers. Our results show that "Biochemistry, Genetics and Molecular Biology" has outpaced other natural scientific fields in PTM reuse, "adaptation" reuse is the most prevalent PTM reuse pattern identified across all natural science fields, and the "testing" stage of the scientific process has been most impacted by PTM integration.
2026-03-13T20:49:02Z
22 pages, 7 figures, 4 tables
Nicholas M. Synovic, Karolina Ryzka, Alessandra V. Vellucci Solari, Kenny Lyons, James C. Davis, George K. Thiruvathukal

http://arxiv.org/abs/2606.16692v1
Reference Architecture for Metadata-driven Services to Promote Reusability in Software Systems
2026-06-15T13:27:29Z
Service-based Architectures place reusability among their central design goals, yet structural heterogeneity across clients often drives the creation of services with similar functionalities, undermining system evolution and maintainability.
In this work, we address this issue by focusing on validated architectural artifacts that limit the number of replicated services. We do so by proposing and validating a reference architecture (RA) that employs metadata as the core mechanism to promote service reusability, embracing heterogeneous data. The proposed RA is designed based on a pattern language with the same purpose, and it is evaluated by combining two well-established methods for RA evaluation: scenario-based evaluation and case studies with real-world systems. The triangulation of these methods' results demonstrated that, during the system's evolution, the most common change types the RA incurs are either no change or less impactful ones, like configuration changes or the addition of a pluggable class.
2026-06-15T13:27:29Z
João F. L. Daniel, Bruno P. Romano, Xiaofeng Wang, Andrea Janes, Eduardo M. Guerra

http://arxiv.org/abs/2606.16670v1
Trust by design -- in praise of modularization: a case study
2026-06-15T13:05:15Z
Ensuring that collective adaptive systems remain safe, reliable, and trustworthy requires measures that transcend the formal methods established so far, and in particular established verification techniques. In this contribution, we suggest three such measures: (1) conceptual means: runs with locally confined cause and effect of events, (2) temporal-logic-like verification techniques that respect and exploit such runs, (3) composing system properties from properties of components. This contribution presents a case study which particularly focuses on the benefits of modularization for achieving trust by design.
Further work will develop a full-fledged theory for the presented ideas.
2026-06-15T13:05:15Z
11 pages, 11 figures, submitted to ISoLA 2026
Peter Fettke, Wolfgang Reisig

http://arxiv.org/abs/2606.16650v1
Understanding Automated Web GUI Testing: An Empirical Study Across Exploration Strategies and State Abstractions
2026-06-15T12:40:12Z
Automated web GUI testing (AWGT) relies on exploration strategies that exercise web applications through GUI actions to maximize code coverage, spanning traditional model-based, reinforcement learning (RL)-based, and emerging large language model (LLM)-based approaches. State abstraction, which detects pages with the same functionality to avoid repeated testing, has long been recognized as critical to guiding exploration. However, how exploration strategies and state abstractions jointly affect testing effectiveness remains underexplored.
We present an empirical study analyzing both factors from the perspectives of code coverage and failure revelation. We compare representative model-based, RL-based, and LLM-based approaches; investigate how six state abstractions influence model-based and RL-based approaches; examine LLM-based approaches under different history representations, which act as a form of state abstraction; and compare the failures exposed by different approaches.
Our results show that no single strategy excels across all dimensions; instead, categories exhibit complementary strengths in code coverage, state coverage, and failure discovery. State abstraction is a key factor: strict, fine-grained abstractions favor model-based strategies, while compact ones better support RL-based strategies. History representation substantially affects LLM-based strategies, where concise, functionality-level context performs best. We also find that code coverage is weakly correlated with failure-revealing ability, underscoring the need for multi-dimensional evaluation. These findings offer practical guidance for selecting exploration strategies and designing effective state abstractions for AWGT.
2026-06-15T12:40:12Z
Chenxu Liu, Wei Yang, Ying Zhang, Tao Xie

http://arxiv.org/abs/2512.22827v2
FasterPy: An LLM-based Code Execution Efficiency Optimization Framework
2026-06-15T12:03:15Z
Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the rise of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we propose FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code.
FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at https://github.com/WuYue22/fasterpy.
2025-12-28T07:43:08Z
38 pages, 5 images, 14 tables, Manuscript revision submitted to a Journal (2026)
Yue Wu, Minghao Han, Ruiyin Li, Peng Liang, Amjed Tahir, Zengyang Li, Qiong Feng, Mojtaba Shahin

http://arxiv.org/abs/2606.14061v2
LLM Agents Can See Code Repositories
2026-06-15T09:45:16Z
Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, which differs from how human developers use visual structure such as folder hierarchies and dependency relationships to orient themselves in large codebases. With multimodal large language models (MLLMs), it is an open question whether agents can effectively benefit from visual representations of repositories. This paper presents the first systematic empirical study of visual repository representations for LLM-based agents on repository-level issue resolution. We evaluate four recent multimodal models. Our results show that a strictly vision-only setup degrades accuracy and increases token cost, because agents lack sufficient symbolic detail and compensate with repeated visual queries. In contrast, integrating visual graphs of repository structure as a supplementary modality alongside standard text interfaces helps agents understand structure more efficiently: input token consumption decreases by up to 26% while issue-resolution accuracy is maintained or improved.
Visualization is most useful during fault localization and when the agent autonomously controls exploration depth. These findings point to a practical hybrid text-and-vision design for next-generation coding agents.
2026-06-12T03:14:40Z
The paper is not yet completed
Dongjian Ma, Silin Chen, Yufei Yang, Yulin Shi, Yanfu yan, Xiaodong Gu

http://arxiv.org/abs/2508.19610v3
The Influence of Code Comments on the Perceived Helpfulness of Stack Overflow Posts
2026-06-15T09:09:09Z
Question-and-answer platforms such as Stack Overflow are an important way for software developers to share and retrieve knowledge. However, reusing poorly understood code can lead to serious problems, such as bugs or security vulnerabilities. To better understand how code comments affect the perceived helpfulness of Stack Overflow answers, we conducted an online experiment simulating a Stack Overflow environment (n=91). The results indicate that both block and inline comments are perceived as significantly more helpful than uncommented source code. Moreover, novices rated code snippets with block comments as more helpful than those with inline comments. Interestingly, other surface features, such as the position of an answer and its answer score, were considered less important. Moreover, the content of Stack Overflow has been a major source for training large language models. AI-based coding assistants such as GitHub Copilot, which are based on these models, are changing the way Stack Overflow is used. However, our findings have implications beyond Stack Overflow. First, they may also help to improve the relevance of other community-driven platforms that provide human advice and explanations of code solutions, complementing AI-based support for software developers.
Second, since chat-based AI tools can be prompted to generate code in different ways, knowing which properties influence perceived helpfulness can lead to more targeted prompting strategies to generate readable code snippets.
2025-08-27T06:45:00Z
32 pages, 7 figures, 2 tables, accepted in Empirical Software Engineering
Kathrin Figl, Maria Kirchner, Sebastian Baltes, Michael Felderer

http://arxiv.org/abs/2505.13553v3
Towards Functional Correctness of Large Code Models with Selective Generation
2026-06-15T09:04:45Z
The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the executable nature of code. Accordingly, we propose a selective code generator that abstains from uncertain generations -- based on the functional correctness evaluated by generated unit tests -- to theoretically control the correctness among non-abstained answers, i.e., the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm FuzzEval. We demonstrate the efficacy of our method along with the controllability of code hallucination and reasonable selection efficiency.
2025-05-19T06:29:16Z
ICML 2026
Jaewoo Jeong, Taesoo Kim, Sangdon Park

http://arxiv.org/abs/2606.16364v1
Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents
2026-06-15T07:58:56Z
LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside -- the model's attention to labeled tool-definition segments.
On real BFCL failures, by per-candidate attention argmax the model attends most to the correct tool 80% of the time (vs. 21% chance), and the gold is the under-attended segment on only 10%: it looks at the right tool and still picks wrong. This directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation: the failure is at the decision readout, not the harness, and we pin it there three ways. (1) Input vs. readout: repairing the prompt (reordering or duplicating the gold tool) recovers <=23% of failures, while readout-side interventions recover 59-91%. (2) Representation-invariance: two gold-pointed interventions in different representations -- an additive attention-logit bias and a residual-stream steering vector -- recover largely the same failures (per-task Jaccard 0.865 pooled, 0.79-0.91 per model), so the bottleneck is localized to the readout independent of which representation is poked. (3) A training-free, gold-free selector: per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts pooled function-name selection vs. +17.9-pt oracle headroom) and adds +14.9 pts on Seal-Tools; every model positive (exact McNemar p<=8e-4 each). Scopes differ: the causal attention-bias dose-response is bidirectional and monotonic on 10 mask-honoring models (3-32B), the full 0.5-32B span carrying only the correlational diagnostic; the deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop.
2026-06-15T07:58:56Z
13 pages, 1 figure, 15 tables
Shiyang Chen

http://arxiv.org/abs/2606.16292v1
AI Supply Chain Galaxy: 3D Visual Analytics for License Compliance
2026-06-15T06:54:33Z
The rapid proliferation of machine learning model reuse has transformed the AI ecosystem into a highly interconnected supply chain. Traditional compliance tools and static reports struggle to navigate these massive, multi-hop dependency networks.
To address this, we present AI Supply Chain Galaxy (AISCG), an interactive 3D visual analytics system for model provenance and compliance auditing. AISCG maps models into a 3D spatial layout, integrating explicit structural dependencies with a rule-based compliance engine. It supports multi-scale exploration, from global community detection to localized, path-aware lineage tracing. We demonstrate its efficacy through an ecosystem-scale empirical analysis of 908,449 models from Hugging Face. Our findings reveal a concerning landscape: 55.46% of models exhibit compliance risks or metadata conflicts/omissions. We also identified distinct risk patterns, including a 56.67% license omission rate in adapter derivations and an 8.05% "license drift" rate in fine-tuning. Through a case study on the complex Llama model family, we show how AISCG empowers analysts to intuitively trace inherited restrictive terms and identify root causes across deep topological networks, significantly reducing the cognitive load of compliance auditing.
2026-06-15T06:54:33Z
15 pages, 6 figures
Weiru Han, Xuetao Shi, Wenyi He, Wei Wang, Rui Zhao, Moming Duan

http://arxiv.org/abs/2601.19697v2
AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion
2026-06-15T06:39:19Z
Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code in the retrieval process, and the inability of existing retrieval methods to effectively utilize the inference information.
To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement-learning-based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score compared to baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and exhibits high generalizability across various code LLMs and programming languages.
2026-01-27T15:23:14Z
To appear at ASE'25
Tianyue Jiang, Yanli Wang, Yanlin Wang, Daya Guo, Ensheng Shi, Yuchi Ma, Jiachi Chen, Zibin Zheng

http://arxiv.org/abs/2606.16262v1
UXBench: Measuring the Actionability of LLM-Generated UX Critiques
2026-06-15T06:08:39Z
Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique.
We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.
2026-06-15T06:08:39Z
30 pages
Wenjie Wang, Yue Huang, Zipeng Ling, Han Bao, Hang hua, Xiaonan Luo, Yu Jiang, Shiyi Du, Yuexing Hao, Xiaomin Li, Yuchen Ma, Dianzhuo Wang, Yanfang Ye, Xiangliang Zhang