https://arxiv.org/api/ehVJXZsO4KTvO6FoNmxa3XsAcQk 2026-06-21T18:36:45Z 27359 165 15 http://arxiv.org/abs/2508.15503v7 Guidelines for Empirical Studies in Software Engineering involving Large Language Models 2026-06-12T08:34:46Z Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommendations (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the system and prompt design beyond the model; (4) report session traces, i.e., interaction logs and runtime traces; (5) use suitable baselines, benchmarks, and metrics; (6) include an open LLM as a baseline; (7) validate LLM outputs against human judgment; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org). 
2025-08-21T12:30:30Z 86 pages, 4 tables, accepted in Empirical Software Engineering Sebastian Baltes Florian Angermeir Chetan Arora Marvin Muñoz Barón Chunyang Chen Lukas Böhme Fabio Calefato Neil Ernst Davide Falessi Brian Fitzgerald Davide Fucci Junda He Christoph Treude Marcos Kalinowski Stefano Lambiase Daniel Russo Mircea Lungu Cristina Martinez Montes Lutz Prechelt Paul Ralph Rijnard van Tonder Stefan Wagner http://arxiv.org/abs/2606.14233v1 Evaluating LLMs for Obfuscation Detection and Classification in Android Apps 2026-06-12T08:14:55Z Android applications (apps) developers increasingly rely on code obfuscation techniques to hinder reverse engineering and protect intellectual property. However, obfuscation also reduces the effectiveness of static analysis and vulnerability detection tools, creating challenges for Android security analysis. Existing approaches for detecting obfuscation in Android apps predominantly rely on handcrafted heuristics, engineered features, or task-specific learning pipelines, which may struggle to generalize across evolving obfuscation strategies. This paper presents a large-scale empirical study investigating the capability of Large Language Models (LLMs) to detect obfuscation in Android apps through semantic reasoning. Our study evaluates whether off-the-shelf LLMs can identify obfuscated code without relying on handcrafted rules, predefined signatures, or dedicated model training. The empirical evaluation is conducted on both a controlled benchmark containing an app obfuscated with multiple techniques and a real-world dataset of Android apps collected from Google Play. The study further examines the impact of prompt design, model selection, and decision thresholds across several open-weight and proprietary LLMs. Finally, the analysis compares LLM-based reasoning with existing SAST-based obfuscation-detection approaches and discusses the broader implications and limitations of applying LLMs to Android security analysis. 
2026-06-12T08:14:55Z Luca Ferrari Marco Alecci Jordan Samhi Tegawendé F. Bissyandé Jacques Klein Mariano Ceccato Luca Verderame http://arxiv.org/abs/2604.20462v3 Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark 2026-06-12T07:43:16Z Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines. Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5% of step lines are eliminable. 
2026-04-22T11:44:05Z 28 pages, 2 figures, 4 tables. Submitted to Information and Software Technology (Elsevier). Tool, corpus, labelled benchmark, and rubric released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0 Ali Hassaan Mughal Noor Fatima Muhammad Bilal http://arxiv.org/abs/2507.00481v2 The Influence of HEXACO Personality Traits on the Teamwork Quality in Software Teams -- A Preliminary Research Approach 2026-06-12T07:02:59Z Although software engineering research has focused on optimizing processes and technology, there is a growing recognition that human factors, particularly teamwork, also significantly impact optimization. Recent research suggests that developer personality has a strong influence on teamwork. In fact, personality considerations may have a greater impact on software development than processes and tools. This paper aims to design a study that measures the impact of HEXACO personality traits on the Teamwork Quality (TWQ) of software teams. A preliminary data collection (n=54) was conducted for this purpose. The analysis showed that several personality traits, as well as their composition, had a significant impact on TWQ. Additionally, other variables, such as the proportion of women and age distribution, also affected TWQ. The study's initial results demonstrate the usefulness and validity of the study design. The results also suggest several opportunities to improve teamwork in IT organizations and avenues for further research. 2025-07-01T06:56:48Z Philipp M. Zähl Sabine Theis Martin R. Wolf http://arxiv.org/abs/2606.14164v1 Investigating Metamorphic Fuzz Oracle Enhancement via Large Language Models 2026-06-12T06:43:35Z Fuzz drivers are essential components of greybox fuzzing, as they encapsulate target interfaces, define test spaces, and largely determine fuzzing effectiveness. 
Existing fuzz drivers typically rely on crash-based oracles for security testing, overlooking library functionality and limiting bug detection capability. In this paper, we present the first study on metamorphic-based fuzz oracle enhancement (MFOE), which augments existing fuzz drivers with metamorphic-based oracles derived from metamorphic relations (MRs). Since constructing and integrating such oracles requires substantial domain knowledge, automating MFOE is challenging. To address this challenge, we propose MetaFOE, an LLM-based framework that automatically generates and integrates metamorphic-based oracles. We evaluate MetaFOE on OSS-Fuzz drivers using three modern LLMs and five prompt strategies. MetaFOE generates 3,475 MRs, of which 77.3% are applicable, and implements 12,351 meta drivers, with 6,228 being valid. After three hours of fuzzing, the valid meta drivers improve edge coverage by an average of 18.7% and trigger 1,528 unique crashes. Our results demonstrate both the effectiveness of metamorphic-based oracle enhancement and the feasibility of using LLMs to automate MFOE, providing valuable insights for advancing greybox fuzzing. 2026-06-12T06:43:35Z 28 pages Ruixiang Qian Ding Yang Zengxu Chen Yuxuan Gao Chunrong Fang Chao Zhang Zhenyu Chen http://arxiv.org/abs/2606.14113v1 Simulating Students' Java Programming Errors with Large Language Models 2026-06-12T04:51:49Z Understanding student errors in programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. 
Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (similarity to authentic student mistakes), and examine how these vary with the struggling level of the programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics. 2026-06-12T04:51:49Z Ali Keramati Jie Cao Iman Mohammadi Mark Warschauer Yang Shi http://arxiv.org/abs/2508.17912v2 Citizens' Contentment with e-Government Solutions and Services in Saudi Arabia 2026-06-12T01:08:05Z Governments around the world have worked tirelessly to develop technological solutions, on the one hand to better serve their citizens and, on the other hand, to advance in the United Nations Electronic Government Development Index (EGDI). Thus, it is crucial to assess e-government solutions and services from different aspects. This study evaluates e-government solutions and services based on user expectations in four aspects: general satisfaction, satisfaction with features, trust, and pleasure. 
In this study, a questionnaire was developed to allow the evaluation of e-government solutions and services in Saudi Arabia, and could also be utilized to evaluate e-government in any other nation. The study included 276 valid participants, while the required sample size was calculated using a standard sample size estimation formula for large populations (95% confidence level, 5% margin of error). In addition, descriptive analysis was used to analyze participant responses. The results showed that e-government services in Saudi Arabia achieved a level of citizen contentment that is consistent with its score in the EGDI published in the year 2024. 2025-08-25T11:29:26Z Title, abstract, and keywords updated to align with the final version accepted for publication by Inderscience Publishers. Originally archived under the working title: "Evaluating Citizen Satisfaction with Saudi Arabia's E-Government Services: A Standards-Based, Theory-Informed Approach". 38 pages, 1 figure, 16 tables, journal research paper Mohammed O. Alannsary http://arxiv.org/abs/2606.14805v1 Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces 2026-06-11T22:34:39Z Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re-run the trajectory to measure each event's effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge-based decision-support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool-use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. 
We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero-replay counterfactual-effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high-effect before any replay is performed. BranchPoint-Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning-to-rank gradient-boosted predictor raises per-trace localization (Branch Recall@5) from 0.73 to 0.93 on held-out families at zero oracle-replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. The result is an auditable, cost-efficient decision-support system for AI-reliability debugging, positioned explicitly on the cost-accuracy frontier with reproducible artifacts. 2026-06-11T22:34:39Z 21 pages, 1 figure, 6 tables. Submitted to Knowledge-Based Systems Dong Ho Kang Hyeonjeong Cha Daein Weon http://arxiv.org/abs/2606.13918v1 Bayesian-Calibrated Detection of Hallucinated Package Imports in AI-Assisted Code 2026-06-11T21:15:39Z We present a Bayesian calibration layer for slopsquat detectors -- those that flag hallucinated package imports in code produced by large language models (LLMs). Where existing pipelines emit binary decisions (flag / do-not-flag), our layer emits a Beta-posterior probability per detection, derived from a 3-category epistemic taxonomy that explicitly classifies each prior as empirically calibrated, constructively argued, or engineering-judgement-traced. 
Beyond the primary 200/404 registry channel, the calibrated layer exploits PyPI metadata signals -- package age, release count, author descriptor, summary -- to surface registered-but-suspicious packages that a binary registry detector misses, which is the realistic post-LLM-emission attacker regime. The resulting risk-aware primitive is directly consumable by downstream CI gates and supports principled threshold decisions across detection rules. We evaluate the calibration on a merged corpus of 1,734 Python snippets -- a stratified 189-prompt BigCodeBench slice plus a 100-prompt niche-library stress-test set, generated across a six-model panel spanning four cloud models (Claude-Sonnet-4.6, Mistral-Large, DeepSeek-v4-pro, DeepSeek-R1) and two local open-weight code models (Mistral Codestral, Meta CodeLlama). Against a re-implemented binary baseline inspired by Mahmud et al. -- which shares its registry oracle with our ground truth and therefore serves as a degenerate upper bound rather than a genuine competitor -- the calibrated layer reproduces the strict-registry detections and introduces well-calibrated additional flags on the metadata channel. We assess detector asymmetry with a McNemar paired test and calibration with both a flagged-subset Expected Calibration Error and a strictly proper full-corpus Brier score. 2026-06-11T21:15:39Z 23 pages, 2 figures, 5 tables Lom M. Hillah NewCo Partners, Paris, France Sorbonne Université, CNRS, LIP6, Paris, France Jean-Marc Richard NewCo Partners, Paris, France Ryan Hasnaoui NewCo Partners, Paris, France http://arxiv.org/abs/2606.13882v1 A Principled Framework for Safe Algorithm Updates in Automated Insulin Delivery Systems 2026-06-11T20:18:13Z Background: AID algorithms require ongoing software updates and bug fixes. In co-adapted systems, where users tune settings around existing algorithmic behavior, bug fixes can paradoxically disrupt glycemic control. 
No principled framework evaluates the safety of AID algorithm updates. Methods: Our two-part framework classifies bugs and evaluates the clinical equivalence of AID system software updates. Bugs are classified as factual, heuristic, or computational, each with distinct management strategies. Classifications were validated from porting Trio's oref algorithm from Javascript to a bug-fixed Swift implementation. We compared implementations using shadow execution on 736,480 invocations from eight Trio users. The second component assesses clinical equivalence with error analysis on paired glucose values, applied to both Trio implementations using mechanistic in silico and data-driven replay simulation. Results: In mechanistic in silico simulation, the Swift and Javascript implementations produced nearly identical Time in Range (84.9% vs. 84.9%) and Glycemia Risk Index (23.5% vs. 23.9%), with more than 99% of paired glucose in Parkes Error Grid Zones A and B, meeting our clinical equivalence threshold. Shadow execution showed low mismatch rates in oref components (iob 0.43%, autosens 1.22%, determineBasal 0.07%, meal 0.01%), with clinically meaningful differences in 0.03% of iob invocations. Data-driven replay simulations of bugs revealed more than 99% of downstream paired glucose in Parkes Error Grid Zones A and B, also meeting our clinical equivalence threshold. Conclusions: Our framework integrates bug-fixing principles with multi-method clinical evaluation to assess AID algorithm update safety. It is system-agnostic and applicable to all widely used OS-AID systems, with case studies highlighting the need for systematic remediation of factual and computational bugs. 2026-06-11T20:18:13Z Thomas Screven Ziqiang "Joe" Zhu Deniz Cengiz Rayhan A. Lal Korey K. Hood Samuel T. 
King http://arxiv.org/abs/2606.13804v1 An Empirical Study of Gemini 3 for Detecting Natural Language Test Smells in Manual Test Cases 2026-06-11T18:20:15Z Manual testing, in which testers follow natural language instructions to validate system behavior, remains essential for uncovering issues that are difficult to capture with automation. However, manual test cases often contain test smells, quality issues such as ambiguity, redundancy, or missing checks that reduce reliability, maintainability, and reproducibility. Existing detection approaches largely depend on manually engineered rules and thus struggle to generalize and scale across heterogeneous test suites. In our previous work, we assessed the feasibility of using Small Language Models (SLMs) for test smell detection by evaluating GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B on test steps from 143 real-world Ubuntu test cases, covering seven smell types. PHI-4-14B achieved the best performance. In this article, we investigate whether a contemporary Large Language Model (GEMINI-3-PRO-PREVIEW) available at the time of the study can identify test smells in natural language manual test cases using a prompt-based, whole-test-case analysis strategy. Unlike approaches that analyze individual test steps in isolation, our approach evaluates complete test cases, enabling the model to consider relationships and dependencies among test steps. We evaluate the approach on 100 Ubuntu test cases covering seven test smell types and compare its performance against previously evaluated SLMs, including GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B. Our results show that GEMINI-3-PRO-PREVIEW outperforms the SLMs, while producing actionable explanations that can help practitioners revise manual test cases for greater clarity and consistency. 
We also find that test smells are pervasive in practice, with nearly one detected test smell per step on average, highlighting the need for scalable and automated quality support for manual testing artifacts. 2026-06-11T18:20:15Z Keila Lucas Rohit Gheyi Márcio Ribeiro Fabio Palomba Luana Martins Elvys Soares http://arxiv.org/abs/2606.13802v1 A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets 2026-06-11T18:16:37Z Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context. 2026-06-11T18:16:37Z Accepted at ICML 2026. Code and benchmark: https://github.com/Tej-55/NAPE Tejas Agrawal Vu Le Sumit Gulwani Gust Verbruggen http://arxiv.org/abs/2403.17382v4 How Quickly Do Development Teams Update Their Vulnerable Dependencies? 
2026-06-11T18:08:46Z Industry practitioners are increasingly concerned with software that contains vulnerable versions of third-party dependencies that are included both directly and transitively. To address this problem, projects are encouraged to both (a) quickly update to non-vulnerable versions of dependencies and (b) be mindful of the update practices of the dependencies they choose to use. To this end, researchers have proposed metrics to measure the responsiveness of the development teams of the packages in keeping their dependencies updated: Mean-Time-To-Update (MTTU) and Mean-Time-To-Remediate (MTTR). While MTTU covers all dependencies, MTTR quantifies the time needed for a package to update its vulnerable dependencies. However, existing metrics fail to capture important nuances, such as considering floating versions and prioritizing recent updates, leading to inaccurate reflections of a development team's update practices. The goal of this study is to aid practitioners in understanding how quickly packages update their dependencies. We propose two novel metrics, Mean-Time-To-Update for dependencies (MTTU) and Mean-Time-To-Remediate for vulnerable dependencies (MTTR), that overcome the limitations of existing metrics. We conduct an empirical study using 163,207 packages in npm (117,129), PyPI (42,777), and Cargo (3,301) and characterize how the ecosystems differ in MTTU and MTTR, as well as what package characteristics influence MTTU and MTTR. We found that most packages have a relatively fast dependency update practice. We further study whether MTTU can be used as a proxy for MTTR when sufficient vulnerability data is not available. As we did not find enough statistical evidence for a strong proxy, our findings suggest that MTTU could only be partially used (may be used but with caution) as a proxy for MTTR when vulnerability data is not available. 
2024-03-26T05:01:53Z under review Imranur Rahman Ranindya Paramitha Nusrat Zahan William Enck Laurie Williams http://arxiv.org/abs/2606.13763v1 Do programming languages still matter to your AI coding agent teammate? Evidence at scale from chess engines 2026-06-11T17:34:00Z Frontier coding agents now promise end-to-end authorship of complete software systems. Two empirical questions follow: can AI coding-agent teammates program in any target language, including ones with no comparable prior open-source artefact? If so, does language choice still shape the artefact, and along which dimensions? We study both through a polyglot case study built around chess engines: non-trivial multi-component systems that admit a hierarchy of language-agnostic oracles, from exact move-generation correctness to a strength scale (Elo), observable from Rust to Brainfuck. We prompted two frontier agents (Claude Code and Codex) at the capability level, without chess knowledge or implementation guidance, under a documented intervention and stopping policy. The agents produced 34 chess engines spanning 17 primary programming languages, from mainstream to specialised, domain-specific, legacy, and esoteric targets. We combine per-engine feature analysis, independent Elo assessment, and session trajectories with qualitative analysis of code and transcripts. Frontier coding agents are genuinely polyglot: every language we tried produced at least one feature-rich working engine, several with no prior open-source counterpart of comparable scope (e.g., LaTeX), and the code is synthesised from scratch rather than copied. Yet language choice still matters: strong playing strength is only reachable in mainstream compiled languages, cost and engineering effort grow sharply as the language becomes more exotic, and feature choices shift across language families. Agents validate their own work unprompted, but their strength self-estimates are biased and a few engines cheated by calling a chess library. 
Programming language is no longer about whether AI teammates can build a working system, but about performance, cost, what gets built, and how much human supervision validation still needs. 2026-06-11T17:34:00Z Mathieu Acher Jean-Marc Jézéquel http://arxiv.org/abs/2602.18545v2 Programmable Property-Based Testing 2026-06-11T15:52:44Z Property-based testing (PBT) is a popular technique for establishing confidence in software, where users write properties -- i.e., executable specifications -- that can be checked many times in a loop by a testing framework. In modern PBT frameworks, properties are usually written in shallowly embedded domain-specific languages, and their definition is tightly coupled to the way they are tested. Such frameworks often provide convenient configuration options to customize aspects of the testing process, but users are limited to precisely what library authors had the prescience to allow for when developing the framework; if they want more flexibility, they may need to write a new framework from scratch. We propose a new, deeper language for properties based on a mixed embedding that we call deferred binding abstract syntax, which reifies properties as a data structure and decouples them from the property runners that execute them. We implement this language in Rocq and Racket, leveraging the power of dependent and dynamic types, respectively. Finally, we showcase the flexibility of this new approach by rapidly prototyping a variety of property runners, highlighting domain-specific testing improvements that can be unlocked by more programmable testing. 2026-02-20T16:52:03Z Alperen Keles Justine Frank Ceren Mert Harrison Goldstein Leonidas Lampropoulos