https://arxiv.org/api/ehVJXZsO4KTvO6FoNmxa3XsAcQk
2026-06-21T18:36:45Z
2735916515

http://arxiv.org/abs/2508.15503v7
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
2026-06-12T08:34:46Z
Large Language Models (LLMs) are widely used in software engineering (SE) research and practice, yet their non-determinism, opaque training data, and rapidly evolving models threaten the reproducibility and replicability of empirical studies. We address this challenge through a collaborative effort of 22 researchers, presenting a taxonomy of seven study types that organizes how LLMs are used in SE research, together with eight guidelines for designing and reporting such studies. Each guideline distinguishes requirements (must) from recommendations (should) and is contextualized by the study types it applies to. Our guidelines recommend that researchers: (1) declare LLM usage and role; (2) report model versions, configurations, and customizations; (3) document the system and prompt design beyond the model; (4) report session traces, i.e., interaction logs and runtime traces; (5) use suitable baselines, benchmarks, and metrics; (6) include an open LLM as a baseline; (7) validate LLM outputs against human judgment; and (8) articulate limitations and mitigations. We complement the guidelines with an applicability matrix mapping guidelines to study types and a reporting checklist for authors and reviewers. 
We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).
2025-08-21T12:30:30Z
86 pages, 4 tables, accepted in Empirical Software Engineering
Sebastian Baltes, Florian Angermeir, Chetan Arora, Marvin Muñoz Barón, Chunyang Chen, Lukas Böhme, Fabio Calefato, Neil Ernst, Davide Falessi, Brian Fitzgerald, Davide Fucci, Junda He, Christoph Treude, Marcos Kalinowski, Stefano Lambiase, Daniel Russo, Mircea Lungu, Cristina Martinez Montes, Lutz Prechelt, Paul Ralph, Rijnard van Tonder, Stefan Wagner

http://arxiv.org/abs/2606.14233v1
Evaluating LLMs for Obfuscation Detection and Classification in Android Apps
2026-06-12T08:14:55Z
Android application (app) developers increasingly rely on code obfuscation techniques to hinder reverse engineering and protect intellectual property. However, obfuscation also reduces the effectiveness of static analysis and vulnerability detection tools, creating challenges for Android security analysis. Existing approaches for detecting obfuscation in Android apps predominantly rely on handcrafted heuristics, engineered features, or task-specific learning pipelines, which may struggle to generalize across evolving obfuscation strategies.
This paper presents a large-scale empirical study investigating the capability of Large Language Models (LLMs) to detect obfuscation in Android apps through semantic reasoning. Our study evaluates whether off-the-shelf LLMs can identify obfuscated code without relying on handcrafted rules, predefined signatures, or dedicated model training. The empirical evaluation is conducted on both a controlled benchmark containing an app obfuscated with multiple techniques and a real-world dataset of Android apps collected from Google Play. The study further examines the impact of prompt design, model selection, and decision thresholds across several open-weight and proprietary LLMs. Finally, the analysis compares LLM-based reasoning with existing SAST-based obfuscation-detection approaches and discusses the broader implications and limitations of applying LLMs to Android security analysis.
2026-06-12T08:14:55Z
Luca Ferrari, Marco Alecci, Jordan Samhi, Tegawendé F. Bissyandé, Jacques Klein, Mariano Ceccato, Luca Verderame

http://arxiv.org/abs/2604.20462v3
Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark
2026-06-12T07:43:16Z
Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication
with documented maintenance cost. Prior detectors either require runnable tests or are
single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public
benchmark to calibrate it.
Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a
labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a
consolidation-savings model linking clusters to ISO/IEC 25010 maintainability
sub-characteristics.
Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616
Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein,
sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually
labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report
precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free
relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines.
Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman
rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches
F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a
disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings
model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5%
of step lines are eliminable.
2026-04-22T11:44:05Z
28 pages, 2 figures, 4 tables. Submitted to Information and Software Technology (Elsevier). Tool, corpus, labelled benchmark, and rubric released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0
Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

http://arxiv.org/abs/2507.00481v2
The Influence of HEXACO Personality Traits on the Teamwork Quality in Software Teams -- A Preliminary Research Approach
2026-06-12T07:02:59Z
Although software engineering research has focused on optimizing processes and technology, there is a growing recognition that human factors, particularly teamwork, also significantly impact optimization. Recent research suggests that developer personality has a strong influence on teamwork. In fact, personality considerations may have a greater impact on software development than processes and tools. This paper aims to design a study that measures the impact of HEXACO personality traits on the Teamwork Quality (TWQ) of software teams. A preliminary data collection (n=54) was conducted for this purpose. The analysis showed that several personality traits, as well as their composition, had a significant impact on TWQ. Additionally, other variables, such as the proportion of women and age distribution, also affected TWQ. The study's initial results demonstrate the usefulness and validity of the study design. The results also suggest several opportunities to improve teamwork in IT organizations and avenues for further research.
2025-07-01T06:56:48Z
Philipp M. Zähl, Sabine Theis, Martin R. Wolf

http://arxiv.org/abs/2606.14164v1
Investigating Metamorphic Fuzz Oracle Enhancement via Large Language Models
2026-06-12T06:43:35Z
Fuzz drivers are essential components of greybox fuzzing, as they encapsulate target interfaces, define test spaces, and largely determine fuzzing effectiveness. 
Existing fuzz drivers typically rely on crash-based oracles for security testing, overlooking library functionality and limiting bug detection capability.
In this paper, we present the first study on metamorphic-based fuzz oracle enhancement (MFOE), which augments existing fuzz drivers with metamorphic-based oracles derived from metamorphic relations (MRs). Since constructing and integrating such oracles requires substantial domain knowledge, automating MFOE is challenging. To address this challenge, we propose MetaFOE, an LLM-based framework that automatically generates and integrates metamorphic-based oracles.
We evaluate MetaFOE on OSS-Fuzz drivers using three modern LLMs and five prompt strategies. MetaFOE generates 3,475 MRs, of which 77.3% are applicable, and implements 12,351 meta drivers, with 6,228 being valid. After three hours of fuzzing, the valid meta drivers improve edge coverage by an average of 18.7% and trigger 1,528 unique crashes. Our results demonstrate both the effectiveness of metamorphic-based oracle enhancement and the feasibility of using LLMs to automate MFOE, providing valuable insights for advancing greybox fuzzing.
2026-06-12T06:43:35Z
28 pages
Ruixiang Qian, Ding Yang, Zengxu Chen, Yuxuan Gao, Chunrong Fang, Chao Zhang, Zhenyu Chen

http://arxiv.org/abs/2606.14113v1
Simulating Students' Java Programming Errors with Large Language Models
2026-06-12T04:51:49Z
Understanding student errors in programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (closeness to authentic student mistakes), and examine how these vary with the struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. 
This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.
2026-06-12T04:51:49Z
Ali Keramati, Jie Cao, Iman Mohammadi, Mark Warschauer, Yang Shi

http://arxiv.org/abs/2508.17912v2
Citizens' Contentment with e-Government Solutions and Services in Saudi Arabia
2026-06-12T01:08:05Z
Governments around the world have worked tirelessly to develop technological solutions, on the one hand to better serve their citizens and, on the other hand, to advance in the United Nations Electronic Government Development Index (EGDI). Thus, it is crucial to assess e-government solutions and services from different aspects. This study evaluates e-government solutions and services based on user expectations in four aspects: general satisfaction, satisfaction with features, trust, and pleasure. In this study, a questionnaire was developed to allow the evaluation of e-government solutions and services in Saudi Arabia, and could also be utilized to evaluate e-government in any other nation. The study included 276 valid participants, while the required sample size was calculated using a standard sample size estimation formula for large populations (95% confidence level, 5% margin of error). In addition, descriptive analysis was used to analyze participant responses. The results showed that e-government services in Saudi Arabia achieved a level of citizen contentment that is consistent with its score in the EGDI published in the year 2024.
2025-08-25T11:29:26Z
Title, abstract, and keywords updated to align with the final version accepted for publication by Inderscience Publishers. 
Originally archived under the working title: "Evaluating Citizen Satisfaction with Saudi Arabia's E-Government Services: A Standards-Based, Theory-Informed Approach". 38 pages, 1 figure, 16 tables, journal research paper
Mohammed O. Alannsary

http://arxiv.org/abs/2606.14805v1
Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces
2026-06-11T22:34:39Z
Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re-run the trajectory to measure each event's effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge-based decision-support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool-use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero-replay counterfactual-effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high-effect before any replay is performed. BranchPoint-Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning-to-rank gradient-boosted predictor raises per-trace localization (Branch Recall@5) from 0.73 to 0.93 on held-out families at zero oracle-replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. 
The result is an auditable, cost-efficient decision-support system for AI-reliability debugging, positioned explicitly on the cost-accuracy frontier with reproducible artifacts.
2026-06-11T22:34:39Z
21 pages, 1 figure, 6 tables. Submitted to Knowledge-Based Systems
Dong Ho Kang, Hyeonjeong Cha, Daein Weon

http://arxiv.org/abs/2606.13918v1
Bayesian-Calibrated Detection of Hallucinated Package Imports in AI-Assisted Code
2026-06-11T21:15:39Z
We present a Bayesian calibration layer for slopsquat detectors -- those that flag hallucinated package imports in code produced by large language models (LLMs). Where existing pipelines emit binary decisions (flag / do-not-flag), our layer emits a Beta-posterior probability per detection, derived from a 3-category epistemic taxonomy that explicitly classifies each prior as empirically calibrated, constructively argued, or engineering-judgement-traced. Beyond the primary 200/404 registry channel, the calibrated layer exploits PyPI metadata signals -- package age, release count, author descriptor, summary -- to surface registered-but-suspicious packages that a binary registry detector misses, which is the realistic post-LLM-emission attacker regime. The resulting risk-aware primitive is directly consumable by downstream CI gates and supports principled threshold decisions across detection rules. We evaluate the calibration on a merged corpus of 1,734 Python snippets -- a stratified 189-prompt BigCodeBench slice plus a 100-prompt niche-library stress-test set, generated across a six-model panel spanning four cloud models (Claude-Sonnet-4.6, Mistral-Large, DeepSeek-v4-pro, DeepSeek-R1) and two local open-weight code models (Mistral Codestral, Meta CodeLlama). Against a re-implemented binary baseline inspired by Mahmud et al. 
-- which shares its registry oracle with our ground truth and therefore serves as a degenerate upper bound rather than a genuine competitor -- the calibrated layer reproduces the strict-registry detections and introduces well-calibrated additional flags on the metadata channel. We assess detector asymmetry with a McNemar paired test and calibration with both a flagged-subset Expected Calibration Error and a strictly proper full-corpus Brier score.
2026-06-11T21:15:39Z
23 pages, 2 figures, 5 tables
Lom M. Hillah (NewCo Partners, Paris, France; Sorbonne Université, CNRS, LIP6, Paris, France), Jean-Marc Richard (NewCo Partners, Paris, France), Ryan Hasnaoui (NewCo Partners, Paris, France)

http://arxiv.org/abs/2606.13882v1
A Principled Framework for Safe Algorithm Updates in Automated Insulin Delivery Systems
2026-06-11T20:18:13Z
Background: AID algorithms require ongoing software updates and bug fixes. In co-adapted systems, where users tune settings around existing algorithmic behavior, bug fixes can paradoxically disrupt glycemic control. No principled framework evaluates the safety of AID algorithm updates.
Methods: Our two-part framework classifies bugs and evaluates the clinical equivalence of AID system software updates. Bugs are classified as factual, heuristic, or computational, each with distinct management strategies. Classifications were validated by porting Trio's oref algorithm from JavaScript to a bug-fixed Swift implementation. We compared implementations using shadow execution on 736,480 invocations from eight Trio users. The second component assesses clinical equivalence with error analysis on paired glucose values, applied to both Trio implementations using mechanistic in silico and data-driven replay simulation.
Results: In mechanistic in silico simulation, the Swift and JavaScript implementations produced nearly identical Time in Range (84.9% vs. 84.9%) and Glycemia Risk Index (23.5% vs. 23.9%), with more than 99% of paired glucose in Parkes Error Grid Zones A and B, meeting our clinical equivalence threshold. Shadow execution showed low mismatch rates in oref components (iob 0.43%, autosens 1.22%, determineBasal 0.07%, meal 0.01%), with clinically meaningful differences in 0.03% of iob invocations. Data-driven replay simulations of bugs revealed more than 99% of downstream paired glucose in Parkes Error Grid Zones A and B, also meeting our clinical equivalence threshold.
Conclusions: Our framework integrates bug-fixing principles with multi-method clinical evaluation to assess AID algorithm update safety. It is system-agnostic and applicable to all widely used OS-AID systems, with case studies highlighting the need for systematic remediation of factual and computational bugs.
2026-06-11T20:18:13Z
Thomas Screven, Ziqiang "Joe" Zhu, Deniz Cengiz, Rayhan A. Lal, Korey K. Hood, Samuel T. King

http://arxiv.org/abs/2606.13804v1
An Empirical Study of Gemini 3 for Detecting Natural Language Test Smells in Manual Test Cases
2026-06-11T18:20:15Z
Manual testing, in which testers follow natural language instructions to validate system behavior, remains essential for uncovering issues that are difficult to capture with automation. However, manual test cases often contain test smells, quality issues such as ambiguity, redundancy, or missing checks that reduce reliability, maintainability, and reproducibility. Existing detection approaches largely depend on manually engineered rules and thus struggle to generalize and scale across heterogeneous test suites. In our previous work, we assessed the feasibility of using Small Language Models (SLMs) for test smell detection by evaluating GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B on test steps from 143 real-world Ubuntu test cases, covering seven smell types. PHI-4-14B achieved the best performance. In this article, we investigate whether a contemporary Large Language Model (GEMINI-3-PRO-PREVIEW) available at the time of the study can identify test smells in natural language manual test cases using a prompt-based, whole-test-case analysis strategy. Unlike approaches that analyze individual test steps in isolation, our approach evaluates complete test cases, enabling the model to consider relationships and dependencies among test steps. 
We evaluate the approach on 100 Ubuntu test cases covering seven test smell types and compare its performance against previously evaluated SLMs, including GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B. Our results show that GEMINI-3-PRO-PREVIEW outperforms the SLMs, while producing actionable explanations that can help practitioners revise manual test cases for greater clarity and consistency. We also find that test smells are pervasive in practice, with nearly one detected test smell per step on average, highlighting the need for scalable and automated quality support for manual testing artifacts.
2026-06-11T18:20:15Z
Keila Lucas, Rohit Gheyi, Márcio Ribeiro, Fabio Palomba, Luana Martins, Elvys Soares

http://arxiv.org/abs/2606.13802v1
A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets
2026-06-11T18:16:37Z
Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. 
We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context.
2026-06-11T18:16:37Z
Accepted at ICML 2026. Code and benchmark: https://github.com/Tej-55/NAPE
Tejas Agrawal, Vu Le, Sumit Gulwani, Gust Verbruggen

http://arxiv.org/abs/2403.17382v4
How Quickly Do Development Teams Update Their Vulnerable Dependencies?
2026-06-11T18:08:46Z
Industry practitioners are increasingly concerned with software that contains vulnerable versions of third-party dependencies that are included both directly and transitively. To address this problem, projects are encouraged to both (a) quickly update to non-vulnerable versions of dependencies and (b) be mindful of the update practices of the dependencies they choose to use. To this end, researchers have proposed metrics to measure the responsiveness of the development teams of the packages in keeping their dependencies updated: Mean-Time-To-Update (MTTU) and Mean-Time-To-Remediate (MTTR). While MTTU covers all dependencies, MTTR quantifies the time needed for a package to update its vulnerable dependencies. However, existing metrics fail to capture important nuances, such as considering floating versions and prioritizing recent updates, leading to inaccurate reflections of a development team's update practices. The goal of this study is to aid practitioners in understanding how quickly packages update their dependencies. We propose two novel metrics, Mean-Time-To-Update for dependencies (MTTU) and Mean-Time-To-Remediate for vulnerable dependencies (MTTR), that overcome the limitations of existing metrics. 
We conduct an empirical study using 163,207 packages in npm (117,129), PyPI (42,777), and Cargo (3,301) and characterize how the ecosystems differ in MTTU and MTTR, as well as what package characteristics influence MTTU and MTTR. We found that most packages have a relatively fast dependency update practice. We further study whether MTTU can be used as a proxy for MTTR when sufficient vulnerability data is not available. As we did not find enough statistical evidence for a strong proxy, our findings suggest that MTTU can serve only as a partial proxy for MTTR, to be used with caution, when vulnerability data is not available.
2024-03-26T05:01:53Z
under review
Imranur Rahman, Ranindya Paramitha, Nusrat Zahan, William Enck, Laurie Williams

http://arxiv.org/abs/2606.13763v1
Do programming languages still matter to your AI coding agent teammate? Evidence at scale from chess engines
2026-06-11T17:34:00Z
Frontier coding agents now promise end-to-end authorship of complete software systems. Two empirical questions follow: can AI coding-agent teammates program in any target language, including ones with no comparable prior open-source artefact? If so, does language choice still shape the artefact, and along which dimensions? We study both through a polyglot case study built around chess engines: non-trivial multi-component systems that admit a hierarchy of language-agnostic oracles, from exact move-generation correctness to a strength scale (Elo), observable from Rust to Brainfuck. We prompted two frontier agents (Claude Code and Codex) at the capability level, without chess knowledge or implementation guidance, under a documented intervention and stopping policy. The agents produced 34 chess engines spanning 17 primary programming languages, from mainstream to specialised, domain-specific, legacy, and esoteric targets. We combine per-engine feature analysis, independent Elo assessment, and session trajectories with qualitative analysis of code and transcripts.
Frontier coding agents are genuinely polyglot: every language we tried produced at least one feature-rich working engine, several with no prior open-source counterpart of comparable scope (e.g., LaTeX), and the code is synthesised from scratch rather than copied. Yet language choice still matters: strong playing strength is only reachable in mainstream compiled languages, cost and engineering effort grow sharply as the language becomes more exotic, and feature choices shift across language families. Agents validate their own work unprompted, but their strength self-estimates are biased and a few engines cheated by calling a chess library. Programming language is no longer about whether AI teammates can build a working system, but about performance, cost, what gets built, and how much human supervision validation still needs.
2026-06-11T17:34:00Z
Mathieu Acher, Jean-Marc Jézéquel

http://arxiv.org/abs/2602.18545v2
Programmable Property-Based Testing
2026-06-11T15:52:44Z
Property-based testing (PBT) is a popular technique for establishing confidence in software, where users write properties -- i.e., executable specifications -- that can be checked many times in a loop by a testing framework. In modern PBT frameworks, properties are usually written in shallowly embedded domain-specific languages, and their definition is tightly coupled to the way they are tested. Such frameworks often provide convenient configuration options to customize aspects of the testing process, but users are limited to precisely what library authors had the prescience to allow for when developing the framework; if they want more flexibility, they may need to write a new framework from scratch.
We propose a new, deeper language for properties based on a mixed embedding that we call deferred binding abstract syntax, which reifies properties as a data structure and decouples them from the property runners that execute them. We implement this language in Rocq and Racket, leveraging the power of dependent and dynamic types, respectively. Finally, we showcase the flexibility of this new approach by rapidly prototyping a variety of property runners, highlighting domain-specific testing improvements that can be unlocked by more programmable testing.
2026-02-20T16:52:03Z
Alperen Keles, Justine Frank, Ceren Mert, Harrison Goldstein, Leonidas Lampropoulos