https://arxiv.org/api/RmNBbYQDcYyiwj8AJKCNEteRSYU 2026-06-21T12:38:57Z 27359 90 15 http://arxiv.org/abs/2606.17387v1 Supporting the Adoption of Privacy-Enhancing Technologies through Requirements Engineering 2026-06-16T00:49:10Z In recent decades, privacy-enhancing technologies (PETs) have been recognized as a means of meeting regulatory and user privacy requirements in software systems that process personal data. Despite substantial research efforts, support from regulators, contributions by large technology companies such as Google and Microsoft, and growing interest among software practitioners, the practical adoption of PETs remains limited. Existing research consistently identifies recurring challenges to PET adoption in software engineering, such as technical complexity and insufficient training. Despite ongoing research efforts, these challenges largely remain unresolved in practice. In this industrial challenge paper, we apply a practical, requirements engineering (RE)-driven perspective to examine challenges to PET adoption across multiple stakeholder groups (PET developers, integrators, and adopters) as well as across different disciplinary perspectives (engineering, law, and business). We argue that RE can facilitate the adoption of PETs by systematically addressing each of the complementary engineering, business, and legal viewpoints on privacy. Neglecting challenges in any of these viewpoints (e.g., the impact of PETs on software architecture, their business implications, and their contribution to regulatory compliance) can increase the impediments or even lead to implementation failure. In practice, explicit specification of these viewpoints within RE can enable meaningful coordination among stakeholders to more effectively realize the benefits of PETs in software engineering.
2026-06-16T00:49:10Z Accepted to the 34th International Requirements Engineering Conference (RE 2026), Montreal, Canada, from 17 to 21 August 2026 Oleksandr Kosenkov Vadym Honcharenko Abhinava Singh Volodymyr Spirin Danica Vranjanin http://arxiv.org/abs/2606.17374v1 Verifying the Rust Standard Library 2026-06-16T00:11:04Z Rust's type system prevents many classes of memory errors, yet its standard library relies heavily on unsafe code whose correctness is validated through testing, including dynamic checks under Miri, but lacks static verification. We present what is, to the best of our knowledge, the largest verification campaign reported for a software library: an open, crowdsourced effort that integrates complementary verification tools into the continuous integration of a verification repository forked from the Rust standard library. We analyze the campaign's effectiveness, discuss the practical value of machine-checked proofs for a subset of undefined behaviors (e.g., out-of-bounds access, null and dangling pointer dereferences, and use of uninitialized memory), and frame the remaining obstacles as open challenges for the formal-methods community. 2026-06-16T00:11:04Z Published at 18th NASA Formal Methods Symposium (NFM 2026) In: Deshmukh, J., Havelund, K., Pinto, A. (eds) NASA Formal Methods. NFM 2026. Lecture Notes in Computer Science, vol 16622. Springer, Cham, pp. 415-435 Byron Cook Remi Delmas Zyad Hassan Bart Jacobs Ranjit Jhala Rahul Kumar Felipe R. Monteiro Thanh Nguyen Rebecca Rumbul Michael Tautschnig Celina Val Carolyn Zech 10.1007/978-3-032-28079-4_19 http://arxiv.org/abs/2606.19386v1 Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence 2026-06-15T21:25:52Z Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. 
We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing. 2026-06-15T21:25:52Z 10 pages, 5 figures. Sequel to arXiv:2606.04296. 
Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20) repo: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap Manvendra Modgil http://arxiv.org/abs/2606.17261v1 The Right Call for Software Benchmarking: Consistent Decisions in Stateful Environments 2026-06-15T20:11:47Z In the perpetual pursuit of performance, modern computing systems rely ever more on stateful mechanisms to accommodate the dynamics of workloads and physical environments, bolstering efficiency but confounding benchmarking and thereby the optimization of software. Indeed, by their nature, adaptive mechanisms introduce temporal dependencies between measurements and render naive estimators of individual program performance biased. Observing that rectifying such biases necessitates speculative assumptions about system dynamics, we call for prioritizing performance differentials over absolute measures and formalize software benchmarking as the decision problem of identifying the fastest program, for which relative knowledge suffices. To this end, we propose simple experiment designs admitting consistent estimators of contrasts, whereby program-specific biases cancel under tenable assumptions. These designs asymptotically yield the correct decision and afford a robust methodology for finite-budget benchmarking in stateful environments, bearing broad implications for the development of performance-sensitive software. 2026-06-15T20:11:47Z Gábor Melis http://arxiv.org/abs/2604.20912v2 Quantum-HPC Software Stacks and the openQSE Reference Architecture: A Survey 2026-06-15T18:55:55Z Quantum resources are increasingly integrated into high-performance computing (HPC) and cloud environments, but quantum high-performance computing (QHPC) software stacks remain isolated, often proprietary, full-stack solutions lacking common interfaces across runtime, resource management, orchestration, and execution layers.
This paper analyzes nine production QHPC stacks and identifies common design patterns and emerging requirements, covering deployment models, application interaction patterns, SDK support, and readiness for fault-tolerant operation. The survey exposes consistent needs in runtime abstraction, resource management, interconnect semantics, and observability. Based on these findings, we propose the open quantum-HPC software ecosystem (openQSE) reference architecture as a first step toward unifying the state-of-the-practice. openQSE defines a set of layer boundaries that allow different implementations to interoperate while preserving deployment flexibility, and is structured to support both current noisy intermediate-scale quantum (NISQ) workloads and future fault-tolerant quantum computing (FTQC) systems without changes to upper-layer application interfaces. 2026-04-22T01:56:58Z 23 pages, 2 figures Amir Shehata Brian Austin Tom Beck Lukas Burgholzer Alex Chernoguzov Spencer Churchill Andrea Delgado Yasuko Eckert Jeffery Heckey Kevin Kissell Katherine Klymko Josh Moles Thomas Naughton Lee James O'Riordan Christian Ortiz Pauyac Guen Prawiroatmodjo Ermal Rrapaj Jiri Schindler Laura Schulz Sebastian Stern Tyler Takeshita Miwako Tsuji Aleksander Wennersteen Travis Humble Martin Schulz http://arxiv.org/abs/2606.18293v1 Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming 2026-06-15T18:49:02Z Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed 'vibe coding.' It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived.
Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter. 2026-06-15T18:49:02Z 10 pages, 2 figures Callum Barbour http://arxiv.org/abs/2606.17203v1 Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management 2026-06-15T18:41:50Z Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. 
Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination. 2026-06-15T18:41:50Z Mohamed Essam Kareem Wael Azza Hassan Ahmed Haitham Mahmoud Soliman Samer Saber Ibrahim Habib http://arxiv.org/abs/2606.17197v1 Cluster-Aware Dual-Level Test Specification Generation for Large-Scale Automotive Software Requirements 2026-06-15T18:36:57Z Generating test specifications that satisfy Automotive SPICE SWE.6 requirements becomes increasingly challenging and time-consuming as projects scale to thousands of requirements. Because this manual process often consumes weeks of engineering effort, automation becomes a critical necessity. However, standard Large Language Model (LLM) approaches struggle at scale: processing requirements individually discards vital inter-requirement dependencies, while feeding entire corpora at once exceeds context-window limits, leading to incomplete integration coverage and redundant test cases. This paper presents a novel "Cluster-then-Summarize" pipeline that addresses these limitations through three stages. Requirements are embedded using sentence transformers and grouped using UMAP dimensionality reduction followed by HDBSCAN density-based clustering. This grouping utilizes an automatic minimum cluster size selection driven by a quality criterion combining normalized Silhouette and Calinski-Harabasz scores.
A multi-level map-reduce summarization algorithm then distills each cluster into concise, domain-conformant descriptions while preserving quantitative thresholds and safety integrity levels. The pipeline exploits the derived cluster topology to generate test specifications at two levels: individual requirement verification and cluster-level integration tests that verify cross-requirement feature behavior. A nearby-cluster context mechanism provides bounded cross-feature awareness during each LLM call, and Retrieval-Augmented Generation grounds all outputs in ISO 26262 and ASPICE standards. Evaluation on automotive requirement datasets of varying scale demonstrates that the cluster-aware approach improves integration test coverage and maintains summarization fidelity compared to baseline methods while scaling efficiently to thousands of requirements. 2026-06-15T18:36:57Z Hazem Ayman Menna Sedik Kareem Mostafa Mahmoud Soliman Samer Saber Ibrahim Habib http://arxiv.org/abs/2606.06523v2 Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory 2026-06-15T18:24:02Z Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior.
**Lean4Agent** launches **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying agent workflows' semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows to enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior. 2026-06-02T18:46:50Z Ruida Wang Jerry Huang Pengcheng Wang Xuanqing Liu Luyang Kong Tong Zhang http://arxiv.org/abs/2606.17164v1 PromptMN: Pseudo Prompting Language 2026-06-15T18:04:50Z Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function.
PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction. 2026-06-15T18:04:50Z 32 pages, 2 figures Enkhzol Dovdon http://arxiv.org/abs/2606.16999v1 Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models 2026-06-15T17:36:23Z Frozen small code models (<=1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model's samples without retraining; in principled form it is Popperian: attack each candidate with a severe test, keep what survives. We measure whether such operators help. 
Under one deterministic execution oracle and a leakage-free, matched-compute protocol, 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) are evaluated against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative is mechanistic: a coverage wall (systematic hard-task failures deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate <=alpha at zero observed harm unless n>=45. Two operators help on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs the standard extractor discards (robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning. 2026-06-15T17:36:23Z 33 pages, 4 figures, 8 tables Mehmet Iscan http://arxiv.org/abs/2606.16988v1 Agent trajectories as programs: fingerprinting and programming coding-agent behavior 2026-06-15T17:28:41Z Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally in different contexts, where the model, tasks, and approaches vary. 
We compare ten agents and find that they are identifiable by their behavioral habits, which we define as fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks. We develop procedural representations for agent problem-solving procedures with an emergent vocabulary induction technique that is meant to be maximally compressive to avoid surface-level variation while being expressive enough to unveil the quirks of the models' patterns. We apply our framework to the software engineering evaluation dataset SWE-Bench to study the structural distinctness of agent trajectories and find that behavior is most similar between models from similar release periods and those that are distilled from one another (e.g., a distilled student model and its teacher have a Jensen-Shannon divergence of 0.25, about half the distance between other model pairs). As more models saturate evaluations, we believe that it will be important to probe model behavior along more holistic dimensions than success rates alone. We introduce ProcGrep, a library for auditing and evaluating agents for how they approach tasks at a procedural level given their traces in a top-down fashion. We believe this work has a range of applications to help developers work with and program coding agents, such as task-aware model routing, agent monitoring, and finer-grained cost analysis. 2026-06-15T17:28:41Z Hamidah Oderinwale http://arxiv.org/abs/2606.16965v1 How Many Shots Are Enough for a Quantum Circuit? 2026-06-15T17:02:51Z Quantum algorithms require repeated circuit executions, known as shots, to estimate output distributions accurately. Determining the minimal number of shots needed to meet a target accuracy is crucial to reduce costs and resource usage, especially on today's noisy and expensive quantum hardware. 
In this paper, we address the shot optimisation problem in a black-box setting, where no assumptions are made about the structure of the quantum circuit or the noise model of the backend. We introduce IncrementalExecution, a novel online framework that dynamically determines when to stop executing shots based on the principle of point of diminishing returns: the point at which additional shots no longer significantly alter the empirical distribution of a fixed circuit. The framework supports customisable policies for shot management, enabling flexible trade-offs between execution cost and result fidelity within static execution scenarios. We assess our proposal through an extensive experimental evaluation spanning 33,750 framework configurations across 180 unique static quantum circuit-backend combinations, for a total of 7.3M independent experiments. Unlike prior work that relies on problem-specific knowledge or algorithm-dependent assumptions (e.g., variational or adaptive workflows), our approach is applicable to a large set of static circuits and immediately deployable on current quantum cloud platforms. 2026-06-15T17:02:51Z 57 pages Giuseppe Bisicchia Alessandro Bocci Ernesto Pimentel Antonio Brogi http://arxiv.org/abs/2606.16886v1 Neuro-Symbolic Software Verification: Hyper-charging Local Language Models with Symbolic Reasoning at Scale 2026-06-15T15:59:10Z Loop invariant synthesis remains a central bottleneck in formal software verification. Recent LLM-based Neuro-Symbolic tools have achieved impressive solve rates. However, these tools rely on proprietary, often expensive cloud APIs, which constitute a hurdle for privacy-sensitive industrial deployments where the source code cannot leave the organisation or where cost is a factor. We present VerIbmc, a neuro-symbolic pipeline that pairs symbolic invariant generation with locally deployable open-weight language models and the ESBMC verification tool.
Our pipeline combines a deterministic symbolic invariant synthesis phase with an iterative LLM refinement loop driven by structured verifier feedback. In addition, we provide two types of pipelines that differ in their prompting strategy: Chain-of-Thought vs. Tree-of-Thought. We conduct an extensive experimental evaluation with five open-weight models (ranging from 7B to 120B parameters) across five benchmark families comprising 520 problems (499 after excluding 21 with unavoidable overflow). Overall, the best single configuration (GPT-OSS-120B) solves 431 of 499 problems (86.4%). Additionally, on the four benchmark suites shared with the strongest cloud-API tools, VerIbmc is competitive while running only on a single local machine. The evaluation shows symbolic invariant synthesis solves 75 problems without any LLM call and yields up to +35 additional problems for the weakest model. Importantly, all inference runs entirely on a single local machine using open-weight models -- no cloud API or proprietary model is required. Overall, we demonstrate that a neuro-symbolic approach based on LLMs can be used effectively for invariant synthesis in a privacy-preserving and energy-efficient manner, without having to resort to expensive proprietary frontier models locked behind APIs. 2026-06-15T15:59:10Z Muhammad A. A. Pirzada Julian Parsert Weiqi Wang Konstantin Korovin Lucas C. Cordeiro http://arxiv.org/abs/2606.16842v1 Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course 2026-06-15T15:21:54Z Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited.
This paper reflects on the design and implementation of a project-based master's-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator's perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems. 2026-06-15T15:21:54Z Amir Mashmool Kishan Ravindra Sawant Mojtaba Shahin Nico Hochgeschwender Rainer Koschke