http://arxiv.org/abs/2603.17909v1 In Perfect Harmony: Orchestrating Causality in Actor-Based Systems 2026-03-18T16:47:25Z

Runtime verification has gained popularity as a lightweight approach for increasing assurance in systems under scrutiny. Performing runtime checks enables dynamic monitoring and alerts for unexpected behavior, thereby improving reliability and correctness. Actor-based systems present significant challenges for runtime verification. Properties frequently span multiple actors with complex causal dependencies, while nondeterministic message interleavings can obscure execution semantics. Moreover, most existing monitoring tools are designed for single-process behavior. This paper presents ACTORCHESTRA, a runtime verification framework for Erlang that automatically tracks causality across multi-actor interactions. The framework instruments Erlang systems that comply with OTP guidelines via targeted code injection. This method establishes the orchestration infrastructure required to track causal relationships between actors without requiring manual modifications to the target system. To ease the specification of multi-actor properties, the framework provides WALTZ, a specification language that automatically compiles properties into executable Erlang monitors that integrate with the instrumented system. Three case studies demonstrate ACTORCHESTRA's effectiveness in detecting complex behavioral violations in real-world actor systems. A performance evaluation quantifies the runtime overhead of the monitoring infrastructure and analyzes the trade-offs between added safety guarantees and execution costs.

2026-03-18T16:47:25Z Accepted at the 19th IEEE International Conference on Software Testing, Verification and Validation (ICST 2026) Vladyslav Mikytiv Bernardo Toninho Carla Ferreira http://arxiv.org/abs/2603.17893v1 scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns 2026-03-18T16:23:02Z

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.
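
As one illustration of the preprocessing-leakage pattern named above, the following minimal sketch contrasts fitting scaling statistics on the full dataset with fitting them on the training split alone. It uses only the standard library. The data and the helper names (`fit_scaler`, `standardize_leaky`, `standardize_clean`) are hypothetical and are not taken from scicode-lint.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, population std) of the given values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


def standardize_leaky(train, test):
    # Leaky: the statistics are fitted on train + test, so information
    # about the held-out data flows into the training features.
    mu, sigma = fit_scaler(train + test)
    return transform(train, mu, sigma), transform(test, mu, sigma)


def standardize_clean(train, test):
    # Clean: the statistics are fitted on the training split only and
    # then reused, unchanged, for the test split.
    mu, sigma = fit_scaler(train)
    return transform(train, mu, sigma), transform(test, mu, sigma)


# Hypothetical toy data in which the test split is shifted.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

leaky_train, leaky_test = standardize_leaky(train, test)
clean_train, clean_test = standardize_clean(train, test)
```

In the clean variant the standardized training features have zero mean by construction. In the leaky variant they are shifted by the test data, which is the kind of silent, plausible-looking error the abstract describes.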

2026-03-18T16:23:02Z Sergey V. Samsonau http://arxiv.org/abs/2603.15722v2 A Framework and Prototype for a Navigable Map of Datasets in Engineering Design and Systems Engineering 2026-03-18T15:32:25Z

The proliferation of data across the system lifecycle presents both a significant opportunity and a challenge for Engineering Design and Systems Engineering (EDSE). While this "digital thread" has the potential to drive innovation, the fragmented and inaccessible nature of existing datasets hinders method validation, limits reproducibility, and slows research progress. Unlike fields such as computer vision and natural language processing, which benefit from established benchmark ecosystems, engineering design research often relies on small, proprietary, or ad-hoc datasets. This paper addresses this challenge by proposing a systematic framework for a "Map of Datasets in EDSE." The framework is built upon a multi-dimensional taxonomy designed to classify engineering datasets by domain, lifecycle stage, data type, and format, enabling faceted discovery. An architecture for an interactive discovery tool is detailed and demonstrated through a working prototype, employing a knowledge graph data model to capture rich semantic relationships between datasets, tools, and publications. An analysis of the current data landscape reveals underrepresented areas ("data deserts") in early-stage design and system architecture, as well as relatively well-represented areas ("data oases") in predictive maintenance and autonomous systems. The paper identifies key challenges in curation and sustainability and proposes mitigation strategies, laying the groundwork for a dynamic, community-driven resource to accelerate data-centric engineering research.

2026-03-16T17:08:20Z 10 pages, 3 figures, Submitted to ASME IDETC 2026-DAC22 H. Sinan Bank Daniel R. Herber http://arxiv.org/abs/2603.17833v1 ArchBench: Benchmarking Generative-AI for Software Architecture Tasks 2026-03-18T15:26:46Z

Benchmarks for large language models (LLMs) have progressed from snippet-level function generation to repository-level issue resolution, yet they overwhelmingly target implementation correctness. Software architecture tasks remain under-specified and difficult to compare across models, despite their central role in maintaining and evolving complex systems. We present ArchBench, the first unified platform for benchmarking LLM capabilities on software architecture tasks. ArchBench provides a command-line tool with a standardized pipeline for dataset download, inference with trajectory logging, and automated evaluation, alongside a public web interface with an interactive leaderboard. The platform is built around a plugin architecture where each task is a self-contained module, making it straightforward for the community to contribute new architectural tasks and evaluation results. We use the term LLMs broadly to encompass generative AI (GenAI) solutions for software engineering, including both standalone models and LLM-based coding agents equipped with tools. Both the CLI tool and the web platform are openly available to support reproducible research and community-driven growth of architectural benchmarking.

2026-03-18T15:26:46Z 5 pages, 3 figures, Software Architecture Showcase Track, ICSA 2026 Bassam Adnan Aviral Gupta Sreemaee Akshathala Karthik Vaidhyanathan http://arxiv.org/abs/2603.17829v1 CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents 2026-03-18T15:25:42Z

A prerequisite for coding agents to perform tasks on large repositories is code localization: the identification of relevant files, classes, and functions to work on. While repository-level code localization has been performed using embedding-based retrieval approaches such as vector search, recent work has focused on developing agents to localize relevant code either as a standalone precursor to, or interleaved with, performing actual work. Most prior methods for agentic code search equip the agent with complex, specialized tools, such as repository graphs derived from static analysis. In this paper, we demonstrate that, with an effective reinforcement learning recipe, a coding agent equipped with nothing more than a standard Unix terminal can be trained to achieve strong results. Our experiments on three benchmarks (SWE-Bench Verified, Pro, and Lite) reveal that our models consistently achieve superior or competitive performance relative to base and post-trained LLMs that are 2-18x larger, and sometimes approach the performance of closed models like Claude Sonnet, even when using specialized scaffolds. Our work particularly focuses on techniques for re-purposing existing coding agent environments for code search, reward design, and RL optimization. We release the resulting model family, CodeScout, along with all our code and data for the community to build upon.

2026-03-18T15:25:42Z Lintang Sutawika Aditya Bharat Soni Bharath Sriraam R R Apurva Gandhi Taha Yassine Sanidhya Vijayvargiya Yuchen Li Xuhui Zhou Yilin Zhang Leander Melroy Maben Graham Neubig http://arxiv.org/abs/2603.17826v1 FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair 2026-03-18T15:24:22Z

Multimodal Automated Program Repair (MAPR) extends traditional program repair by requiring models to jointly reason over source code, textual issue descriptions, and visual artifacts such as GUI screenshots. While recent LLM-based repair systems have shown promising results, existing approaches face several limitations: rigid workflow pipelines restrict exploration during debugging, visual reasoning is often performed over full-page screenshots without localized grounding, and failed repair attempts are rarely transformed into reusable knowledge. To address these challenges, we propose FailureMem, a multimodal repair framework that integrates three key mechanisms: a hybrid workflow-agent architecture that balances structured localization with flexible reasoning, active perception tools that enable region-level visual grounding, and a Failure Memory Bank that converts past repair attempts into reusable guidance. Experiments on SWE-bench Multimodal demonstrate that FailureMem improves the resolved rate over GUIRepair by 3.7%.

2026-03-18T15:24:22Z Ruize Ma Yilei Jiang Shilin Zhang Zheng Ma Yi Feng Vincent Ng Zhi Wang Xiangyu Yue Chuanyi Li Lewei Lu http://arxiv.org/abs/2601.08806v2 APEX-SWE 2026-03-18T14:30:55Z

We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eleven frontier models for the APEX-SWE leaderboard. Claude Opus 4.6 and Claude Opus 4.5 perform best, both with a Pass@1 score of 38.5%. Our analysis shows that strong performance is primarily driven by epistemic discipline, defined as the capacity to distinguish between assumptions and verified facts, combined with systematic verification prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
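
The abstract does not state how Pass@1 is computed. A common convention is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled attempts of which c pass. The sketch below assumes that convention. The function names and the sample counts are illustrative only and are not taken from the APEX-SWE harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of them passing."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task results: (attempts, passing attempts).
example_results = [(2, 1), (2, 0)]
example_score = mean_pass_at_k(example_results, k=1)
```

For k = 1 the estimator reduces to c / n per task, so a benchmark-level Pass@1 under this assumption is the mean fraction of passing attempts across tasks.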

2026-01-13T18:44:08Z Abhi Kottamasu Chirag Mahapatra Sam Lee Ben Pan Aakash Barthwal Akul Datta Ajay Arun Silas Alberti Adarsh Hiremath Brendan Foody Bertie Vidgen http://arxiv.org/abs/2507.10593v2 ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs 2026-03-18T14:23:55Z

Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents ToolRegistry, a protocol-agnostic tool management system that has evolved from a single library into a modular three-package ecosystem: a core registry for tool management and execution, a server package providing protocol adapters (MCP, OpenAPI) and routing, and a hub package offering curated, production-tested tool implementations. Beyond the original contributions of unified registration, automated schema generation, and dual-mode concurrent execution, the ecosystem now includes an independent MCP client supporting four transport mechanisms, a web-based admin panel for runtime management, an event system for change propagation, and fine-grained tool lifecycle control. Our evaluation demonstrates that ToolRegistry achieves 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and broad compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. ToolRegistry is open-source and available at https://github.com/Oaklight/ToolRegistry, with comprehensive documentation at https://toolregistry.readthedocs.io/.
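
To make "automated schema generation" concrete, here is a minimal sketch of deriving an OpenAI-style function-calling schema from a Python function signature. It uses only the standard library. The `MiniRegistry` class, its methods, and the type mapping are hypothetical and are not ToolRegistry's actual API; consult the project documentation for the real interface.

```python
import inspect
from typing import Any, Callable, Dict

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


class MiniRegistry:
    """Hypothetical registry: stores callables and emits tool schemas."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, func: Callable[..., Any]) -> Callable[..., Any]:
        # Usable as a decorator; the function itself is returned unchanged.
        self._tools[func.__name__] = func
        return func

    def schema(self, name: str) -> Dict[str, Any]:
        func = self._tools[name]
        properties: Dict[str, Any] = {}
        required = []
        for pname, param in inspect.signature(func).parameters.items():
            # Unknown or missing annotations fall back to "string".
            json_type = _JSON_TYPES.get(param.annotation, "string")
            properties[pname] = {"type": json_type}
            if param.default is inspect.Parameter.empty:
                required.append(pname)
        return {
            "type": "function",
            "function": {
                "name": name,
                "description": inspect.getdoc(func) or "",
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                },
            },
        }

    def call(self, name: str, **kwargs: Any) -> Any:
        # Dispatch a tool call by name with keyword arguments.
        return self._tools[name](**kwargs)


registry = MiniRegistry()


@registry.register
def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


add_schema = registry.schema("add")
add_result = registry.call("add", a=2, b=3)
```

Deriving the schema from the signature keeps the tool definition in one place: changing a parameter or its default changes the schema without a hand-written JSON description to keep in sync.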

2025-07-11T20:23:23Z 15 pages, 4 figures, v2: major revision reflecting ecosystem evolution to three-package architecture Peng Ding http://arxiv.org/abs/2603.16013v2 Safety Case Patterns for VLA-based driving systems: Insights from SimLingo 2026-03-18T14:04:09Z

Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving: by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and their potential to support socially responsible autonomous driving and to understand high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.

2026-03-16T23:43:38Z Gerhard Yu Fuyuki Ishikawa Oluwafemi Odu Alvine Boaye Belle http://arxiv.org/abs/2603.14225v2 "I'm Not Reading All of That": Understanding Software Engineers' Level of Cognitive Engagement with Agentic Coding Assistants 2026-03-18T13:41:18Z

Over-reliance on AI systems can undermine users' critical thinking and promote complacency, a risk intensified by the emergence of agentic AI systems that operate with minimal human involvement. In software engineering, agentic coding assistants (ACAs) are rapidly becoming embedded in everyday development workflows. Since software engineers (SEs) create systems deployed across diverse and high-stakes real-world contexts, these assistants must function not merely as autonomous task performers but as Tools for Thought that actively support human reasoning and sensemaking. We conducted a formative study examining software engineers' cognitive engagement and sensemaking processes when working with an ACA. Our findings reveal that cognitive engagement consistently declines as tasks progress, and that current ACA designs provide limited affordances for reflection, verification, and meaning-making. Based on these findings, we identify concrete design opportunities leveraging richer interaction modalities and cognitive-forcing mechanisms to sustain engagement and promote deeper thinking in AI-assisted programming.

2026-03-15T05:03:20Z 7 pages, 5 figures, 2 tables, published and presented in CHI 2026 Workshop on Tools for Thought Carlos Rafael Catalan Lheane Marie Dizon Patricia Nicole Monderin Emily Kuang http://arxiv.org/abs/2603.17659v1 From Symbol to Meaning: Ontological and Philosophical Reflections on Large Language Models in Information Systems Engineering 2026-03-18T12:26:57Z

The advent of Large Language Models (LLMs) represents a turning point in the theoretical foundations of Information Systems Engineering. Beyond their technical significance, LLMs challenge the ontological, epistemological, and semiotic assumptions that have long structured our understanding of information, representation, and knowledge. This article proposes an integrative reflection on how LLMs reconfigure the relationships among language, meaning, and system design, suggesting that their emergence demands a re-examination of the conceptual foundations of contemporary information systems. Drawing on philosophical traditions from Peirce to Heidegger and Floridi, we investigate how the logic of generative models both extends and destabilises classical notions of ontology and signification. The discussion emphasises the necessity of grounding LLM-based systems in transparent, ethically coherent frameworks that respect the integrity of human-centred knowledge processes. Ultimately, the paper argues that LLMs should be understood not merely as tools for automation but as epistemic agents that reshape the philosophical and semiotic foundations of information systems engineering.

2026-03-18T12:26:57Z This paper constitutes a substantially extended version of a conference article to be published in the proceedings of the International Conference on Enterprise Information Systems ICEIS 2026 José Palazzo Moreira de Oliveira http://arxiv.org/abs/2603.17648v1 Requirements Volatility in Software Architecture Design: An Exploratory Case Study 2026-03-18T12:08:55Z

Requirements volatility is a major issue in software (SW) development, causing problems such as project delays and cost overruns. Even though there is a considerable amount of research related to requirements volatility, the majority of it is inclined toward project management aspects. The relationship between SW architecture design and requirements volatility has not been researched widely, even though changing requirements may, for example, lead to higher defect density during testing. An exploratory case study was conducted to study how requirements volatility affects SW architecture design. Fifteen semi-structured, thematic interviews were conducted in the case company, which provides a selection of software products for business customers and consumers. The research revealed factors, such as requirements uncertainty and a dynamic business environment, that cause requirements volatility in the case company. The study identified the challenges that requirements volatility posed to SW architecture design, including scheduling and architectural technical debt. In addition, this study discusses means of mitigating the factors that cause requirements volatility and addressing the challenges posed by requirements volatility. SW architects are strongly influenced by requirements volatility. Thus, understanding the factors causing requirements volatility, as well as the means to mitigate the challenges, has high industrial relevance.

2026-03-18T12:08:55Z International Conference on Software and System Process 2017 Sanja Aaramaa Sandun Dasanayake Markku Oivo Jouni Markkula Samuli Saukkonen 10.1145/3084100.3084105 http://arxiv.org/abs/2603.03823v3 SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration 2026-03-18T12:07:41Z

Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations -- a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

2026-03-04T08:20:25Z Jialong Chen Xander Xu Hu Wei Chuan Chen Bing Zhao http://arxiv.org/abs/2511.17762v2 The Software Engineering Simulations Lab: Agentic AI for RE Quality Simulations 2026-03-18T10:25:47Z

Context and motivation: Requirements Engineering (RE) quality still lacks empirical evidence on how specific requirement defects affect downstream activities. Problem: Empirical data on the detailed effects of requirements quality defects is scarce because it is costly to obtain. Furthermore, with the advent of AI-based development, the requirements quality factors may change: requirements are no longer consumed only by humans but increasingly also by AI agents, which might call for a different requirements style to be efficient and effective. Principal ideas: We propose to extend the RE research toolbox with Agentic AI simulations, in which software engineering (SE) processes are replicated by standardized agents in qualitative simulations. We argue that their speed and simplicity make them a valuable addition to RE research, although limitations in replicating human behavior need to be studied and understood. Contribution: This paper contributes a first concept, a research roadmap, a prototype, and a first feasibility study for RE simulations with agentic AI. Study results indicate that even a naïve implementation leads to executable simulations, encouraging technical improvements along with broader application in RE research.

2025-11-21T20:19:08Z Henning Femmer Ivan Esau http://arxiv.org/abs/2510.01002v2 Semantics-Aligned, Curriculum-Driven, and Reasoning-Enhanced Vulnerability Repair Framework 2026-03-18T08:58:13Z

Current learning-based Automated Vulnerability Repair (AVR) approaches, while promising, often fail to generalize effectively in real-world scenarios. Our diagnostic analysis reveals three fundamental weaknesses in state-of-the-art AVR approaches: (1) limited cross-repository generalization, with performance drops on unseen codebases; (2) inability to capture long-range dependencies, causing performance degradation on complex, multi-hunk repairs; and (3) over-reliance on superficial lexical patterns, leading to significant performance drops on vulnerabilities with minor syntactic variations like variable renaming. To address these limitations, we propose SeCuRepair, a semantics-aligned, curriculum-driven, and reasoning-enhanced framework for vulnerability repair. At its core, SeCuRepair adopts a reason-then-edit paradigm, requiring the model to articulate why and how a vulnerability should be fixed before generating the patch. This explicit reasoning enforces a genuine understanding of repair logic rather than superficial memorization of lexical patterns. SeCuRepair also moves beyond traditional supervised fine-tuning and employs semantics-aware reinforcement learning, rewarding patches for their syntactic and semantic alignment with the oracle patch rather than mere token overlap. Complementing this, a difficulty-aware curriculum progressively trains the model, starting with simple fixes and advancing to complex, multi-hunk coordinated edits. We evaluate SeCuRepair on strict, repository-level splits of the BigVul and newly crafted PrimeVul_AVR datasets. SeCuRepair significantly outperforms all baselines, surpassing the best-performing baselines by 34.52% on BigVul and 31.52% on PrimeVul_AVR in terms of CodeBLEU, respectively. Comprehensive ablation studies further confirm that each component of our framework contributes to its final performance.

2025-10-01T15:09:27Z Chengran Yang Ting Zhang Jinfeng Jiang Xin Zhou Haoye Tian Mingzhe Du Jieke Shi Junkai Chen Yikun Li Eng Lieh Ouh Lwin Khin Shar David Lo