https://arxiv.org/api/YQnzckcY8UffrH8BNMsHPEgdwfo 2026-03-22T17:17:27Z 25302 105 15

http://arxiv.org/abs/2603.15375v1 Formalizing and validating properties in Asmeta with Large Language Models (Extended Abstract) 2026-03-16T14:50:39Z

Writing temporal logic properties is often a challenging task for users of model-based development frameworks, particularly when translating informal requirements into formal specifications. In this paper, we explore the idea of integrating Large Language Models (LLMs) into the Asmeta framework to assist users during the definition, formalization, explanation, and validation of temporal properties. We present a workflow in which an LLM-based agent supports these activities by leveraging the Asmeta specification and the feedback produced by the model checker. This work serves as a proof of concept that illustrates the feasibility and potential benefits of such an integration through representative examples.

2026-03-16T14:50:39Z Andrea Bombarda Silvia Bonfanti Angelo Gargantini Nico Pellegrinelli

http://arxiv.org/abs/2603.15372v1 SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations 2026-03-16T14:48:53Z

As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
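
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that each "+pp" value is the with-skill pass rate minus the baseline pass rate for the same model, so the baseline is recovered by subtraction; the function and variable names are illustrative and not taken from the benchmark.

```python
# Counts stated in the abstract.
SCENARIOS = 37
MODEL_CONDITIONS = 5
total_runs = SCENARIOS * MODEL_CONDITIONS  # matches the 185 scenario-runs reported

# (with-skill pass rate in %, lift in percentage points), as reported.
reported = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}


def implied_baseline(with_skill_pct, lift_pp):
    """Baseline pass rate, ASSUMING lift = with-skill rate - baseline rate."""
    return round(with_skill_pct - lift_pp, 1)


baselines = {
    model: implied_baseline(with_skill, lift)
    for model, (with_skill, lift) in reported.items()
}

for model, baseline in baselines.items():
    print(f"{model}: implied baseline {baseline}%")
```

Under that assumption, a model with a lower with-skill score can still show the larger lift, because the lift is measured against its own baseline rather than against the leader.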

2026-03-16T14:48:53Z Ivo Brett

http://arxiv.org/abs/2603.15366v1 To be FAIR or RIGHT? Methodological [R]esearch [I]ntegrity [G]iven [H]uman-facing [T]echnologies using the example of Learning Technologies 2026-03-16T14:41:05Z

Quality assessment of Research Software Engineering (RSE) plays an important role in all scientific fields. Of the three canonical criteria (reliability, validity, and objectivity), previous research has focussed on reliability and the FAIR principles. The RIGHT framework is introduced to fill the gap that existing frameworks leave with respect to validity. The framework is constructed using the methods of theory transfer and process modelling. It is based on existing models of simulation research, design-based research, software engineering and empirical social sciences. The paper concludes with two case studies drawn from the field of learning technologies to illustrate the practical relevance of the framework for human-facing RSE.

2026-03-16T14:41:05Z Julian Dehne

http://arxiv.org/abs/2603.15707v1 SEMAG: Self-Evolutionary Multi-Agent Code Generation 2026-03-16T13:24:55Z

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.

2026-03-16T13:24:55Z Yulin Peng Haowen Hou Xinxin Zhu Ying Tiffany He F. Richard Yu

http://arxiv.org/abs/2602.08561v2 Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches 2026-03-16T13:09:04Z

Reproducing computational research is often assumed to be as simple as rerunning the original code with provided data. In practice, missing packages, fragile file paths, version conflicts, or incomplete logic frequently cause analyses to fail, even when materials are shared. This study investigates whether large language models and AI agents can automate the diagnosis and repair of such failures, making computational results easier to reproduce and verify. We evaluate this using a controlled reproducibility testbed built from five fully reproducible R-based social science studies. Realistic failures were injected, ranging from simple issues to complex missing logic, and two automated repair workflows were tested in clean Docker environments. The first workflow is prompt-based, repeatedly querying language models with structured prompts of varying context, while the second uses agent-based systems that inspect files, modify code, and rerun analyses autonomously. Across prompt-based runs, reproduction success ranged from 31-79 percent, with performance strongly influenced by prompt context and error complexity. Complex cases benefited most from additional context. Agent-based workflows performed substantially better, with success rates of 69-96 percent across all complexity levels. These results suggest that automated workflows, especially agent-based systems, can significantly reduce manual effort and improve reproduction success across diverse error types. Unlike prior benchmarks, our testbed isolates post-publication repair under controlled failure modes, allowing direct comparison of prompt-based and agent-based approaches.

2026-02-09T11:59:59Z 12 pages, 5 figures. Submitted to ACM conference Syed Mehtab Hussain Shah Frank Hopfgartner Arnim Bleier

http://arxiv.org/abs/2603.13023v2 daVinci-Env: Open SWE Environment Synthesis at Scale 2026-03-16T11:55:18Z

Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.

2026-03-13T14:32:40Z Dayuan Fu Shenyu Wu Yunze Wu Zerui Peng Yaxing Huang Jie Sun Ji Zeng Mohan Jiang Lin Zhang Yukun Li Jiarui Hu Liming Liu Jinlong Hou Pengfei Liu

http://arxiv.org/abs/2603.15087v1 Beyond Monolithic Models: Symbolic Seams for Composable Neuro-Symbolic Architectures 2026-03-16T10:41:01Z

Current Artificial Intelligence (AI) systems are frequently built around monolithic models that entangle perception, reasoning, and decision-making, a design that often conflicts with established software architecture principles. Large Language Models (LLMs) amplify this tendency, offering scale but limited transparency and adaptability. To address this, we argue for composability as a guiding principle that treats AI as a living architecture rather than a fixed artifact. We introduce symbolic seams: explicit architectural breakpoints where a system commits to inspectable, typed boundary objects, versioned constraint bundles, and decision traces. We describe how seams enable a composable neuro-symbolic design that combines the data-driven adaptability of learned components with the verifiability of explicit symbolic constraints -- combining strengths neither paradigm achieves alone. By treating AI systems as assemblies of interchangeable parts rather than indivisible wholes, we outline a direction for intelligent systems that are extensible, transparent, and amenable to principled evolution.

2026-03-16T10:41:01Z Submitted to New and Emerging Ideas (NEMI) track at ICSA 2026 Nicolas Schuler Vincenzo Scotti Raffaela Mirandola

http://arxiv.org/abs/2305.13883v3 Leveraging Imperfect Sources to Detect Fairwashing in Black-Box Auditing 2026-03-16T10:30:53Z

Algorithmic auditing has become central to platform accountability under frameworks such as the AI Act and the Digital Services Act. In practice, this obligation is discharged through dedicated Audit APIs. This architecture creates a paradox: the entity under scrutiny controls the evaluation interface. A platform facing legal sanctions can serve a compliant surrogate model on its Audit API, while running a discriminatory production system. This deceptive practice is known as fairwashing. Manipulation is undetectable if the auditor relies on only one source. To address this limitation, we introduce the Two-Source Audit Model (2SAM). This model cross-references the Audit API with an independent trusted stream. The key insight is that the trusted stream does not need to be perfectly aligned with the Audit API. We introduce a consistency proxy, a probabilistic mapping that can reconcile discrepancies between sources. This approach yields three results. First, we quantify the rate of manipulation above which a single-source auditor is blind. Second, we show how proxy quality governs detection power. Third, we provide a closed-form budget condition guaranteeing detection at any target confidence level, closing the blind spot mentioned above. We validate 2SAM on the UCI Adult dataset, achieving $70\%$ detection power with as few as $127$ cross-verification queries out of a total budget of $750$, using a name-based gender proxy with $94.2\%$ accuracy.

2023-05-23T10:06:22Z 23 pages, 10 figures Jade Garcia Bourrée Erwan Le Merrer Gilles Tredan Benoît Rottembourg

http://arxiv.org/abs/2603.10969v2 TOSSS: a CVE-based Software Security Benchmark for Large Language Models 2026-03-16T10:16:20Z

With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.
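
One simple reading of the 0-to-1 score is the fraction of two-option trials in which the model picks the secure snippet. The sketch below assumes exactly that; the abstract only fixes the two endpoints, so the actual TOSSS scoring rule may differ, and the function name is illustrative.

```python
def security_score(choices):
    """Fraction of trials in which the secure snippet was selected.

    `choices` is an iterable of booleans: True when the model picked the
    secure snippet, False when it picked the vulnerable one.
    ASSUMPTION: the score is a plain proportion; the benchmark may weight
    or aggregate trials differently.
    """
    choices = list(choices)
    if not choices:
        raise ValueError("at least one trial is required")
    return sum(choices) / len(choices)


# Endpoints described in the abstract.
always_secure = security_score([True, True, True, True])
always_vulnerable = security_score([False, False, False])

# A mixed run: three secure picks out of four trials.
mixed = security_score([True, False, True, True])
print(always_secure, always_vulnerable, mixed)
```

With only two options per trial, a proportion near 0.5 would correspond to picking at random, which gives some context for the low end of the reported 0.48 to 0.89 range.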

2026-03-11T16:54:01Z Marc Damie Murat Bilgehan Ertan Domenico Essoussi Angela Makhanu Gaëtan Peter Roos Wensveen

http://arxiv.org/abs/2603.10621v2 QuantumX: an experience for the consolidation of Quantum Computing and Quantum Software Engineering as an emerging discipline 2026-03-16T09:40:45Z

The first edition of the QuantumX track, held within the XXIX Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2025), brought together leading Spanish research groups working at the intersection of Quantum Computing and Software Engineering. The event served as a pioneering forum to explore how principles of software quality, governance, testing, orchestration, and abstraction can be adapted to the quantum paradigm. The presented works spanned diverse areas (from quantum service engineering and hybrid architectures to quality models, circuit optimization, and quantum machine learning), reflecting the interdisciplinary nature and growing maturity of Quantum Computing and Quantum Software Engineering. The track also fostered community building and collaboration through the presentation of national and Ibero-American research networks such as RIPAISC and QSpain, and through dedicated networking sessions that encouraged joint initiatives. Beyond reporting on the event, this article provides a structured synthesis of the contributions presented at QuantumX, identifies common research themes and engineering concerns, and outlines a set of open challenges and future directions for the advancement of Quantum Software Engineering. This first QuantumX track established the foundation for a sustained research community and positioned Spain as an emerging contributor to the European and global quantum software ecosystem.

2026-03-11T10:33:14Z 16 pages Juan M. Murillo Ignacio García Rodríguez de Guzmán Enrique Moguel Javier Romero-Álvarez Jaime Alvarado-Valiente Álvaro M. Aparicio-Morales Jose Garcia-Alonso Ana Díaz Muñoz Eduardo Fernández-Medina Francisco Chicano Carlos Canal José Daniel Viqueira Sebastián Villarroya Eduardo Gutiérrez Adrián Romero-Flores Alfonso E. Márquez-Chamorro Antonio Ruiz-Cortes Cyrille YetuYetu Kesiku Pedro Sánchez Diego Alonso Cáceres Lidia Sánchez-González Fernando Plou

http://arxiv.org/abs/2603.15021v1 Describing Agentic AI Systems with C4: Lessons from Industry Projects 2026-03-16T09:23:27Z

Different domains foster different architectural styles -- and thus different documentation practices (e.g., state-based models for behavioral control vs. ER-style models for information structures). Agentic AI systems exhibit another characteristic style: specialized agents collaborate by exchanging artifacts, invoking external tools, and coordinating via recurring interaction patterns and quality gates. As these systems evolve into long-lived industrial solutions, documentation must capture these style-defining concerns rather than relying on ad-hoc code sketches or pipeline drawings. This paper reports industrial experience from joint projects and derives a documentation systematics tailored to this style. Concretely, we provide (i) a style-oriented modeling vocabulary and a small set of views for agents, artifacts, tools, and their coordination patterns, (ii) a hierarchical description technique aligned with C4 to structure these views across abstraction levels, and (iii) industrial examples with lessons learned that demonstrate how the approach yields transparent, maintainable architecture documentation supporting sustained evolution.

2026-03-16T09:23:27Z Andreas Rausch Stefan Wittek

http://arxiv.org/abs/2603.15004v1 TriFusion-LLM: Prior-Guided Multimodal Fusion with LLM Arbitration for Fine-grained Code Clone Detection 2026-03-16T09:08:07Z

Code clone detection (CCD) supports software maintenance, refactoring, and security analysis. Although pre-trained models capture code semantics, most work reduces CCD to binary classification, overlooking the heterogeneity of clone types and the seven fine-grained categories in BigCloneBench. We present Full Model, a multimodal fusion framework that jointly integrates heuristic similarity priors from classical machine learning, structural signals from abstract syntax trees (ASTs), and deep semantic embeddings from CodeBERT into a single predictor. By fusing structural, statistical, and semantic representations, Full Model improves discrimination among fine-grained clone types while keeping inference cost practical. On the seven-class BigCloneBench benchmark, Full Model raises Macro-F1 from 0.695 to 0.875. Ablation studies show that using the primary model's probability distribution as a prior to guide selective arbitration by a large language model (LLM) substantially outperforms blind reclassification; arbitrating only ~0.2% of high-uncertainty samples yields an additional 0.3 absolute Macro-F1 gain. Overall, Full Model achieves an effective performance-cost trade-off for fine-grained CCD and offers a practical solution for large-scale industrial deployment.
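
The prior-guided selective arbitration idea can be illustrated with a small routing sketch. It assumes uncertainty is measured by the entropy of the primary model's class probabilities and that a fixed fraction of the most uncertain samples is sent to an arbiter together with that distribution; the abstract does not state the uncertainty measure or the selection rule, and `arbiter` is a hypothetical stand-in for an LLM call.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_arbitration(prob_rows, fraction):
    """Indices of the most uncertain samples, ceil(fraction * n) of them."""
    n = len(prob_rows)
    k = min(n, math.ceil(fraction * n))
    ranked = sorted(range(n), key=lambda i: entropy(prob_rows[i]), reverse=True)
    return sorted(ranked[:k])


def classify(prob_rows, fraction, arbiter):
    """Keep the primary model's argmax, except for arbitrated samples.

    `arbiter(index, probs)` is a HYPOTHETICAL callback; it receives the
    primary distribution as a prior and returns a class index.
    """
    chosen = set(select_for_arbitration(prob_rows, fraction))
    labels = []
    for i, probs in enumerate(prob_rows):
        if i in chosen:
            labels.append(arbiter(i, probs))
        else:
            labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels


# Toy three-class example; the stub arbiter always answers class 2.
rows = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],  # near-uniform, so the most uncertain row
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
]
labels = classify(rows, fraction=0.25, arbiter=lambda i, probs: 2)
print(labels)
```

Routing only a small fraction keeps the number of expensive arbiter calls low, which is the performance-cost trade-off the abstract points to.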

2026-03-16T09:08:07Z Mengdi Li Yuming Liu He Wang Zifeng Xu Yuqing Zhang

http://arxiv.org/abs/2603.15699v1 This Is Taking Too Long -- Investigating Time as a Proxy for Energy Consumption of LLMs 2026-03-16T08:26:57Z

The energy consumption of Large Language Models (LLMs) is raising growing concerns due to their adverse effects on environmental stability and resource use. Yet, these energy costs remain largely opaque to users, especially when models are accessed through an API -- a black box in which all information depends on what providers choose to disclose. In this work, we investigate inference time measurements as a proxy to approximate the associated energy costs of API-based LLMs. We ground our approach by comparing our estimations with actual energy measurements from locally hosted equivalents. Our results show that time measurements allow us to infer GPU models for API-based LLMs, grounding our energy cost estimations. Our work aims to create means for understanding the associated energy costs of API-based LLMs, especially for end users.
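
The basic relation behind using time as a proxy is that energy equals average power multiplied by duration. The sketch below assumes a constant power draw for the inferred hardware over the whole request; the power figure and function name are hypothetical and are not taken from the paper.

```python
def estimate_energy_wh(inference_seconds, assumed_power_watts):
    """Energy in watt-hours, ASSUMING constant power draw during inference."""
    if inference_seconds < 0 or assumed_power_watts < 0:
        raise ValueError("time and power must be non-negative")
    return assumed_power_watts * inference_seconds / 3600.0


# Hypothetical numbers: a 12-second response on hardware assumed to draw 300 W.
energy = estimate_energy_wh(inference_seconds=12.0, assumed_power_watts=300.0)
print(f"{energy} Wh")
```

The estimate is only as good as the assumed power figure, which is why inferring the GPU model behind an API matters for grounding it.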

2026-03-16T08:26:57Z This work was accepted at PerCom 2026 Lars Krupp Daniel Geißler Francisco M. Calatrava-Nicolas Vishal Banwari Paul Lukowicz Jakob Karolus

http://arxiv.org/abs/2602.16291v7 A Calculus of Inheritance 2026-03-16T07:35:47Z

Just as the $λ$-calculus uses three primitives (abstraction, application, variable) as the foundation of functional programming, inheritance-calculus uses three primitives (record, definition, inheritance) as the foundation of declarative programming. By unifying modules, classes, objects, methods, fields, and locals under a single record abstraction, the calculus models inheritance simply as set union. Consequently, composition is inherently commutative, idempotent, and associative, structurally eliminating the multiple-inheritance linearization problem. Its semantics is first-order~\cite{vanemden1976-predicate-logic-semantics, reynolds1972-definitional-interpreters, aczel1977-inductive-definitions}, denotational, and computable by tabling~\cite{tamaki1986-tabled-resolution}, even for cyclic inheritance hierarchies. These three properties extend to the $λ$-calculus, since Böhm tree equivalence~\cite{barendregt1984-lambda-calculus} is fully abstract for the first-iteration approximation of a sublanguage of inheritance-calculus. As a corollary, this establishes a convergence hierarchy $\text{eager} \subsetneq \text{lazy} \subsetneq \text{fixpoint}$ among $λ$-calculi sharing the same $λ$-syntax. Inheritance-calculus is distilled from MIXINv2, a practical implementation in which the same code acts as different function colors~\cite{nystrom2015-function-color}; ordinary arithmetic yields the relational semantics of logic programming~\cite{vanemden1976-predicate-logic-semantics}; $\mathtt{this}$ resolves to multiple targets; and programs are immune to nonextensibility in the sense of the Expression Problem~\cite{wadler1998-expression-problem}. This makes inheritance-calculus strictly more expressive than the $λ$-calculus in both common sense and Felleisen's sense~\cite{felleisen1991-expressive-power}.
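
The claim that modelling inheritance as set union makes composition commutative, idempotent, and associative can be checked on a toy model. The sketch below represents a record as a frozenset of (name, value) definitions; this is a deliberate simplification for illustration, not the calculus or the MIXINv2 implementation.

```python
def inherit(*records):
    """Compose records by set union (toy model: a record is a set of definitions)."""
    return frozenset().union(*records)


a = frozenset({("x", 1)})
b = frozenset({("y", 2)})
c = frozenset({("x", 3)})

# Commutative: the order of parents does not matter.
assert inherit(a, b) == inherit(b, a)

# Idempotent: inheriting from the same record twice changes nothing.
assert inherit(a, a) == a

# Associative: grouping does not matter.
assert inherit(inherit(a, b), c) == inherit(a, inherit(b, c))

# In this toy model both definitions of "x" survive; no parent overrides another.
combined = inherit(a, c)
print(sorted(combined))
```

Because no parent takes precedence over another, there is no ordering to compute, which is the sense in which a linearization step becomes unnecessary in this simplified picture.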

2026-02-18T09:17:20Z Bo Yang

http://arxiv.org/abs/2510.24358v2 Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation 2026-03-16T07:22:35Z

Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating high-quality project-level evaluation datasets requires extensive domain expertise, leading to prohibitive annotation costs and limited diversity. Second, while recent Agent-as-a-Judge paradigms address the rigidity of traditional unit tests by enabling flexible metrics, their reliance on In-Context Learning (ICL) with general LLMs often results in inaccurate assessments that misalign with human standards. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks. Based on this, we introduce PRDBench, comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Documents (PRDs) and comprehensive criteria. Furthermore, to overcome the inaccuracy of general LLM judges, we propose a highly reliable evaluation framework powered by a specialized, fine-tuned model. Based on Qwen3-Coder-30B, our dedicated PRDJudge achieves over 90% human alignment in fixed-interface scenarios. Extensive experiments demonstrate that our suite provides a scalable, robust, and highly accurate framework for assessing state-of-the-art code agents.

2025-10-28T12:26:45Z Lingyue Fu Bolun Zhang Hao Guan Yaoming Zhu Lin Qiu Weiwen Liu Xuezhi Cao Xunliang Cai Weinan Zhang Yong Yu