https://arxiv.org/api/YQnzckcY8UffrH8BNMsHPEgdwfo
2026-03-22T17:17:27Z
2530210515

http://arxiv.org/abs/2603.15375v1
Formalizing and validating properties in Asmeta with Large Language Models (Extended Abstract)
2026-03-16T14:50:39Z
Writing temporal logic properties is often a challenging task for users of model-based development frameworks, particularly when translating informal requirements into formal specifications. In this paper, we explore the idea of integrating Large Language Models (LLMs) into the Asmeta framework to assist users during the definition, formalization, explanation, and validation of temporal properties. We present a workflow in which an LLM-based agent supports these activities by leveraging the Asmeta specification and the feedback produced by the model checker. This work serves as a proof of concept that illustrates the feasibility and potential benefits of such an integration through representative examples.
2026-03-16T14:50:39Z
Andrea Bombarda, Silvia Bonfanti, Angelo Gargantini, Nico Pellegrinelli

http://arxiv.org/abs/2603.15372v1
SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations
2026-03-16T14:48:53Z
As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions.
We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
2026-03-16T14:48:53Z
Ivo Brett

http://arxiv.org/abs/2603.15366v1
To be FAIR or RIGHT? Methodological [R]esearch [I]ntegrity [G]iven [H]uman-facing [T]echnologies using the example of Learning Technologies
2026-03-16T14:41:05Z
Quality assessment of Research Software Engineering (RSE) plays an important role in all scientific fields. From the canonical three criteria (reliability, validity, and
objectivity) previous research has focussed on reliability and the FAIR principles. The RIGHT framework is introduced to fill the gap of existing
frameworks for the validity aspect. The framework is constructed using the methods of theory transfer and process modelling. It is based on existing models of
simulation research, design-based research, software engineering and empirical social sciences. The paper concludes with two case studies drawn from the field of learning technologies to illustrate the practical relevance
of the framework for human-facing RSE.
2026-03-16T14:41:05Z
Julian Dehne

http://arxiv.org/abs/2603.15707v1
SEMAG: Self-Evolutionary Multi-Agent Code Generation
2026-03-16T13:24:55Z
Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self-Evolutionary Multi-Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self-evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state-of-the-art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self-evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.
2026-03-16T13:24:55Z
Yulin Peng, Haowen Hou, Xinxin Zhu, Ying Tiffany He, F. Richard Yu

http://arxiv.org/abs/2602.08561v2
Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches
2026-03-16T13:09:04Z
Reproducing computational research is often assumed to be as simple as rerunning the original code with provided data. In practice, missing packages, fragile file paths, version conflicts, or incomplete logic frequently cause analyses to fail, even when materials are shared. This study investigates whether large language models and AI agents can automate the diagnosis and repair of such failures, making computational results easier to reproduce and verify.
We evaluate this using a controlled reproducibility testbed built from five fully reproducible R-based social science studies. Realistic failures were injected, ranging from simple issues to complex missing logic, and two automated repair workflows were tested in clean Docker environments. The first workflow is prompt-based, repeatedly querying language models with structured prompts of varying context, while the second uses agent-based systems that inspect files, modify code, and rerun analyses autonomously. Across prompt-based runs, reproduction success ranged from 31-79 percent, with performance strongly influenced by prompt context and error complexity. Complex cases benefited most from additional context. Agent-based workflows performed substantially better, with success rates of 69-96 percent across all complexity levels. These results suggest that automated workflows, especially agent-based systems, can significantly reduce manual effort and improve reproduction success across diverse error types. Unlike prior benchmarks, our testbed isolates post-publication repair under controlled failure modes, allowing direct comparison of prompt-based and agent-based approaches.
2026-02-09T11:59:59Z
12 pages, 5 figures. Submitted to ACM conference
Syed Mehtab Hussain Shah, Frank Hopfgartner, Arnim Bleier

http://arxiv.org/abs/2603.13023v2
daVinci-Env: Open SWE Environment Synthesis at Scale
2026-03-16T11:55:18Z
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups.
We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
2026-03-13T14:32:40Z
Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu

http://arxiv.org/abs/2603.15087v1
Beyond Monolithic Models: Symbolic Seams for Composable Neuro-Symbolic Architectures
2026-03-16T10:41:01Z
Current Artificial Intelligence (AI) systems are frequently built around monolithic models that entangle perception, reasoning, and decision-making, a design that often conflicts with established software architecture principles.
Large Language Models (LLMs) amplify this tendency, offering scale but limited transparency and adaptability. To address this, we argue for composability as a guiding principle that treats AI as a living architecture rather than a fixed artifact. We introduce symbolic seams: explicit architectural breakpoints where a system commits to inspectable, typed boundary objects, versioned constraint bundles, and decision traces. We describe how seams enable a composable neuro-symbolic design that combines the data-driven adaptability of learned components with the verifiability of explicit symbolic constraints -- combining strengths neither paradigm achieves alone. By treating AI systems as assemblies of interchangeable parts rather than indivisible wholes, we outline a direction for intelligent systems that are extensible, transparent, and amenable to principled evolution.
2026-03-16T10:41:01Z
Submitted to New and Emerging Ideas (NEMI) track at ICSA 2026
Nicolas Schuler, Vincenzo Scotti, Raffaela Mirandola

http://arxiv.org/abs/2305.13883v3
Leveraging Imperfect Sources to Detect Fairwashing in Black-Box Auditing
2026-03-16T10:30:53Z
Algorithmic auditing has become central to platform accountability under frameworks such as the AI Act and the Digital Services Act. In practice, this obligation is discharged through dedicated Audit APIs. This architecture creates a paradox: the entity under scrutiny controls the evaluation interface. A platform facing legal sanctions can serve a compliant surrogate model on its Audit API, while running a discriminatory production system. This deceptive practice is known as fairwashing. Manipulation is undetectable if the auditor relies on only one source. To address this limitation, we introduce the Two-Source Audit Model (2SAM). This model cross-references the Audit API with an independent trusted stream. The key insight is that the trusted stream does not need to be perfectly aligned with the Audit API.
We introduce a consistency proxy, a probabilistic mapping that can reconcile discrepancies between sources. This approach yields three results. First, we quantify the rate of manipulation above which a single-source auditor is blind. Second, we show how proxy quality governs detection power. Third, we provide a closed-form budget condition guaranteeing detection at any target confidence level, closing the blind spot mentioned above. We validate 2SAM on the UCI Adult dataset, achieving $70\%$ detection power with as few as $127$ cross-verification queries out of a total budget of $750$, using a name-based gender proxy with $94.2\%$ accuracy.
2023-05-23T10:06:22Z
23 pages, 10 figures
Jade Garcia Bourrée, Erwan Le Merrer, Gilles Tredan, Benoît Rottembourg

http://arxiv.org/abs/2603.10969v2
TOSSS: a CVE-based Software Security Benchmark for Large Language Models
2026-03-16T10:16:20Z
With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts.
We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.
2026-03-11T16:54:01Z
Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen

http://arxiv.org/abs/2603.10621v2
QuantumX: an experience for the consolidation of Quantum Computing and Quantum Software Engineering as an emerging discipline
2026-03-16T09:40:45Z
The first edition of the QuantumX track, held within the XXIX Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2025), brought together leading Spanish research groups working at the intersection of Quantum Computing and Software Engineering. The event served as a pioneering forum to explore how principles of software quality, governance, testing, orchestration, and abstraction can be adapted to the quantum paradigm. The presented works spanned diverse areas (from quantum service engineering and hybrid architectures to quality models, circuit optimization, and quantum machine learning), reflecting the interdisciplinary nature and growing maturity of Quantum Computing and Quantum Software Engineering.
The track also fostered community building and collaboration through the presentation of national and Ibero-American research networks such as RIPAISC and QSpain, and through dedicated networking sessions that encouraged joint initiatives. Beyond reporting on the event, this article provides a structured synthesis of the contributions presented at QuantumX, identifies common research themes and engineering concerns, and outlines a set of open challenges and future directions for the advancement of Quantum Software Engineering. This first QuantumX track established the foundation for a sustained research community and positioned Spain as an emerging contributor to the European and global quantum software ecosystem.
2026-03-11T10:33:14Z
16 pages
Juan M. Murillo, Ignacio García Rodríguez de Guzmán, Enrique Moguel, Javier Romero-Álvarez, Jaime Alvarado-Valiente, Álvaro M. Aparicio-Morales, Jose Garcia-Alonso, Ana Díaz Muñoz, Eduardo Fernández-Medina, Francisco Chicano, Carlos Canal, José Daniel Viqueira, Sebastián Villarroya, Eduardo Gutiérrez, Adrián Romero-Flores, Alfonso E. Márquez-Chamorro, Antonio Ruiz-Cortes, Cyrille YetuYetu Kesiku, Pedro Sánchez, Diego Alonso Cáceres, Lidia Sánchez-González, Fernando Plou

http://arxiv.org/abs/2603.15021v1
Describing Agentic AI Systems with C4: Lessons from Industry Projects
2026-03-16T09:23:27Z
Different domains foster different architectural styles -- and thus different documentation practices (e.g., state-based models for behavioral control vs. ER-style models for information structures). Agentic AI systems exhibit another characteristic style: specialized agents collaborate by exchanging artifacts, invoking external tools, and coordinating via recurring interaction patterns and quality gates. As these systems evolve into long-lived industrial solutions, documentation must capture these style-defining concerns rather than relying on ad-hoc code sketches or pipeline drawings.
This paper reports industrial experience from joint projects and derives a documentation systematics tailored to this style. Concretely, we provide (i) a style-oriented modeling vocabulary and a small set of views for agents, artifacts, tools, and their coordination patterns, (ii) a hierarchical description technique aligned with C4 to structure these views across abstraction levels, and (iii) industrial examples with lessons learned that demonstrate how the approach yields transparent, maintainable architecture documentation supporting sustained evolution.
2026-03-16T09:23:27Z
Andreas Rausch, Stefan Wittek

http://arxiv.org/abs/2603.15004v1
TriFusion-LLM: Prior-Guided Multimodal Fusion with LLM Arbitration for Fine-grained Code Clone Detection
2026-03-16T09:08:07Z
Code clone detection (CCD) supports software maintenance, refactoring, and security analysis. Although pre-trained models capture code semantics, most work reduces CCD to binary classification, overlooking the heterogeneity of clone types and the seven fine-grained categories in BigCloneBench. We present Full Model, a multimodal fusion framework that jointly integrates heuristic similarity priors from classical machine learning, structural signals from abstract syntax trees (ASTs), and deep semantic embeddings from CodeBERT into a single predictor. By fusing structural, statistical, and semantic representations, Full Model improves discrimination among fine-grained clone types while keeping inference cost practical. On the seven-class BigCloneBench benchmark, Full Model raises Macro-F1 from 0.695 to 0.875. Ablation studies show that using the primary model's probability distribution as a prior to guide selective arbitration by a large language model (LLM) substantially outperforms blind reclassification; arbitrating only ~0.2% of high-uncertainty samples yields an additional 0.3 absolute Macro-F1 gain.
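The selective arbitration idea in the TriFusion-LLM abstract above can be illustrated with a small sketch. The abstract does not say how uncertainty is measured or how the prior reaches the LLM, so everything here is an assumption: uncertainty is taken to be the Shannon entropy of the primary model's class distribution, the prior restricts the arbiter to the two most probable classes, and `toy_arbiter` is a hypothetical stand-in for an LLM call. All function names are illustrative and not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_arbitration(distributions, fraction):
    """Indices of the most uncertain samples (assumed: highest entropy)."""
    k = max(1, math.ceil(fraction * len(distributions)))
    ranked = sorted(range(len(distributions)),
                    key=lambda i: entropy(distributions[i]), reverse=True)
    return sorted(ranked[:k])

def arbitrate(distributions, fraction, arbiter):
    """Keep the primary model's argmax, except for the selected samples,
    where the arbiter decides with the distribution passed in as a prior."""
    chosen = set(select_for_arbitration(distributions, fraction))
    labels = []
    for i, probs in enumerate(distributions):
        if i in chosen:
            labels.append(arbiter(i, probs))
        else:
            labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels

def toy_arbiter(index, prior):
    """Hypothetical LLM stand-in: picks among the prior's top-two classes."""
    candidates = sorted(range(len(prior)), key=prior.__getitem__, reverse=True)[:2]
    llm_preference = [1, 0, 2]  # made-up ranking an LLM might return
    return next(c for c in llm_preference if c in candidates)

# Toy primary-model outputs over three classes (illustrative numbers only).
dists = [
    [0.97, 0.02, 0.01],
    [0.40, 0.38, 0.22],  # the only high-uncertainty sample
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
]
selected = select_for_arbitration(dists, 0.25)
labels = arbitrate(dists, 0.25, toy_arbiter)
print(selected, labels)
```

Only the sample with the flattest distribution is sent to the arbiter; the other predictions are left untouched, which is what keeps the number of LLM calls small in a scheme of this kind.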
Overall, Full Model achieves an effective performance-cost trade-off for fine-grained CCD and offers a practical solution for large-scale industrial deployment.
2026-03-16T09:08:07Z
Mengdi Li, Yuming Liu, He Wang, Zifeng Xu, Yuqing Zhang

http://arxiv.org/abs/2603.15699v1
This Is Taking Too Long -- Investigating Time as a Proxy for Energy Consumption of LLMs
2026-03-16T08:26:57Z
The energy consumption of Large Language Models (LLMs) is raising growing concerns due to their adverse effects on environmental stability and resource use. Yet, these energy costs remain largely opaque to users, especially when models are accessed through an API -- a black box in which all information depends on what providers choose to disclose. In this work, we investigate inference time measurements as a proxy to approximate the associated energy costs of API-based LLMs. We ground our approach by comparing our estimations with actual energy measurements from locally hosted equivalents. Our results show that time measurements allow us to infer GPU models for API-based LLMs, grounding our energy cost estimations. Our work aims to create means for understanding the associated energy costs of API-based LLMs, especially for end users.
2026-03-16T08:26:57Z
This work was accepted at PerCom 2026
Lars Krupp, Daniel Geißler, Francisco M. Calatrava-Nicolas, Vishal Banwari, Paul Lukowicz, Jakob Karolus

http://arxiv.org/abs/2602.16291v7
A Calculus of Inheritance
2026-03-16T07:35:47Z
Just as the $λ$-calculus uses three primitives (abstraction, application, variable) as the foundation of functional programming, inheritance-calculus uses three primitives (record, definition, inheritance) as the foundation of declarative programming. By unifying modules, classes, objects, methods, fields, and locals under a single record abstraction, the calculus models inheritance simply as set union.
Consequently, composition is inherently commutative, idempotent, and associative, structurally eliminating the multiple-inheritance linearization problem. Its semantics is first-order~\cite{vanemden1976-predicate-logic-semantics, reynolds1972-definitional-interpreters, aczel1977-inductive-definitions}, denotational, and computable by tabling~\cite{tamaki1986-tabled-resolution}, even for cyclic inheritance hierarchies. These three properties extend to the $λ$-calculus, since Böhm tree equivalence~\cite{barendregt1984-lambda-calculus} is fully abstract for the first-iteration approximation of a sublanguage of inheritance-calculus. As a corollary, this establishes a convergence hierarchy $\text{eager} \subsetneq \text{lazy} \subsetneq \text{fixpoint}$ among $λ$-calculi sharing the same $λ$-syntax.
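The claim above, that modelling inheritance as set union makes composition commutative, idempotent, and associative, can be checked on a toy model. This is a minimal sketch and not the calculus itself: it assumes a record is just a frozen set of `(name, value)` definitions, and `inherit` is a hypothetical helper name.

```python
from itertools import permutations

def inherit(*records):
    """Compose records by plain set union (toy stand-in for inheritance)."""
    return frozenset().union(*records)

# Hypothetical records: each is a set of (name, value) definitions.
a = frozenset({("x", 1), ("greet", "hello")})
b = frozenset({("y", 2), ("greet", "hi")})
c = frozenset({("z", 3)})

# The three algebraic properties named in the abstract.
assert inherit(a, b) == inherit(b, a)                          # commutative
assert inherit(a, a) == a                                      # idempotent
assert inherit(inherit(a, b), c) == inherit(a, inherit(b, c))  # associative

# Every ordering of the parents yields the same record, so no
# linearization order has to be chosen.
results = {inherit(*order) for order in permutations([a, b, c])}
print(len(results))
```

In this toy model two parents that both define `greet` contribute both definitions to the result instead of one overriding the other, which is why the order of composition cannot matter. How the actual calculus treats such overlaps is not something this sketch tries to capture.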
Inheritance-calculus is distilled from MIXINv2, a practical implementation in which the same code acts as different function colors~\cite{nystrom2015-function-color}; ordinary arithmetic yields the relational semantics of logic programming~\cite{vanemden1976-predicate-logic-semantics}; $\mathtt{this}$ resolves to multiple targets; and programs are immune to nonextensibility in the sense of the Expression Problem~\cite{wadler1998-expression-problem}. This makes inheritance-calculus strictly more expressive than the $λ$-calculus in both common sense and Felleisen's sense~\cite{felleisen1991-expressive-power}.
2026-02-18T09:17:20Z
Bo Yang

http://arxiv.org/abs/2510.24358v2
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
2026-03-16T07:22:35Z
Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating high-quality project-level evaluation datasets requires extensive domain expertise, leading to prohibitive annotation costs and limited diversity. Second, while recent Agent-as-a-Judge paradigms address the rigidity of traditional unit tests by enabling flexible metrics, their reliance on In-Context Learning (ICL) with general LLMs often results in inaccurate assessments that misalign with human standards. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks. Based on this, we introduce PRDBench, comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Documents (PRDs) and comprehensive criteria. Furthermore, to overcome the inaccuracy of general LLM judges, we propose a highly reliable evaluation framework powered by a specialized, fine-tuned model.
Based on Qwen3-Coder-30B, our dedicated PRDJudge achieves over 90% human alignment in fixed-interface scenarios. Extensive experiments demonstrate that our suite provides a scalable, robust, and highly accurate framework for assessing state-of-the-art code agents.
2025-10-28T12:26:45Z
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu