https://arxiv.org/api/MZp/KpD10HcPte6kxziA4jl0BrQ2026-06-14T05:16:18Z988525515http://arxiv.org/abs/2604.27000v1Adaptive and AI-Augmented Security Testing: A Systematic Survey of Program Analysis, Feedback-Driven Testing, and Hybrid Learning-Based Approaches2026-04-29T03:32:59ZModern software systems are increasingly developed within rapid continuous integration and deployment (CI/CD) pipelines, where ensuring security prior to release presents significant technical and organizational challenges. Traditional static and dynamic analysis tools provide valuable structural and behavioral insights, yet they often operate in non-adaptive workflows and produce large volumes of warnings requiring manual triage. Feedback-driven fuzzing and search-based testing approaches have demonstrated the power of iterative input refinement guided by execution signals, while large language models (LLMs) have shown promise in automated test generation but frequently lack semantic grounding in program structure. This paper presents a systematic survey of adaptive and AI-augmented security testing research across five domains: (1) structural program analysis for vulnerability detection, (2) DevSecOps and continuous security testing, (3) feedback-driven fuzzing and search-based testing, (4) LLM-based automated test generation, and (5) emerging hybrid systems integrating program analysis with adaptive learning. We analyze fifty-five peer-reviewed studies drawn from a systematic search of four major databases yielding 22,088 raw records. Our analysis reveals a persistent disconnect between structural program representations (ASTs, CFGs, and CPGs) and adaptive testing mechanisms. We characterize this as structural-adaptive fragmentation: a systematic separation that neither paradigm individually addresses. No existing system incorporates human triage signals as feedback for refining structural models. We conclude by identifying five open research challenges and outlining a unified agenda for semantically grounded, feedback-driven, polyglot security testing frameworks.2026-04-29T03:32:59Z29 pages, submitted for reviewMichael Wienczkowskihttp://arxiv.org/abs/2512.22417v2Open-World Assertion Checking for Smart Contracts via Game Semantics2026-04-28T23:42:11ZWe present a game semantics framework for open-world safety analysis of Ethereum smart contracts. We model the interaction between a contract and its environment as a two-player game between the contract and the environment, and prove up to gas model approximations soundness: every assertion violation found corresponds to a real execution; and completeness: every open-world execution is captured. To our knowledge, this provides the first formal open-world interaction semantics for Ethereum smart contracts with mathematical guarantees of soundness and completeness.
We implement this framework in YulTracer, an assertion reachability tool for real-world Solidity contracts, built on Yul, the intermediate language of the Solidity compiler. YulTracer uses concrete execution and exhaustively explores game traces within user-specified bounds. We evaluate it on reentrancy benchmarks, where YulTracer achieves 100% recall and precision -- the only tool to do so from those we examined -- and on two large real-world exploits (the DAO and PredyPool), where it detects the known vulnerabilities and produces no false positives on fixed versions. To our knowledge, YulTracer is the first tool to achieve this level of precision on real-world contracts without false positives. We additionally demonstrate generality of the approach via the examination of access control benchmarks.2025-12-27T00:21:42Z49 pages, 8 figures, 6 tables, 3 listingsVasileios KoutavasYu-Yang LinNikos Tzevelekoshttp://arxiv.org/abs/2604.26161v1Finite Functional Programming2026-04-28T22:49:28ZWe unify functional and logic programming by treating predicatesas functions equipped with their support: the set of inputs whose output is nonzero. Datalog, for instance, is a language of finitely supported boolean functions. Finite support allows representing functions as input-output tables. Generalizing from boolean functions to other pointed sets neatly handles aggregation and weighted logic programming. We refer to the combination of finitely supported functions, represented as data, with higher order functions, represented as code, as finite functional programming. We give a simple type system to check finite support, using graded effects to check variable grounding and relevance types to model pointed sets.2026-04-28T22:49:28Z16 pages, 6 figuresMichael ArntzeniusMax Willseyhttp://arxiv.org/abs/2604.23035v2Remote Concolic Multiverse Debugging -- Extended Version with Additional Appendices2026-04-28T17:59:31ZDebugging nondeterministic programs is inherently difficult, particularly in microcontroller environments where execution paths can diverge unpredictably due to external sensor inputs. Traditional debugging techniques often fail to capture or reproduce this nondeterministic behavior effectively. Multiverse debugging has emerged as a compelling technique to debug nondeterministic programs, allowing developers to systematically explore all possible execution paths. Unfortunately, current multiverse debuggers are snapshot-based and most operate over a model of the program, limiting their use for debugging resource-constrained microcontrollers. Additionally, current multiverse debuggers, even ones specifically designed for microcontrollers suffer from state explosion making the state space overwhelming during debugging.
To address these challenges, we introduce a trace-based multiverse debugger with a novel state-space reduction technique based on concolic execution. Our approach interleaves concolic analysis with live debugging to identify input values that define unique program paths. This hybrid technique efficiently prunes redundant paths from the state space while ensuring full code coverage. Unlike MIO, a recently published multiverse debugger for microcontrollers that focuses on IO consistency, our approach directly targets state explosion by leveraging concolic execution and uses a trace-based approach, significantly reducing the memory and communication overhead.
We implemented a prototype using the WARDuino WebAssembly VM, demonstrating the feasibility and efficiency of our approach in real-world scenarios. Our results highlight substantial reductions in the state space compared to traditional multiverse debugging. This makes multiverse debugging more accessible and efficient for developers working with complex, nondeterministic programs running on microcontrollers.2026-04-24T21:48:10ZMaarten SteevensTom LauwaertsChristophe Scholliershttp://arxiv.org/abs/2210.13387v5Towards a Higher-Order Mathematical Operational Semantics2026-04-28T15:38:43ZCompositionality proofs in higher-order languages are notoriously involved, and general semantic frameworks guaranteeing compositionality are hard to come by. In particular, Turi and Plotkin's bialgebraic abstract GSOS framework, which has been successfully applied to obtain off-the-shelf compositionality results for first-order languages, so far does not apply to higher-order languages. In the present work, we develop a theory of abstract GSOS specifications for higher-order languages, in effect transferring the core principles of Turi and Plotkin's framework to a higher-order setting. In our theory, the operational semantics of higher-order languages is represented by certain dinatural transformations that we term pointed higher-order GSOS laws. We give a general compositionality result that applies to all systems specified in this way and discuss how compositionality of the SKI calculus and the $λ$-calculus w.r.t. a strong variant of Abramsky's applicative bisimilarity are obtained as instances.2022-10-24T16:38:55ZThis version restores arXiv:2210.13387v2. arXiv:2210.13387v3 was a mistaken replacement, which was withdrawn. The replacement belonged to another submission, namely arXiv:2405.16708Sergey GoncharovStefan MiliusLutz SchröderStelios TsampasHenning Urbathttp://arxiv.org/abs/2604.25478v1Practical Insights into Fair Comparison and Evaluation Frame for Neutral-Atom Compilers2026-04-28T10:30:25ZNeutral-atom quantum computing is among the most promising platforms for scalable quantum computation, and compilation toolchains are crucial for leveraging capabilities such as qubit shuttling and parallel gate execution. An important challenge, however, is that existing neutral-atom compilers are often evaluated using metrics computed over different parts of the toolchain and under non-equivalent assumptions. Consequently, fair quantification and comparison of compiler performance remain difficult. Reported metrics may depend on inconsistent transpilation optimization levels, different movement-duration models, different sets of considered fidelity sources, and even minor implementation bugs or undocumented representation choices. To address this problem, we present a unified and reproducible evaluation framework for neutral-atom compilers. Our framework introduces RSQASM (Routed and Scheduled QASM), a QASM-inspired post-compilation representation that captures mapped, routed, and scheduled circuits, including explicit parallel gate execution and shuttling operations. As part of the framework, we provide adapter scripts that translate existing compiler outputs and intermediate artifacts into RSQASM. As a case study, we compare three well-known neutral-atom compilation toolchains: HybridMapper, DasAtom, and Enola, motivated by the large performance differences reported in prior work. Using our framework and representation, we perform a new evaluation and show that several previously claimed performance gaps become substantially smaller and, in some cases, are not reproduced once evaluation inconsistencies are removed.2026-04-28T10:30:25Z11 pages, 2 figures, submitted to IEEE Quantum Week 2026Emil KhusainovYanbin ChenJonas WinklmannHelmut SeidlChristian B. Mendlhttp://arxiv.org/abs/2605.00034v1Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets2026-04-28T01:27:09ZThis paper presents a system combining symbolic execution (KLEE) with a 4-agent multi-LLM architecture for detecting memory vulnerabilities in Rust unsafe code. A central challenge we address is the incomplete-code problem: CVE database entries provide only isolated code snippets that lack struct definitions, imports, and Cargo manifests, causing all existing formal verification tools to fail at compilation with zero output. Our system resolves this through four specialized agents -- an Oracle/Validator for strategic planning, a Safety Checker for vulnerability analysis, a Code Specialist for FFI wrapper generation, and a Fast Filter for execution optimization -- that collaboratively synthesize KLEE-compatible harnesses from otherwise uncompilable fragments. KLEE's output is then ingested by graph_klee.py, which constructs a Graph Database linking CVE files, CWE categories, error types, and symbolic execution paths as typed nodes and labelled edges, enabling structured cross-CVE vulnerability queries. We evaluated our system on 31 real-world Rust CVEs spanning 11 CWE categories, achieving 90.3% wrapper compilation success where all state-of-the-art formal verification tools achieve 0%. Our system detected 1,206 critical errors across 26 files (83.9% detection rate), compared to 14 warnings across 11 files for Clippy (35.5%) and generic labels for Miri. The 4-agent architecture reduced wrapper compilation failures from 42% (single-agent baseline) to 9.7% and increased detected errors from 487 to 1,206, confirming that role specialization and structured context passing produce measurably better results than a single general-purpose model. Our replication package is publicly available at https://github.com/Zeyad-Ab/Symbolic-Execution-with-Multi-LLM-Architecture-for-Rust-Security2026-04-28T01:27:09Z11 pages, 1 figure, to be published in : Ease 2026 The 6th International Workshop on Software Security EngineeringZeyad AbdelrazekYoung Leehttp://arxiv.org/abs/2604.25960v1Large Language Models for Multilingual Code Intelligence: A Survey2026-04-27T20:20:26ZLarge language models have transformed AI-assisted software engineering, but current research remains biased toward high-resource languages such as Python, with weaker performance in languages like Rust and OCaml. Since real-world systems are inherently polyglot, robust multilingual code intelligence is crucial. This survey focuses on two key tasks: multilingual code generation from shared natural-language requirements, and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics, and highlights challenges and opportunities for trustworthy cross-language generalization.2026-04-27T20:20:26ZChao JiangDugang LiuCheng WenZhiwu XuHua ZhengMuhammad SadiqJawwad Ahmed ShamsiShengchao QinZhong Minghttp://arxiv.org/abs/2603.16437v6Dimensional Type Systems and Deterministic Memory Management: Design-Time Semantic Preservation in Native Compilation2026-04-27T17:48:52ZWe present a compilation framework in which dimensional type annotations persist through multi-stage MLIR lowering, enabling the compiler to jointly resolve numeric representation selection and deterministic memory management as coeffect properties of a single program semantic graph (PSG). Dimensional inference determines value ranges, which in turn determine representation selection, word width, memory footprint, allocation strategy, and cross-target transfer fidelity. The Dimensional Type System (DTS) extends Hindley-Milner unification with constraints drawn from finitely generated abelian groups, yielding inference that is decidable in polynomial time, complete (no annotations required), and principal. Where conventional systems erase dimensional annotations before code generation, DTS carries them as compilation metadata through each lowering stage, making them available where representation and memory placement decisions occur. Deterministic Memory Management (DMM), formalized as a coeffect discipline within the same graph, classifies value lifetimes into four escape categories, each mapping to a verified allocation strategy. The dimensional algebra is closed under the chain rule, and forward-mode gradient computation exhibits a specific coeffect signature that the framework can verify at compile time. The practical consequence is a development environment where escape diagnostics, allocation strategy, representation fidelity, and cache locality estimation are design-time views over the compilation graph.2026-03-17T12:19:07Z33 pages, 8 tables, 3 appendices with extended examplesHouston Hayneshttp://arxiv.org/abs/2604.13351v3Optimal Predicate Pushdown Synthesis2026-04-27T16:20:12ZPredicate pushdown is a long-standing performance optimization that filters data as early as possible in a computational workflow. In modern data pipelines, this transformation is especially important because much of the computation occurs inside user-defined functions (UDFs) written in general-purpose languages such as Python and Scala. These UDFs capture rich domain logic and complex aggregations and are among the most expensive operations in a pipeline. Moving filters ahead of such UDFs can yield substantial performance gains, but doing so requires semantic reasoning. This paper introduces a general semantic foundation for predicate pushdown over stateful fold-based computations.
We view pushdown as a correspondence between two programs that process different subsets of input data, with correctness witnessed by a bisimulation invariant relating their internal states. Building on this foundation, we develop a sound and relatively complete framework for verification, alongside a synthesis algorithm that automatically constructs optimal pushdown decompositions by finding the strongest admissible pre-filters and weakest residual post-filters. We implement this approach in a tool called Pusharoo and evaluate it on 150 real-world pandas and Spark data-processing pipelines. Our evaluation shows that Pusharoo is significantly more expressive than prior work, producing optimal pushdown transformations with a median synthesis time of 1.6 seconds per benchmark. Furthermore, our experiments demonstrate that the discovered pushdown optimizations speed up end-to-end execution by an average of 2.4$\times$ and up to two orders of magnitude.2026-04-14T23:22:34Z45 pages, 9 figures, 7 tables. Extended version (with appendices) of the PLDI 2026 paperProc. ACM Program. Lang. 10, PLDI, Article 234 (June 2026), 25 pagesRobert ZhangEric Hayden CampbellDixin TangIsil Dillig10.1145/3808312http://arxiv.org/abs/2604.24578v1Hybrid Path-Sums for Hybrid Quantum Programs2026-04-27T15:05:17ZAs quantum computing becomes an emerging reality, designing efficient quantum programming capabilities is becoming more and more important. Particularly, the debugging and validation of quantum programs is of paramount importance, as these programs are by definition hard to test. Static analysis and formal verification methods for quantum programs started to emerge a few years now, yet they often miss hybrid quantum/classical reasoning facilities with, e.g., generic quantum control, classical control and classical computation instructions. In this paper, we lay out the foundations of a framework for the automated formal verification of (full) hybrid quantum programs featuring both classical and quantum control, measurement and hybrid data structures. In particular, we propose: (1) a novel symbolic representation for describing and manipulating sets of hybrid quantum/classical states called Hybrid Path-Sums (HPS); (2) a set of rewriting rules providing a rich mechanism for simplifying and reasoning on these symbolic hybrid states, and (3) a core assertion language to specify equivalence of hybrid quantum programs, the satisfaction of properties on (parts of) hybrid states, and the extraction of probabilistic statements about the program behavior. We prove the correctness of the novel symbolic representation, of its rewriting system and of the specification system. Finally, we propose a full implementation of this framework as a dedicated symbolic execution engine for hybrid programs. We present an evaluation of a set of representative hybrid case-studies from the literature, showcasing the advantage of our approach and its efficiency compared to state-of-the-art solutions.2026-04-27T15:05:17ZChristophe CharetonJad IssaMathieu NguyenNicolas BlancoSébastien Bardin10.1145/3808314http://arxiv.org/abs/2604.19906v2Demonstrating a Future for MLIR-native DSL Compilers on a NumPy-like Example2026-04-27T12:23:59ZCompilers for general-purpose languages have been shown to be at a disadvantage when it comes to specialized application domains as opposed to their Domain-Specific Language (DSL) counterparts. However, the field of DSL compilers features little consolidation in terms of compiler frameworks and adjacent software ecosystems. As a result, considerable work is duplicated, lost to maintenance issues, or remains undiscovered, and most DSLs are never considered "production-ready". One notable development is the introduction of the Multi-Level Intermediate Representation (MLIR), which promises a similar impact on DSL compilers as LLVM had on general-purpose tooling.
In this work, we present a NumPy-like DSL made for offloading numeric tensor kernels that is entirely MLIR-native. In a first for open-source, it implements all frontend actions and semantic analyses directly within MLIR. Most notably, this is made possible by our new dialect-agnostic MLIR type checker, created for the future of DSLs in MLIR. We implement a simple, yet effective, parallel-first lowering scheme that connects our language to another MLIR dataflow dialect for seamless offloading. We show that our approach performs well in real-world use cases from the domain of weather modeling and Computational Fluid Dynamics (CFD) in Fortran.2026-04-21T18:30:11Z17 pages, 5 figuresKarl F. A. FriebelJascha A. OhlmannJeronimo Castrillonhttp://arxiv.org/abs/2511.09447v2SpaDA: A Spatial Dataflow Architecture Programming Language2026-04-27T09:15:08ZSpatial dataflow architectures like the Cerebras Wafer-Scale Engine deliver exceptional performance in AI and scientific computing by distributing scratchpad memory across hundreds of thousands of processing elements (PEs). Yet programming these architectures remains difficult: with no shared memory, data movement requires explicit configuration, and asynchronous task management introduces substantial complexity. We present SpaDA, a programming language that offers precise control over data placement, dataflow patterns, and asynchronous operations while abstracting low-level architectural details. We design and implement a compiler targeting Cerebras CSL through multi-level lowering and unique optimization passes. SpaDA functions as a high-level programming interface and an intermediate representation for domain-specific languages (DSLs), demonstrated here with the GT4Py stencil DSL. SpaDA enables concise expression of operations with complex parallel patterns -- including pipelined collective operations, multi-dimensional stencils, and dense linear algebra -- in 14.09x fewer lines than CSL, achieving over 260 TFlop/s across 730,000 PEs on a single device.2025-11-12T16:03:52ZLukas GianinazziTal Ben-NunTorsten Hoeflerhttp://arxiv.org/abs/2512.00127v3Generating Verifiable Chain of Thoughts from Exection-Traces2026-04-27T09:10:32ZGetting language models to reason correctly about code requires training on data where each reasoning step can be checked. Current synthetic Chain-of-Thought (CoT) training data often consists of plausible-sounding explanations generated by teacher models, and not verifiable accounts of actual program behavior. Models trained on such data learn logically flawed reasoning patterns despite syntactic correctness. To address this, we build a pipeline that generates execution-trace-verified CoT rationales by instrumenting code to capture traces, narrating them into natural language, and cross-checking each narration against the original trace. We systematically create 54,000 verified, bi-directional rationales that teach models to reason both forward (input$\rightarrow$output) and backward (output$\rightarrow$input). Models fine-tuned on our verified data achieve substantial improvements, with a peak gain of +26.6 on LiveCodeBench-Exec, +22.2 on CruxEval, and +19.5 on HumanEval across our fine-tuned models, demonstrating that verification quality directly determines both reasoning and code generation capabilities. Complete synthesis pipeline is avilable as open-source: https://github.com/IBM/verified-code-cot/2025-11-28T07:43:43ZShailja ThakurVaibhav SaxenaRohan KulkarniShivdeep SinghParameswaran SelvamHima PatelHiroshi Kanayamahttp://arxiv.org/abs/2511.10135v4Contextual Refinement of Higher-Order Concurrent Probabilistic Programs (Extended Version)2026-04-27T08:45:27ZWe present Foxtrot, the first higher-order separation logic for proving contextual refinement of higher-order concurrent probabilistic programs with higher-order local state. From a high level, Foxtrot inherits various concurrency reasoning principles from standard concurrent separation logic, e.g. invariants and ghost resources, and supports advanced probabilistic reasoning principles for reasoning about complex probability distributions induced by concurrent threads, e.g. tape presampling and induction by error amplification. The integration of these strong reasoning principles is highly non-trivial due to the combination of probability and concurrency in the language and the complexity of the Foxtrot model; the soundness of the logic relies on a version of the axiom of choice within the Iris logic, which is not used in earlier work on Iris-based logics. We demonstrate the expressiveness of Foxtrot on a wide range of examples, including the adversarial von Neumann coin and the $\mathsf{randombytes\_uniform}$ function of the Sodium cryptography software library.
All results have been mechanized in the Rocq proof assistant and the Iris separation logic framework.2025-11-13T09:48:37ZUpdated version (fix bibliography)Kwing Hei LiAlejandro AguirreJoseph TassarottiLars Birkedal