https://arxiv.org/api/zRriE1VR5rZHxSVUmNq2JKh6iKo 2026-06-23T11:35:09Z 9934 975 15 http://arxiv.org/abs/2511.09964v1 EnvTrace: Simulation-Based Semantic Evaluation of LLM Code via Execution Trace Alignment -- Demonstrated at Synchrotron Beamlines 2025-11-13T04:52:01Z

Evaluating large language models (LLMs) for instrument control requires methods that go beyond standard, stateless algorithmic benchmarks, since the behavior of physical systems cannot be fully captured by unit tests alone. Here we introduce EnvTrace, a simulation-based method that evaluates execution traces to assess semantic code equivalence. EnvTrace is demonstrated with a beamline control-logic digital twin to facilitate the evaluation of instrument control code, with the digital twin itself also enabling the pre-execution validation of live experiments. Over 30 LLMs were evaluated using trace alignment to generate a multi-faceted score for functional correctness across key behavioral dimensions, showing that many top-tier models can approach human-level performance in rapid control-code generation. This is a first step toward a broader vision where LLMs and digital twins work symbiotically: LLMs providing intuitive control and agentic orchestration, and digital twins offering safe and high-fidelity environments, paving the way towards autonomous embodied AI.

2025-11-13T04:52:01Z Noah van der Vleuten Anthony Flores Shray Mathur Max Rakitin Thomas Hopkins Kevin G. Yager Esther H. R. Tsai http://arxiv.org/abs/2511.05987v2 High-Performance Generation of Constrained Inputs 2025-11-12T12:24:18Z

Language-based testing combines context-free grammar definitions with semantic constraints over grammar elements to generate test inputs. By pairing context-free grammars with constraints, users have the expressiveness of unrestricted grammars while retaining simple structure. However, producing inputs in the presence of such constraints can be challenging. In past approaches, SMT solvers have been found to be very slow at finding string solutions; evolutionary algorithms are faster and more general, but current implementations still struggle with complex constraints that would be required for domains such as compiler testing. In this paper, we present a novel approach for evolutionary language-based testing that improves performance by 3-4 orders of magnitude over the current state of the art, reducing hours of generation and constraint solving time to seconds. We accomplish this by (1) carefully transforming grammar definitions into Rust types and trait implementations, ensuring that the compiler may near-maximally optimize arbitrary operations on arbitrary grammars; and (2) using better evolutionary algorithms that improve the ability of language-based testing to solve complex constraint systems. These performance and algorithmic improvements allow our prototype, FANDANGO-RS, to solve constraints that previous strategies simply cannot handle. We demonstrate this by a case study for a C subset, in which FANDANGO-RS is able to generate 401 diverse, complex, and valid test inputs for a C compiler per minute.

2025-11-08T12:26:55Z Addison Crump Alexi Turcotte José Antonio Zamudio Amaya Andreas Zeller http://arxiv.org/abs/2511.09203v1 Galois Slicing as Automatic Differentiation 2025-11-12T11:07:21Z

Galois slicing is a technique for program slicing for provenance, developed by Perera and collaborators. Galois slicing aims to explain program executions by demonstrating how to track approximations of the input and output forwards and backwards along a particular execution. In this paper, we explore an analogy between Galois slicing and differentiable programming, seeing the implementation of forwards and backwards slicing as a kind of automatic differentiation. Using the CHAD approach to automatic differentiation due to Vákár and collaborators, we reformulate Galois slicing via a categorical semantics. In doing so, we are able to explore extensions of the Galois slicing idea to quantitative interval analysis, and to clarify the implicit choices made in existing instantiations of this approach.

2025-11-12T11:07:21Z Robert Atkey Roly Perera http://arxiv.org/abs/2511.08767v1 Hey Pentti, We Did (More of) It!: A Vector-Symbolic Lisp With Residue Arithmetic 2025-11-11T20:42:56Z

Using Frequency-domain Holographic Reduced Representations (FHRRs), we extend a Vector-Symbolic Architecture (VSA) encoding of Lisp 1.5 with primitives for arithmetic operations using Residue Hyperdimensional Computing (RHC). Encoding a Turing-complete syntax over a high-dimensional vector space increases the expressivity of neural network states, enabling network states to contain arbitrarily structured representations that are inherently interpretable. We discuss the potential applications of the VSA encoding in machine learning tasks, as well as the importance of encoding structured representations and designing neural networks whose behavior is sensitive to the structure of their representations in virtue of attaining more general intelligent agents than exist at present.

2025-11-11T20:42:56Z 11 pages, 1 figure, conference paper at IJCNN 2025 Rome Connor Hanley Eilene Tomkins-Flanaganm Mary Alexandria Kelly http://arxiv.org/abs/2511.08713v1 An MLIR pipeline for offloading Fortran to FPGAs via OpenMP 2025-11-11T19:22:04Z

With the slowing of Moore's Law, heterogeneous computing platforms such as Field Programmable Gate Arrays (FPGAs) have gained increasing interest for accelerating HPC workloads. In this work we present, to the best of our knowledge, the first implementation of selective code offloading to FPGAs via the OpenMP target directive within MLIR. Our approach combines the MLIR OpenMP dialect with a High-Level Synthesis (HLS) dialect to provide a portable compilation flow targeting FPGAs. Unlike prior OpenMP FPGA efforts that rely on custom compilers, by contrast we integrate with MLIR and so support any MLIR-compatible front end, demonstrated here with Flang. Building upon a range of existing MLIR building blocks significantly reduces the effort required and demonstrates the composability benefits of the MLIR ecosystem. Our approach supports manual optimisation of offloaded kernels through standard OpenMP directives, and this work establishes a flexible and extensible path for directive-based FPGA acceleration integrated within the MLIR ecosystem.

2025-11-11T19:22:04Z Author accepted version of paper published in SC25 LLVM workshop Gabriel Rodriguez-Canal David Katz Nick Brown 10.1145/3731599.3767485 http://arxiv.org/abs/2511.08530v1 Can Large Language Models Simulate Symbolic Execution Output Like KLEE? 2025-11-11T18:12:09Z

Symbolic execution helps check programs by exploring different paths based on symbolic inputs. Tools like KLEE are commonly used because they can automatically detect bugs and create test cases. But one of KLEE's biggest issues is how slow it can get when programs have lots of branching paths-it often becomes too resource-heavy to run on large or complex code. In this project, we wanted to see if a large language model like GPT-4o could simulate the kinds of outputs that KLEE generates. The idea was to explore whether LLMs could one day replace parts of symbolic execution to save time and resources. One specific goal was to have GPT-4o identify the most constrained path in a program, this is the execution path with the most symbolic conditions. These paths are especially important because they often represent edge cases that are harder to test and more likely to contain deep bugs. However, figuring this out usually requires fully running KLEE, which can be expensive. So, we tested whether GPT-4o could predict the KLEE outputs and the most complex path using a dataset of 100 C programs. Our results showed about 20% accuracy in generating KLEE-like outputs and identifying the most constrained path. While not highly accurate, this early work helps show what current LLMs can and can't do when it comes to simulating symbolic execution.

2025-11-11T18:12:09Z Rong Feng Vanisha Gupta Vivek Patel Viroopaksh Reddy Ernampati Suman Saha http://arxiv.org/abs/2511.08403v1 Blockly2Hooks: Smart Contracts for Everyone with the XRP Ledger and Google Blockly 2025-11-11T16:18:39Z

Recent technologies such as inter-ledger payments, non-fungible tokens, and smart contracts are all fruited from the ongoing development of Distributed Ledger Technologies. The foreseen trend is that they will play an increasingly visible role in daily life, which will have to be backed by appropriate operational resources. For example, due to increasing demand, smart contracts could soon face a shortage of knowledgeable users and tools to handle them in practice. Widespread smart contract adoption is currently limited by security, usability and costs aspects. Because of a steep learning curve, the handling of smart contracts is currently performed by specialised developers mainly, and most of the research effort is focusing on smart contract security, while other aspects like usability being somewhat neglected. Specific tools would lower the entry barrier, enabling interested non-experts to create smart contracts. In this paper we designed, developed and tested Blockly2Hooks, a solution towards filling this gap even in challenging scenarios such as when the smart contracts are written in an advanced language like C. With the XRP Ledger as a concrete working case, Blockly2Hooks helps interested non-experts from the community to learn smart contracts easily and adopt the technology, through leveraging well-proven teaching methodologies like Visual Programming Languages, and more specifically, the Blockly Visual Programming library from Google. The platform was developed and tested and the results are promising to make learning smart contract development smoother.

2025-11-11T16:18:39Z 6 pages 2023 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPS) Lucian Trestioreanu Wazen Shbair Flaviene Scheidt de Cristo Radu State 10.1109/DAPPS57946.2023.00027 http://arxiv.org/abs/2511.03866v2 OMPILOT: Harnessing Transformer Models for Auto Parallelization to Shared Memory Computing Paradigms 2025-11-11T15:25:21Z

Recent advances in large language models (LLMs) have significantly accelerated progress in code translation, enabling more accurate and efficient transformation across programming languages. While originally developed for natural language processing, LLMs have shown strong capabilities in modeling programming language syntax and semantics, outperforming traditional rule-based systems in both accuracy and flexibility. These models have streamlined cross-language conversion, reduced development overhead, and accelerated legacy code migration. In this paper, we introduce OMPILOT, a novel domain-specific encoder-decoder transformer tailored for translating C++ code into OpenMP, enabling effective shared-memory parallelization. OMPILOT leverages custom pre-training objectives that incorporate the semantics of parallel constructs and combines both unsupervised and supervised learning strategies to improve code translation robustness. Unlike previous work that focused primarily on loop-level transformations, OMPILOT operates at the function level to capture a wider semantic context. To evaluate our approach, we propose OMPBLEU, a novel composite metric specifically crafted to assess the correctness and quality of OpenMP parallel constructs, addressing limitations in conventional translation metrics.

2025-11-05T21:21:15Z Arijit Bhattacharjee Ali TehraniJamsaz Le Chen Niranjan Hasabnis Mihai Capota Nesreen Ahmed Ali Jannesari http://arxiv.org/abs/2511.06601v1 Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes 2025-11-10T01:17:00Z

Rhetorical modes are useful in both academic and non-academic writing, and can be subjects to be studied within linguistic research and computational modeling. Establishing a conceptual bridge among these domains could enable each to benefit from the others. This paper proposes duality-based mode operations (split-unite, forward-backward, expansion-reduction and orthogonal dualities) to expand the set of rhetorical modes, introducing generated modes like combination and generalization, thereby enhancing epistemic diversity across multiple applications. It further presents a pyramid multilayer mapping framework (e.g., three layers from the rhetorical model layer, to cognitive layer, and to epistemic layers) that reduces the resulting cognitive complexity. The degrees of expressive diversity and complexity reduction are quantified through binomial combinatorics and Shannon entropy analysis. A Marginal Rhetorical Bit (MRB) is identified, permitting the definition of a rhetorical-scalable parameter that measures expressive growth speed in bits per stage. A direct entropy measure shows that hierarchical selection over smaller subsets markedly reduces choice uncertainty compared with flat selection across all modes. These considerations appear to transform static and non-measurable rhetorical taxonomies into more dynamic and more measurable systems for discourse design. From this work, it would be possible to identify a pathway for future AI systems to operate not only on language tokens but on layered rhetorical reasoning structures, bridging linguistic, pedagogical, academic, and computational research

2025-11-10T01:17:00Z Zi-Niu Wu http://arxiv.org/abs/2511.06565v1 FPGA or GPU? Analyzing comparative research for application-specific guidance 2025-11-09T22:54:28Z

The growing complexity of computational workloads has amplified the need for efficient and specialized hardware accelerators. Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) have emerged as prominent solutions, each excelling in specific domains. Although there is substantial research comparing FPGAs and GPUs, most of the work focuses primarily on performance metrics, offering limited insight into the specific types of applications that each accelerator benefits the most. This paper aims to bridge this gap by synthesizing insights from various research articles to guide users in selecting the appropriate accelerator for domain-specific applications. By categorizing the reviewed studies and analyzing key performance metrics, this work highlights the strengths, limitations, and ideal use cases for FPGAs and GPUs. The findings offer actionable recommendations, helping researchers and practitioners navigate trade-offs in performance, energy efficiency, and programmability.

2025-11-09T22:54:28Z 7 pages Arnab A Purkayastha Jay Tharwani Shobhit Aggarwal http://arxiv.org/abs/2511.06120v1 A Deep Learning Model for Predicting Transformation Legality 2025-11-08T20:08:16Z

Compilers must check the legality of code transformations to guarantee the correctness of applying a sequence of code transformations to a given code. While such a legality check needs to be precisely computed in general, we can use an approximate legality prediction model in certain cases, such as training a reinforcement learning (RL) agent for schedule prediction. In this paper, we propose an approximate method for legality checks. We propose a novel DL model for predicting the legality of transformations. The model takes the code representation and a list of transformations as input and predicts whether applying those transformations to the code is legal. We implement and evaluate the proposed model, demonstrating its effectiveness. Our evaluation shows an F1 score of 0.91 on a test set of randomly generated programs. To further evaluate the model in a practical scenario, we used the model to replace the legality check used during the training of an RL agent designed for automatic code optimization. We demonstrate that such a replacement enables the agent to train on twice as many steps, resulting in faster training and reducing resource usage by approximately 80\% for CPU and 35\% for RAM. The agent trained using this approach maintains comparable performance, with only a 4\% reduction on benchmarks from the Polybench suite compared to the traditional method.

2025-11-08T20:08:16Z Avani Tiwari Yacine Hakimi Riyadh Baghdadi http://arxiv.org/abs/2511.06117v1 A Data-driven Analysis of Code Optimizations 2025-11-08T19:58:38Z

As the demand for computational power grows, optimizing code through compilers becomes increasingly crucial. In this context, we focus on fully automatic code optimization techniques that automate the process of selecting and applying code transformations for better performance without manual intervention. Understanding how these transformations behave and interact is key to designing more effective optimization strategies. Compiler developers must make numerous design choices when constructing these heuristics. For instance, they may decide whether to allow transformations to be explored in any arbitrary order or to enforce a fixed sequence. While the former may theoretically offer the best performance gains, it significantly increases the search space. This raises an important question: Can a predefined, fixed order of applying transformations speed up the search without severely compromising optimization potential? In this paper, we address this and other related questions that arise in the design of automatic code optimization algorithms. Using a data-driven approach, we generate a large dataset of random programs, apply random optimization sequences, and record their execution times. Through statistical analysis, we provide insights that guide the development of more efficient automatic code optimization algorithms.

2025-11-08T19:58:38Z Yacine Hakimi Riyadh Baghdadi http://arxiv.org/abs/2511.05739v1 Handling Higher-Order Effectful Operations with Judgemental Monadic Laws 2025-11-07T21:55:50Z

This paper studies the design of programming languages with handlers of higher-order effectful operations -- effectful operations that may take in computations as arguments or return computations as output. We present and analyse a core calculus with higher-kinded impredicative polymorphism, handlers of higher-order effectful operations, and optionally general recursion. The distinctive design choice of this calculus is that handlers are carried by lawless raw monads, while the computation judgements still satisfy the monadic laws judgementally. We present the calculus with a logical framework and give denotational models of the calculus using realizability semantics. We prove closed-term canonicity and parametricity for the recursion-free fragment of the language using synthetic Tait computability and a novel form of the $\top\top$-lifting technique.

2025-11-07T21:55:50Z This is the full version of the paper of the same title to appear at POPL 2026, with detailed proofs and additional explanation in the appendices Zhixuan Yang Nicolas Wu http://arxiv.org/abs/2511.07463v1 Dynamic Stability of LLM-Generated Code 2025-11-07T09:58:06Z

Current evaluations of LLMs for code generation emphasize functional correctness, overlooking the fact that functionally correct solutions can differ significantly in algorithmic complexity. For instance, an $(O(n^2))$ versus $(O(n \log n))$ sorting algorithm may yield similar output but incur vastly different performance costs in production. This discrepancy reveals a critical limitation in current evaluation methods: they fail to capture the behavioral and performance diversity among correct solutions. To address this, we introduce a principled framework for evaluating the dynamic stability of generated code. We propose two metrics derived from opcode distributions: Static Canonical Trace Divergence (SCTD), which captures algorithmic structure diversity across generated solutions, and Dynamic Canonical Trace Divergence (DCTD), which quantifies runtime behavioral variance. Their ratio, the Behavioral Expression Factor (BEF), serves as a diagnostic signal: it indicates critical runtime instability when BEF $\ll$ 1 and functional redundancy when BEF $\gg$ 1. Empirical results on BigOBench and CodeContests show that state-of-the-art LLMs exhibit significant algorithmic variance even among functionally correct outputs. Notably, increasing sampling temperature improves pass@1 rates but degrades stability, revealing an unrecognized trade-off: searching for correct solutions in diverse output spaces introduces a "penalty of instability" between correctness and behavioral consistency. Our findings call for stability-aware objectives in code generation and new benchmarks with asymptotic test cases for robust, real-world LLM evaluation.

2025-11-07T09:58:06Z 10 pages, 8 figures Prateek Rajput Abdoul Aziz Bonkoungou Yewei Song Abdoul Kader Kabore Iyiola E. Olatunji Jacques Klein Tegewende Bissyande http://arxiv.org/abs/2509.26616v2 Black-box Context-free Grammar Inference for Readable & Natural Grammars 2025-11-07T03:12:30Z

Black-box context-free grammar inference is crucial for program analysis, reverse engineering, and security, yet existing tools such as Arvada, TreeVada, and Kedavra struggle with scalability, readability, and accuracy on large, complex languages. We present NatGI, a novel LLM-guided grammar inference framework that extends TreeVada's parse tree recovery with three key innovations: bracket-guided bubble exploration, LLM-driven bubble generation and non-terminal labeling, and hierarchical delta debugging (HDD) for systematic tree simplification. Bracket-guided exploration leverages syntactic cues such as parentheses to propose well-structured grammar fragments, while LLM guidance produces meaningful non-terminal names and selects more promising merges. Finally, HDD incrementally reduces unnecessary rules, which makes the grammars both compact and interpretable. In our experiments, we evaluate NatGI on a comprehensive benchmark suite ranging from small languages to larger ones such as lua, c, and mysql. Our results show that NatGI consistently outperforms strong baselines in terms of F1 score. On average, NatGI achieves an F1 score of 0.57, which is 25pp (percentage points) higher than the best-performing baseline, TreeVada. In the case of interpretability, our generated grammars perform significantly better than those produced by existing approaches. Leveraging LLM-based node renaming and bubble exploration, NatGI produces rules with meaningful non-terminal names and compact structures that align more closely with human intuition. As a result, developers and researchers can achieve higher accuracy while still being able to easily inspect, verify, and reason about the structure and semantics of the induced grammars.

2025-09-30T17:54:25Z 20 pages Mohammad Rifat Arefin Shanto Rahman Christoph Csallner