Large Language Model Powered Symbolic Execution

2025-09-19T07:10:27Z

Large Language Models (LLMs) have emerged as a promising alternative to traditional static program analysis methods, such as symbolic execution, offering the ability to reason over code directly without relying on theorem provers or SMT solvers. However, LLMs are also inherently approximate by nature, and therefore face significant challenges in relation to the accuracy and scale of analysis in real-world applications. Such issues often necessitate the use of larger LLMs with higher token limits, but this requires enterprise-grade hardware (GPUs) and thus limits accessibility for many users. In this paper, we propose LLM-based symbolic execution -- a novel approach that enhances LLM inference via a path-based decomposition of the program analysis tasks into smaller (more tractable) subtasks. The core idea is to generalize path constraints using a generic code-based representation that the LLM can directly reason over, and without translation into another (less-expressive) formal language. We implement our approach in the form of AutoBug, an LLM-based symbolic execution engine that is lightweight and language-agnostic, making it a practical tool for analyzing code that is challenging for traditional approaches. We show that AutoBug can improve both the accuracy and scale of LLM-based program analysis, especially for smaller LLMs that can run on consumer-grade hardware.

Code Less to Code More: Streamlining Language Server Protocol and Type System Development for Language Families

2025-09-18T16:57:01Z

Developing editing support for $L$ languages in $E$ editors is complex and time-consuming. Some languages do not provide dedicated editors, while others offer a single native editor. The $\textit{language server protocol}$ (LSP) reduces the language-editor combinations $L \times E$ to $L + E$, where a single language server communicates with editors via LSP plugins. However, overlapping implementations of linguistic components remain an issue. Existing language workbenches struggle with modularity, reusability, and leveraging type systems for language server generation. In this work, we propose: (i) Typelang, a family of domain-specific languages for modular, composable, and reusable type system implementation, (ii) a modular language server generation process, producing servers for languages built in a modular workbench, (iii) the variant-oriented programming paradigm and a cross-artifact coordination layer to manage interdependent software variants, and (iv) an LSP plugin generator, reducing $E$ to $1$ by automating plugin creation for multiple editors. To simplify editing support for language families, each language artifact integrates its own Typelang variant, used to generate language servers. This reduces combinations to $T \times 1$, where $T = L$ represents the number of type systems. Further reuse of language artifacts across languages lowers this to $N \times 1$, where $N << T$, representing unique type systems. We implement Typelang in Neverlang, generating language servers for each artifact and LSP plugins for three editors. Empirical evaluation shows a 93.48% reduction in characters needed for type system implementation and 100% automation of LSP plugin generation, significantly lowering effort for editing support in language families, especially when artifacts are reused.

Theorem Provers: One Size Fits All?

2025-09-18T14:45:58Z

Theorem provers are important tools for people working in formal verification. There are a myriad of interactive systems available today, with varying features and approaches motivating their development. These design choices impact their usability, alongside the problem domain in which they are employed. We test-drive two such provers, Coq and Idris2, by proving the correctness of insertion sort, before providing a qualitative evaluation of their performance. We then compare their community and library support. This work helps users to make an informed choice of system, and highlight approaches in other systems that developers might find useful.

Refinement-Types Driven Development: A study

2025-09-18T14:38:45Z

This paper advocates for the broader application of SMT solvers in everyday programming, challenging the conventional wisdom that these tools are solely for formal methods and verification. We claim that SMT solvers, when seamlessly integrated into a compiler's static checks, significantly enhance the capabilities of ordinary type checkers in program composition. Specifically, we argue that refinement types, as embodied by Liquid Haskell, enable the use of SMT solvers in mundane programming tasks. Through a case study on handling binder scopes in compilers, we envision a future where ordinary programming is made simpler and more enjoyable with the aid of refinement types and SMT solvers. As a secondary contribution, we present a prototype implementation of a theory of finite maps for Liquid Haskell's solver, developed to support our case study.

Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges

2025-09-18T14:13:30Z

This study examines the performance of today's open-source, locally hosted large-language models (LLMs) in handling complex competitive programming tasks with extended problem descriptions and contexts. Building on the original Framework for AI-driven Code Generation Evaluation (FACE), the authors retrofit the pipeline to work entirely offline through the Ollama runtime, collapsing FACE's sprawling per-problem directory tree into a handful of consolidated JSON files, and adding robust checkpointing so multi-day runs can resume after failures. The enhanced framework generates, submits, and records solutions for the full Kattis corpus of 3,589 problems across eight code-oriented models ranging from 6.7-9 billion parameters. The submission results show that the overall pass@1 accuracy is modest for the local models, with the best models performing at approximately half the acceptance rate of the proprietary models, Gemini 1.5 and ChatGPT-4. These findings expose a persistent gap between private, cost-controlled LLM deployments and state-of-the-art proprietary services, yet also highlight the rapid progress of open models and the practical benefits of an evaluation workflow that organizations can replicate on in-house hardware.

Mini-Batch Robustness Verification of Deep Neural Networks

2025-09-18T07:33:26Z

Neural network image classifiers are ubiquitous in many safety-critical applications. However, they are susceptible to adversarial attacks. To understand their robustness to attacks, many local robustness verifiers have been proposed to analyze $ε$-balls of inputs. Yet, existing verifiers introduce a long analysis time or lose too much precision, making them less effective for a large set of inputs. In this work, we propose a new approach to local robustness: group local robustness verification. The key idea is to leverage the similarity of the network computations of certain $ε$-balls to reduce the overall analysis time. We propose BaVerLy, a sound and complete verifier that boosts the local robustness verification of a set of $ε$-balls by dynamically constructing and verifying mini-batches. BaVerLy adaptively identifies successful mini-batch sizes, accordingly constructs mini-batches of $ε$-balls that have similar network computations, and verifies them jointly. If a mini-batch is verified, all its $ε$-balls are proven robust. Otherwise, one $ε$-ball is suspected as not being robust, guiding the refinement. BaVerLy leverages the analysis results to expedite the analysis of that $ε$-ball as well as the analysis of the mini-batch with the other $ε$-balls. We evaluate BaVerLy on fully connected and convolutional networks for MNIST and CIFAR-10. Results show that BaVerLy scales the common one by one verification by 2.3x on average and up to 4.1x, in which case it reduces the total analysis time from 24 hours to 6 hours.

SALT4Decompile: Inferring Source-level Abstract Logic Tree for LLM-Based Binary Decompilation

2025-09-18T05:57:15Z

Decompilation is widely used in reverse engineering to recover high-level language code from binary executables. While recent approaches leveraging Large Language Models (LLMs) have shown promising progress, they typically treat assembly code as a linear sequence of instructions, overlooking arbitrary jump patterns and isolated data segments inherent to binary files. This limitation significantly hinders their ability to correctly infer source code semantics from assembly code. To address this limitation, we propose \saltm, a novel binary decompilation method that abstracts stable logical features shared between binary and source code. The core idea of \saltm is to abstract selected binary-level operations, such as specific jumps, into a high-level logic framework that better guides LLMs in semantic recovery. Given a binary function, \saltm constructs a Source-level Abstract Logic Tree (\salt) from assembly code to approximate the logic structure of high-level language. It then fine-tunes an LLM using the reconstructed \salt to generate decompiled code. Finally, the output is refined through error correction and symbol recovery to improve readability and correctness. We compare \saltm to three categories of baselines (general-purpose LLMs, commercial decompilers, and decompilation methods) using three well-known datasets (Decompile-Eval, MBPP, Exebench). Our experimental results demonstrate that \saltm is highly effective in recovering the logic of the source code, significantly outperforming state-of-the-art methods (e.g., 70.4\% TCP rate on Decompile-Eval with a 10.6\% improvement). The results further validate its robustness against four commonly used obfuscation techniques. Additionally, analyses of real-world software and a user study confirm that our decompiled output offers superior assistance to human analysts in comprehending binary functions.

Automating Modelica Module Generation Using Large Language Models: A Case Study on Building Control Description Language

2025-09-18T05:07:17Z

Dynamic energy systems and controls require advanced modeling frameworks to design and test supervisory and fault tolerant strategies. Modelica is a widely used equation based language, but developing control modules is labor intensive and requires specialized expertise. This paper examines the use of large language models (LLMs) to automate the generation of Control Description Language modules in the Building Modelica Library as a case study. We developed a structured workflow that combines standardized prompt scaffolds, library aware grounding, automated compilation with OpenModelica, and human in the loop evaluation. Experiments were carried out on four basic logic tasks (And, Or, Not, and Switch) and five control modules (chiller enable/disable, bypass valve control, cooling tower fan speed, plant requests, and relief damper control). The results showed that GPT 4o failed to produce executable Modelica code in zero shot mode, while Claude Sonnet 4 achieved up to full success for basic logic blocks with carefully engineered prompts. For control modules, success rates reached 83 percent, and failed outputs required medium level human repair (estimated one to eight hours). Retrieval augmented generation often produced mismatches in module selection (for example, And retrieved as Or), while a deterministic hard rule search strategy avoided these errors. Human evaluation also outperformed AI evaluation, since current LLMs cannot assess simulation results or validate behavioral correctness. Despite these limitations, the LLM assisted workflow reduced the average development time from 10 to 20 hours down to 4 to 6 hours per module, corresponding to 40 to 60 percent time savings. These results highlight both the potential and current limitations of LLM assisted Modelica generation, and point to future research in pre simulation validation, stronger grounding, and closed loop evaluation.

DeliverC: Teaching Pointers through GenAI-Powered Game-Based Learning

2025-09-18T00:13:04Z

While game-based learning is widely used in programming education, few tools offer adaptive, real-time support for complex topics, such as C pointers. We present DeliverC, a GenAI-enhanced game that integrates GPT-4-mini to provide personalized hints and generate pointer-related challenges on the fly. In a pilot study involving 25 undergraduate students, we investigated the impact of the system on learning through gameplay data and a 15-item survey that covered constructs such as motivation, self-efficacy, metacognition, and feedback quality. Results show that most students felt more confident and reflective after using the tool, and error rates decreased as students progressed through scaffolded levels. However, participation decreased with task difficulty, and some students reported receiving unclear or vague feedback. These findings suggest that DeliverC can enhance engagement and understanding in systems programming, although refinement in AI-generated feedback is still needed. Our study highlights the potential of combining GenAI with game-based learning to support personalized and interactive practice in traditionally challenging programming domains.

A Taxonomy of Prompt Defects in LLM Systems

2025-09-17T20:11:22Z

Large Language Models (LLMs) have become key components of modern software, with prompts acting as their de-facto programming interface. However, prompt design remains largely empirical and small mistakes can cascade into unreliable, insecure, or inefficient behavior. This paper presents the first systematic survey and taxonomy of prompt defects, recurring ways that prompts fail to elicit their intended behavior from LLMs. We organize defects along six dimensions: (1) Specification and Intent, (2) Input and Content, (3) Structure and Formatting, (4) Context and Memory, (5) Performance and Efficiency, and (6) Maintainability and Engineering. Each dimension is refined into fine-grained subtypes, illustrated with concrete examples and root cause analysis. Grounded in software engineering principles, we show how these defects surface in real development workflows and examine their downstream effects. For every subtype, we distill mitigation strategies that span emerging prompt engineering patterns, automated guardrails, testing harnesses, and evaluation frameworks. We then summarize these strategies in a master taxonomy that links defect, impact, and remedy. We conclude with open research challenges and a call for rigorous engineering-oriented methodologies to ensure that LLM-driven systems are dependable by design.

Julia GraphBLAS with Nonblocking Execution

2025-09-17T17:41:17Z

From the beginning, the GraphBLAS were designed for ``nonblocking execution''; i.e., calls to GraphBLAS methods return as soon as the arguments to the methods are validated and define a directed acyclic graph (DAG) of GraphBLAS operations. This lets GraphBLAS implementations fuse functions, elide unneeded objects, exploit parallelism, plus any additional DAG-preserving transformations. GraphBLAS implementations exist that utilize nonblocking execution but with limited scope. In this paper, we describe our work to implement GraphBLAS with support for aggressive nonblocking execution. We show how features of the Julia programming language greatly simplify implementation of nonblocking execution. This is \emph{work-in-progress} sufficient to show the potential for nonblocking execution and is limited to GraphBLAS methods required to support PageRank.

Parallelizable Feynman-Kac Models for Universal Probabilistic Programming

2025-09-17T15:33:41Z

We study provably correct and efficient instantiations of Sequential Monte Carlo (SMC) inference in the context of formal operational semantics of Probabilistic Programs (PPs). We focus on universal PPs featuring sampling from arbitrary measures and conditioning/reweighting in unbounded loops. We first equip Probabilistic Program Graphs (PPGs), an automata-theoretic description format of PPs, with an expectation-based semantics over infinite execution traces, which also incorporates trace weights. We then prove a finite approximation theorem that provides bounds to this semantics based on expectations taken over finite, fixed-length traces. This enables us to frame our semantics within a Feynman-Kac (FK) model, and ensures the consistency of the Particle Filtering (PF) algorithm, an instance of SMC, with respect to our semantics. Building on these results, we introduce VPF, a vectorized version of the PF algorithm tailored to PPGs and our semantics. Experiments conducted with a proof-of-concept implementation of VPF show very promising results compared to state-of-the-art PP inference tools.

CLMTracing: Black-box User-level Watermarking for Code Language Model Tracing

2025-09-17T13:53:08Z

With the widespread adoption of open-source code language models (code LMs), intellectual property (IP) protection has become an increasingly critical concern. While current watermarking techniques have the potential to identify the code LM to protect its IP, they have limitations when facing the more practical and complex demand, i.e., offering the individual user-level tracing in the black-box setting. This work presents CLMTracing, a black-box code LM watermarking framework employing the rule-based watermarks and utility-preserving injection method for user-level model tracing. CLMTracing further incorporates a parameter selection algorithm sensitive to the robust watermark and adversarial training to enhance the robustness against watermark removal attacks. Comprehensive evaluations demonstrate CLMTracing is effective across multiple state-of-the-art (SOTA) code LMs, showing significant harmless improvements compared to existing SOTA baselines and strong robustness against various removal attacks.

VerilogMonkey: Exploring Parallel Scaling for Automated Verilog Code Generation with LLMs

2025-09-17T10:12:24Z

We present VerilogMonkey, an empirical study of parallel scaling for the under-explored task of automated Verilog generation. Parallel scaling improves LLM performance by sampling many outputs in parallel. Across multiple benchmarks and mainstream LLMs, we find that scaling to hundreds of samples is cost-effective in both time and money and, even without any additional enhancements such as post-training or agentic methods, surpasses prior results on LLM-based Verilog generation. We further dissect why parallel scaling delivers these gains and show how output randomness in LLMs affects its effectiveness.

REAMS: Reasoning Enhanced Algorithm for Maths Solving

2025-09-16T21:09:48Z

The challenges of solving complex university-level mathematics problems, particularly those from MIT, and Columbia University courses, and selected tasks from the MATH dataset, remain a significant obstacle in the field of artificial intelligence. Conventional methods have consistently fallen short in this domain, highlighting the need for more advanced approaches. In this paper, we introduce a language-based solution that leverages zero-shot learning and mathematical reasoning to effectively solve, explain, and generate solutions for these advanced math problems. By integrating program synthesis, our method reduces reliance on large-scale training data while significantly improving problem-solving accuracy. Our approach achieves an accuracy of 90.15%, representing a substantial improvement over the previous benchmark of 81% and setting a new standard in automated mathematical problem-solving. These findings highlight the significant potential of advanced AI methodologies to address and overcome the challenges presented by some of the most complex mathematical courses and datasets.