https://arxiv.org/api/gxKzlgl7KiBc19VBLNKPpUV0uV8 2026-06-23T21:56:43Z 9934 1125 15 http://arxiv.org/abs/2509.16215v2 Discovering Software Parallelization Points Using Deep Neural Networks 2025-10-01T23:33:46Z

This study proposes a deep learning-based approach for discovering loops in programming code according to their potential for parallelization. Two genetic algorithm-based code generators were developed to produce two distinct types of code: (i) independent loops, which are parallelizable, and (ii) ambiguous loops, whose dependencies are unclear, making them impossible to define if the loop is parallelizable or not. The generated code snippets were tokenized and preprocessed to ensure a robust dataset. Two deep learning models - a Deep Neural Network (DNN) and a Convolutional Neural Network (CNN) - were implemented to perform the classification. Based on 30 independent runs, a robust statistical analysis was employed to verify the expected performance of both models, DNN and CNN. The CNN showed a slightly higher mean performance, but the two models had a similar variability. Experiments with varying dataset sizes highlighted the importance of data diversity for model performance. These results demonstrate the feasibility of using deep learning to automate the identification of parallelizable structures in code, offering a promising tool for software optimization and performance improvement.

2025-09-05T15:32:23Z 17 pages, 10 figures Izavan dos S. Correia Henrique C. T. Santos Tiago A. E. Ferreira http://arxiv.org/abs/2504.01841v4 Garbage Collection for Rust: The Finalizer Frontier 2025-09-30T12:55:51Z

Rust is a non-Garbage Collected (GCed) language, but the lack of GC makes expressing data-structures that require shared ownership awkward, inefficient, or both. In this paper we explore a new design for, and implementation of, GC in Rust, called Alloy. Unlike previous approaches to GC in Rust, Alloy allows existing Rust destructors to be automatically used as GC finalizers: this makes Alloy integrate better with existing Rust code than previous solutions but introduces surprising soundness and performance problems. Alloy provides novel solutions for the core problems: finalizer safety analysis rejects unsound destructors from automatically being reused as finalizers; finalizer elision optimises away unnecessary finalizers; and premature finalizer prevention ensures that finalizers are only run when it is provably safe to do so.

2025-04-02T15:46:58Z 33 pages, 13 figures Jacob Hughes Laurence Tratt 10.1145/3763179 http://arxiv.org/abs/2509.25879v1 Monoid Structures on Indexed Containers 2025-09-30T07:21:09Z

Containers represent a wide class of type constructions relevant for functional programming and (co)inductive reasoning. Indexed containers generalize this notion to better fit the scope of dependently typed programming. When interpreting types to be sets, a container describes an endofunctor on the category of sets while an I-indexed container describes an endofunctor on the category Set^I of I-indexed families of sets. We consider the monoidal structure on the category of I-indexed containers whose tensor product of containers describes the composition of the respective induced endofunctors. We then give a combinatorial characterization of monoids in this monoidal category, and we show how these monoids correspond precisely to monads on the induced endofunctors on Set^I. Lastly, we conclude by presenting some examples of monads on Set^I that fall under our characterization, including the product of two monads, indexed variants of the state and the writer monads and an example of a free monad. The technical results of this work are accompanied by a formalization in the proof assistant Cubical Agda.

2025-09-30T07:21:09Z In Proceedings LSFA 2025, arXiv:2509.23739 EPTCS 430, 2025, pp. 37-54 Michele De Pascalis Tarmo Uustalu Niccolò Veltrì 10.4204/EPTCS.430.4 http://arxiv.org/abs/2509.25873v1 Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs 2025-09-30T07:07:32Z

Large language models (LLMs) are increasingly being applied to programming tasks, ranging from single-turn code completion to autonomous agents. Current code agent designs frequently depend on complex, hand-crafted workflows and tool sets. However, this reliance on elaborate scaffolding presents several challenges: agent performance becomes overly dependent on prompt tuning and custom design choices, heavy human intervention obscures a model's true underlying capabilities, and intricate pipelines are costly to build and maintain. Furthermore, optimizing complex task prompts increases the risk of data leakage. Currently, when introducing new models, LLM providers like OpenAI and Anthropic often publish benchmark scores to demonstrate their models' coding proficiency, but keep their proprietary evaluation frameworks confidential. To address these limitations, we introduce Lita (Lite Agent), which operationalizes liteness, a principle of minimizing manual design while retaining the essential elements of a fully autonomous agent. Lita enables a more faithful and unified evaluation without elaborate scaffolding. Experiments on the Aider Polyglot and SWE-Bench with frontier models demonstrate that Lita achieves competitive or superior performance compared to workflow-based and agentic baselines. Crucially, Lita also consumes fewer tokens and requires significantly less design effort. Our results suggest that Lita is sufficient to reveal the underlying coding competence of modern LLMs. Finally, we propose the Agent Complexity Law: the performance gap between agents of varying complexity, from simple to sophisticated designs, will shrink as the core model improves, ultimately converging to a negligible difference.

2025-09-30T07:07:32Z Hankun Dai Maoquan Wang Mengnan Qi Yikai Zhang Zijian Jin Yongqiang Yao Yufan Huang Shengyu Fu Elsie Nallipogu http://arxiv.org/abs/2403.05302v3 Modeling Dynamic (De)Allocations of Local Memory for Translation Validation 2025-09-30T05:53:22Z

End-to-End Translation Validation is the problem of verifying the executable code generated by a compiler against the corresponding input source code for a single compilation. This becomes particularly hard in the presence of dynamically-allocated local memory where addresses of local memory may be observed by the program. In the context of validating the translation of a C procedure to executable code, a validator needs to tackle constant-length local arrays, address-taken local variables, address-taken formal parameters, variable-length local arrays, procedure-call arguments (including variadic arguments), and the alloca() operator. We provide an execution model, a definition of refinement, and an algorithm to soundly convert a refinement check into first-order logic queries that an off-the-shelf SMT solver can handle efficiently. In our experiments, we perform blackbox translation validation of C procedures (with up to 100+ SLOC), involving these local memory allocation constructs, against their corresponding assembly implementations (with up to 200+ instructions) generated by an optimizing compiler with complex loop and vectorizing transformations.

2024-03-08T13:31:50Z Abhishek Rose Sorav Bansal 10.1145/3649863 http://arxiv.org/abs/2503.12686v2 Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters 2025-09-30T01:21:46Z

Large language models (LLMs) are increasingly used for program verification, and yet little is known about \emph{how} they reason about program semantics during this process. In this work, we focus on abstract interpretation based-reasoning for invariant generation and introduce two novel prompting strategies that aim to elicit such reasoning from LLMs. We evaluate these strategies across several state-of-the-art LLMs on 22 programs from the SV-COMP benchmark suite widely used in software verification. We analyze both the soundness of the generated invariants and the key thematic patterns in the models' reasoning errors. This work aims to highlight new research opportunities at the intersection of LLMs and program verification for applying LLMs to verification tasks and advancing their reasoning capabilities in this application.

2025-03-16T23:05:52Z Jacqueline L. Mitchell Brian Hyeongseok Kim Chenyu Zhou Chao Wang http://arxiv.org/abs/2503.21937v2 Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming 2025-09-29T18:34:13Z

Neurosymbolic programs combine deep learning with symbolic reasoning to achieve better data efficiency, interpretability, and generalizability compared to standalone deep learning approaches. However, existing neurosymbolic learning frameworks implement an uneasy marriage between a highly scalable, GPU-accelerated neural component and a slower symbolic component that runs on CPUs. We propose Lobster, a unified framework for harnessing GPUs in an end-to-end manner for neurosymbolic learning. Lobster maps a general neurosymbolic language based on Datalog to the GPU programming paradigm. This mapping is implemented via compilation to a new intermediate language called APM. The extra abstraction provided by apm allows Lobster to be both flexible, supporting discrete, probabilistic, and differentiable modes of reasoning on GPU hardware with a library of provenance semirings, and performant, implementing new optimization passes. We demonstrate that Lobster programs can solve interesting problems spanning the domains of natural language processing, image processing, program reasoning, bioinformatics, and planning. On a suite of 9 applications, Lobster achieves an average speedup of 3.9x over Scallop, a state-of-the-art neurosymbolic framework, and enables scaling of neurosymbolic solutions to previously infeasible tasks.

2025-03-27T19:32:58Z Accepted at ASPLOS 2026 Paul Biberstein Ziyang Li Joseph Devietti Mayur Naik 10.1145/3760250.3762232 http://arxiv.org/abs/2510.12803v1 AutoCode: LLMs as Problem Setters for Competitive Programming 2025-09-29T17:59:03Z

Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.

2025-09-29T17:59:03Z Project page: https://livecodebenchpro.com/projects/autocode/overview Shang Zhou Zihan Zheng Kaiyuan Liu Zeyu Shen Zerui Cheng Zexing Chen Hansen He Jianzhu Yao Huanzhi Mao Qiuyang Mang Tianfu Fu Beichen Li Dongruixuan Li Wenhao Chai Zhuang Liu Aleksandra Korolova Peter Henderson Natasha Jaques Pramod Viswanath Saining Xie Jingbo Shang http://arxiv.org/abs/2509.25114v1 From Affine to Polynomial: Synthesizing Loops with Branches via Algebraic Geometry 2025-09-29T17:41:16Z

Ensuring software correctness remains a fundamental challenge in formal program verification. One promising approach relies on finding polynomial invariants for loops. Polynomial invariants are properties of a program loop that hold before and after each iteration. Generating such invariants is a crucial task in loop analysis, but it is undecidable in the general case. Recently, an alternative approach to this problem has emerged, focusing on synthesizing loops from invariants. However, existing methods only synthesize affine loops without guard conditions from polynomial invariants. In this paper, we address a more general problem, allowing loops to have polynomial update maps with a given structure, inequations in the guard condition, and polynomial invariants of arbitrary form. We use algebraic geometry tools to design and implement an algorithm that computes a finite set of polynomial equations whose solutions correspond to all nondeterministic branching loops satisfying the given invariants. Furthermore, we introduce a new class of invariants for which we present a significantly more efficient algorithm. In other words, we reduce the problem of synthesizing loops to find solutions of multivariate polynomial systems with rational entries. This final step is handled in our software using an SMT solver.

2025-09-29T17:41:16Z Erdenebayar Bayarmagnai Fatemeh Mohammadi Rémi Prébet http://arxiv.org/abs/1208.4041v4 Ranking Functions for Linear-Constraint Loops 2025-09-29T10:52:38Z

In this paper we study the complexity of the problems: given a loop, described by linear constraints over a finite set of variables, is there a linear or lexicographical-linear ranking function for this loop? While existence of such functions implies termination, these problems are not equivalent to termination. When the variables range over the rationals (or reals), it is known that both problems are PTIME decidable. However, when they range over the integers, whether for single-path or multipath loops, the complexity has not yet been determined. We show that both problems are coNP-complete. However, we point out some special cases of importance of PTIME complexity. We also present complete algorithms for synthesizing linear and lexicographical-linear ranking functions, both for the general case and the special PTIME cases. Moreover, in the rational setting, our algorithm for synthesizing lexicographical-linear ranking functions extends existing ones, because our class of ranking functions is more general, yet it has polynomial time complexity.

2012-08-20T15:25:45Z This version explains (in the new Lemma 5.31) how the procedure of Lemma 5.29 can be implemented in polynomial time. This new lemma does not appear in the JACM version of this paper J. ACM 61(4): 26:1-26:55 (2014) Amir M. Ben-Amram Samir Genaim 10.1145/2629488 http://arxiv.org/abs/2509.24515v1 Agentic Specification Generator for Move Programs 2025-09-29T09:34:31Z

While LLM-based specification generation is gaining traction, existing tools primarily focus on mainstream programming languages like C, Java, and even Solidity, leaving emerging and yet verification-oriented languages like Move underexplored. In this paper, we introduce MSG, an automated specification generation tool designed for Move smart contracts. MSG aims to highlight key insights that uniquely present when applying LLM-based specification generation to a new ecosystem. Specifically, MSG demonstrates that LLMs exhibit robust code comprehension and generation capabilities even for non-mainstream languages. MSG successfully generates verifiable specifications for 84% of tested Move functions and even identifies clauses previously overlooked by experts. Additionally, MSG shows that explicitly leveraging specification language features through an agentic, modular design improves specification quality substantially (generating 57% more verifiable clauses than conventional designs). Incorporating feedback from the verification toolchain further enhances the effectiveness of MSG, leading to a 30% increase in generated verifiable specifications.

2025-09-29T09:34:31Z 18 pages; Extended version of ASE'25 paper with extra appendices Yu-Fu Fu Meng Xu Taesoo Kim http://arxiv.org/abs/2510.09633v1 Hound: Relation-First Knowledge Graphs for Complex-System Reasoning in Security Audits 2025-09-29T02:46:02Z

Hound introduces a relation-first graph engine that improves system-level reasoning across interrelated components in complex codebases. The agent designs flexible, analyst-defined views with compact annotations (e.g., monetary/value flows, authentication/authorization roles, call graphs, protocol invariants) and uses them to anchor exact retrieval: for any question, it loads precisely the code that matters (often across components) so it can zoom out to system structure and zoom in to the decisive lines. A second contribution is a persistent belief system: long-lived vulnerability hypotheses whose confidence is updated as evidence accrues. The agent employs coverage-versus-intuition planning and a QA finalizer to confirm or reject hypotheses. On a five-project subset of ScaBench[1], Hound improves recall and F1 over a baseline LLM analyzer (micro recall 31.2% vs. 8.3%; F1 14.2% vs. 9.8%) with a modest precision trade-off. We attribute these gains to flexible, relation-first graphs that extend model understanding beyond call/dataflow to abstract aspects, plus the hypothesis-centric loop; code and artifacts are released to support reproduction.

2025-09-29T02:46:02Z Bernhard Mueller http://arxiv.org/abs/2510.15914v1 VeriGRAG: Enhancing LLM-Based Verilog Code Generation with Structure-Aware Soft Prompts 2025-09-27T10:23:36Z

Large language models (LLMs) have demonstrated strong capabilities in generating Verilog code from natural language descriptions. However, Verilog code inherently encodes structural information of hardware circuits. Effectively leveraging this structural information to enhance the functional and syntactic correctness of LLM-generated Verilog code remains a significant challenge. To address this challenge, we propose VeriGRAG , a novel framework that extracts structural graph embeddings from Verilog code using graph neural networks (GNNs). A multimodal retriever then selects the graph embeddings most relevant to the given generation task, which are aligned with the code modality through the VeriFormer module to generate structure-aware soft prompts. Our experiments demonstrate that VeriGRAG substantially improves the correctness of Verilog code generation, achieving state-of-the-art or superior performance across both VerilogEval and RTLLM benchmarks.

2025-09-27T10:23:36Z 9 pages, 5 figures Jiayu Zhao Song Chen http://arxiv.org/abs/2509.25248v1 BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software 2025-09-27T03:02:46Z

Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) used selective evaluation on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with enhanced build instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and is adaptable to heterogeneous OSS characteristics. We also provide detailed analysis regarding different compilation method design choices and their influence to the whole task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering tasks, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.

2025-09-27T03:02:46Z Zehua Zhang Ati Priya Bajaj Divij Handa Siyu Liu Arvind S Raj Hongkai Chen Hulin Wang Yibo Liu Zion Leonahenahe Basque Souradip Nath Vishal Juneja Nikhil Chapre Yan Shoshitaishvili Adam Doupé Chitta Baral Ruoyu Wang http://arxiv.org/abs/2509.23061v1 Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification 2025-09-27T02:33:08Z

We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.

2025-09-27T02:33:08Z Xu Xu Xin Li Xingwei Qu Jie Fu Binhang Yuan