https://arxiv.org/api/gxKzlgl7KiBc19VBLNKPpUV0uV82026-06-23T21:56:43Z9934112515http://arxiv.org/abs/2509.16215v2Discovering Software Parallelization Points Using Deep Neural Networks2025-10-01T23:33:46ZThis study proposes a deep learning-based approach for discovering loops in programming code according to their potential for parallelization. Two genetic algorithm-based code generators were developed to produce two distinct types of code: (i) independent loops, which are parallelizable, and (ii) ambiguous loops, whose dependencies are unclear, making them impossible to define if the loop is parallelizable or not. The generated code snippets were tokenized and preprocessed to ensure a robust dataset. Two deep learning models - a Deep Neural Network (DNN) and a Convolutional Neural Network (CNN) - were implemented to perform the classification. Based on 30 independent runs, a robust statistical analysis was employed to verify the expected performance of both models, DNN and CNN. The CNN showed a slightly higher mean performance, but the two models had a similar variability. Experiments with varying dataset sizes highlighted the importance of data diversity for model performance. These results demonstrate the feasibility of using deep learning to automate the identification of parallelizable structures in code, offering a promising tool for software optimization and performance improvement.2025-09-05T15:32:23Z17 pages, 10 figuresIzavan dos S. CorreiaHenrique C. T. SantosTiago A. E. Ferreirahttp://arxiv.org/abs/2504.01841v4Garbage Collection for Rust: The Finalizer Frontier2025-09-30T12:55:51ZRust is a non-Garbage Collected (GCed) language, but the lack of GC makes expressing data-structures that require shared ownership awkward, inefficient, or both. In this paper we explore a new design for, and implementation of, GC in Rust, called Alloy. Unlike previous approaches to GC in Rust, Alloy allows existing Rust destructors to be automatically used as GC finalizers: this makes Alloy integrate better with existing Rust code than previous solutions but introduces surprising soundness and performance problems. Alloy provides novel solutions for the core problems: finalizer safety analysis rejects unsound destructors from automatically being reused as finalizers; finalizer elision optimises away unnecessary finalizers; and premature finalizer prevention ensures that finalizers are only run when it is provably safe to do so.2025-04-02T15:46:58Z33 pages, 13 figuresJacob HughesLaurence Tratt10.1145/3763179http://arxiv.org/abs/2509.25879v1Monoid Structures on Indexed Containers2025-09-30T07:21:09ZContainers represent a wide class of type constructions relevant for functional programming and (co)inductive reasoning. Indexed containers generalize this notion to better fit the scope of dependently typed programming. When interpreting types to be sets, a container describes an endofunctor on the category of sets while an I-indexed container describes an endofunctor on the category Set^I of I-indexed families of sets.
We consider the monoidal structure on the category of I-indexed containers whose tensor product of containers describes the composition of the respective induced endofunctors. We then give a combinatorial characterization of monoids in this monoidal category, and we show how these monoids correspond precisely to monads on the induced endofunctors on Set^I. Lastly, we conclude by presenting some examples of monads on Set^I that fall under our characterization, including the product of two monads, indexed variants of the state and the writer monads and an example of a free monad. The technical results of this work are accompanied by a formalization in the proof assistant Cubical Agda.2025-09-30T07:21:09ZIn Proceedings LSFA 2025, arXiv:2509.23739EPTCS 430, 2025, pp. 37-54Michele De PascalisTarmo UustaluNiccolò Veltrì10.4204/EPTCS.430.4http://arxiv.org/abs/2509.25873v1Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs2025-09-30T07:07:32ZLarge language models (LLMs) are increasingly being applied to programming tasks, ranging from single-turn code completion to autonomous agents. Current code agent designs frequently depend on complex, hand-crafted workflows and tool sets. However, this reliance on elaborate scaffolding presents several challenges: agent performance becomes overly dependent on prompt tuning and custom design choices, heavy human intervention obscures a model's true underlying capabilities, and intricate pipelines are costly to build and maintain. Furthermore, optimizing complex task prompts increases the risk of data leakage. Currently, when introducing new models, LLM providers like OpenAI and Anthropic often publish benchmark scores to demonstrate their models' coding proficiency, but keep their proprietary evaluation frameworks confidential. To address these limitations, we introduce Lita (Lite Agent), which operationalizes liteness, a principle of minimizing manual design while retaining the essential elements of a fully autonomous agent. Lita enables a more faithful and unified evaluation without elaborate scaffolding. Experiments on the Aider Polyglot and SWE-Bench with frontier models demonstrate that Lita achieves competitive or superior performance compared to workflow-based and agentic baselines. Crucially, Lita also consumes fewer tokens and requires significantly less design effort. Our results suggest that Lita is sufficient to reveal the underlying coding competence of modern LLMs. Finally, we propose the Agent Complexity Law: the performance gap between agents of varying complexity, from simple to sophisticated designs, will shrink as the core model improves, ultimately converging to a negligible difference.2025-09-30T07:07:32ZHankun DaiMaoquan WangMengnan QiYikai ZhangZijian JinYongqiang YaoYufan HuangShengyu FuElsie Nallipoguhttp://arxiv.org/abs/2403.05302v3Modeling Dynamic (De)Allocations of Local Memory for Translation Validation2025-09-30T05:53:22ZEnd-to-End Translation Validation is the problem of verifying the executable code generated by a compiler against the corresponding input source code for a single compilation. This becomes particularly hard in the presence of dynamically-allocated local memory where addresses of local memory may be observed by the program. In the context of validating the translation of a C procedure to executable code, a validator needs to tackle constant-length local arrays, address-taken local variables, address-taken formal parameters, variable-length local arrays, procedure-call arguments (including variadic arguments), and the alloca() operator. We provide an execution model, a definition of refinement, and an algorithm to soundly convert a refinement check into first-order logic queries that an off-the-shelf SMT solver can handle efficiently. In our experiments, we perform blackbox translation validation of C procedures (with up to 100+ SLOC), involving these local memory allocation constructs, against their corresponding assembly implementations (with up to 200+ instructions) generated by an optimizing compiler with complex loop and vectorizing transformations.2024-03-08T13:31:50ZAbhishek RoseSorav Bansal10.1145/3649863http://arxiv.org/abs/2503.12686v2Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters2025-09-30T01:21:46ZLarge language models (LLMs) are increasingly used for program verification, and yet little is known about \emph{how} they reason about program semantics during this process. In this work, we focus on abstract interpretation based-reasoning for invariant generation and introduce two novel prompting strategies that aim to elicit such reasoning from LLMs. We evaluate these strategies across several state-of-the-art LLMs on 22 programs from the SV-COMP benchmark suite widely used in software verification. We analyze both the soundness of the generated invariants and the key thematic patterns in the models' reasoning errors. This work aims to highlight new research opportunities at the intersection of LLMs and program verification for applying LLMs to verification tasks and advancing their reasoning capabilities in this application.2025-03-16T23:05:52ZJacqueline L. MitchellBrian Hyeongseok KimChenyu ZhouChao Wanghttp://arxiv.org/abs/2503.21937v2Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming2025-09-29T18:34:13ZNeurosymbolic programs combine deep learning with symbolic reasoning to achieve better data efficiency, interpretability, and generalizability compared to standalone deep learning approaches. However, existing neurosymbolic learning frameworks implement an uneasy marriage between a highly scalable, GPU-accelerated neural component and a slower symbolic component that runs on CPUs.
We propose Lobster, a unified framework for harnessing GPUs in an end-to-end manner for neurosymbolic learning. Lobster maps a general neurosymbolic language based on Datalog to the GPU programming paradigm. This mapping is implemented via compilation to a new intermediate language called APM. The extra abstraction provided by apm allows Lobster to be both flexible, supporting discrete, probabilistic, and differentiable modes of reasoning on GPU hardware with a library of provenance semirings, and performant, implementing new optimization passes.
We demonstrate that Lobster programs can solve interesting problems spanning the domains of natural language processing, image processing, program reasoning, bioinformatics, and planning. On a suite of 9 applications, Lobster achieves an average speedup of 3.9x over Scallop, a state-of-the-art neurosymbolic framework, and enables scaling of neurosymbolic solutions to previously infeasible tasks.2025-03-27T19:32:58ZAccepted at ASPLOS 2026Paul BibersteinZiyang LiJoseph DeviettiMayur Naik10.1145/3760250.3762232http://arxiv.org/abs/2510.12803v1AutoCode: LLMs as Problem Setters for Competitive Programming2025-09-29T17:59:03ZWriting competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.2025-09-29T17:59:03ZProject page: https://livecodebenchpro.com/projects/autocode/overviewShang ZhouZihan ZhengKaiyuan LiuZeyu ShenZerui ChengZexing ChenHansen HeJianzhu YaoHuanzhi MaoQiuyang MangTianfu FuBeichen LiDongruixuan LiWenhao ChaiZhuang LiuAleksandra KorolovaPeter HendersonNatasha JaquesPramod ViswanathSaining XieJingbo Shanghttp://arxiv.org/abs/2509.25114v1From Affine to Polynomial: Synthesizing Loops with Branches via Algebraic Geometry2025-09-29T17:41:16ZEnsuring software correctness remains a fundamental challenge in formal program verification. One promising approach relies on finding polynomial invariants for loops. Polynomial invariants are properties of a program loop that hold before and after each iteration. Generating such invariants is a crucial task in loop analysis, but it is undecidable in the general case. Recently, an alternative approach to this problem has emerged, focusing on synthesizing loops from invariants. However, existing methods only synthesize affine loops without guard conditions from polynomial invariants. In this paper, we address a more general problem, allowing loops to have polynomial update maps with a given structure, inequations in the guard condition, and polynomial invariants of arbitrary form.
We use algebraic geometry tools to design and implement an algorithm that computes a finite set of polynomial equations whose solutions correspond to all nondeterministic branching loops satisfying the given invariants. Furthermore, we introduce a new class of invariants for which we present a significantly more efficient algorithm. In other words, we reduce the problem of synthesizing loops to find solutions of multivariate polynomial systems with rational entries. This final step is handled in our software using an SMT solver.2025-09-29T17:41:16ZErdenebayar BayarmagnaiFatemeh MohammadiRémi Prébethttp://arxiv.org/abs/1208.4041v4Ranking Functions for Linear-Constraint Loops2025-09-29T10:52:38ZIn this paper we study the complexity of the problems: given a loop, described by linear constraints over a finite set of variables, is there a linear or lexicographical-linear ranking function for this loop? While existence of such functions implies termination, these problems are not equivalent to termination. When the variables range over the rationals (or reals), it is known that both problems are PTIME decidable. However, when they range over the integers, whether for single-path or multipath loops, the complexity has not yet been determined. We show that both problems are coNP-complete. However, we point out some special cases of importance of PTIME complexity. We also present complete algorithms for synthesizing linear and lexicographical-linear ranking functions, both for the general case and the special PTIME cases. Moreover, in the rational setting, our algorithm for synthesizing lexicographical-linear ranking functions extends existing ones, because our class of ranking functions is more general, yet it has polynomial time complexity.2012-08-20T15:25:45ZThis version explains (in the new Lemma 5.31) how the procedure of Lemma 5.29 can be implemented in polynomial time. This new lemma does not appear in the JACM version of this paperJ. ACM 61(4): 26:1-26:55 (2014)Amir M. Ben-AmramSamir Genaim10.1145/2629488http://arxiv.org/abs/2509.24515v1Agentic Specification Generator for Move Programs2025-09-29T09:34:31ZWhile LLM-based specification generation is gaining traction, existing tools primarily focus on mainstream programming languages like C, Java, and even Solidity, leaving emerging and yet verification-oriented languages like Move underexplored. In this paper, we introduce MSG, an automated specification generation tool designed for Move smart contracts. MSG aims to highlight key insights that uniquely present when applying LLM-based specification generation to a new ecosystem. Specifically, MSG demonstrates that LLMs exhibit robust code comprehension and generation capabilities even for non-mainstream languages. MSG successfully generates verifiable specifications for 84% of tested Move functions and even identifies clauses previously overlooked by experts. Additionally, MSG shows that explicitly leveraging specification language features through an agentic, modular design improves specification quality substantially (generating 57% more verifiable clauses than conventional designs). Incorporating feedback from the verification toolchain further enhances the effectiveness of MSG, leading to a 30% increase in generated verifiable specifications.2025-09-29T09:34:31Z18 pages; Extended version of ASE'25 paper with extra appendicesYu-Fu FuMeng XuTaesoo Kimhttp://arxiv.org/abs/2510.09633v1Hound: Relation-First Knowledge Graphs for Complex-System Reasoning in Security Audits2025-09-29T02:46:02ZHound introduces a relation-first graph engine that improves system-level reasoning across interrelated components in complex codebases. The agent designs flexible, analyst-defined views with compact annotations (e.g., monetary/value flows, authentication/authorization roles, call graphs, protocol invariants) and uses them to anchor exact retrieval: for any question, it loads precisely the code that matters (often across components) so it can zoom out to system structure and zoom in to the decisive lines. A second contribution is a persistent belief system: long-lived vulnerability hypotheses whose confidence is updated as evidence accrues. The agent employs coverage-versus-intuition planning and a QA finalizer to confirm or reject hypotheses. On a five-project subset of ScaBench[1], Hound improves recall and F1 over a baseline LLM analyzer (micro recall 31.2% vs. 8.3%; F1 14.2% vs. 9.8%) with a modest precision trade-off. We attribute these gains to flexible, relation-first graphs that extend model understanding beyond call/dataflow to abstract aspects, plus the hypothesis-centric loop; code and artifacts are released to support reproduction.2025-09-29T02:46:02ZBernhard Muellerhttp://arxiv.org/abs/2510.15914v1VeriGRAG: Enhancing LLM-Based Verilog Code Generation with Structure-Aware Soft Prompts2025-09-27T10:23:36ZLarge language models (LLMs) have demonstrated strong capabilities in generating Verilog code from natural language descriptions. However, Verilog code inherently encodes structural information of hardware circuits. Effectively leveraging this structural information to enhance the functional and syntactic correctness of LLM-generated Verilog code remains a significant challenge. To address this challenge, we propose VeriGRAG , a novel framework that extracts structural graph embeddings from Verilog code using graph neural networks (GNNs). A multimodal retriever then selects the graph embeddings most relevant to the given generation task, which are aligned with the code modality through the VeriFormer module to generate structure-aware soft prompts. Our experiments demonstrate that VeriGRAG substantially improves the correctness of Verilog code generation, achieving state-of-the-art or superior performance across both VerilogEval and RTLLM benchmarks.2025-09-27T10:23:36Z9 pages, 5 figuresJiayu ZhaoSong Chenhttp://arxiv.org/abs/2509.25248v1BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software2025-09-27T03:02:46ZAutomatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) used selective evaluation on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with enhanced build instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and is adaptable to heterogeneous OSS characteristics. We also provide detailed analysis regarding different compilation method design choices and their influence to the whole task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering tasks, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.2025-09-27T03:02:46ZZehua ZhangAti Priya BajajDivij HandaSiyu LiuArvind S RajHongkai ChenHulin WangYibo LiuZion Leonahenahe BasqueSouradip NathVishal JunejaNikhil ChapreYan ShoshitaishviliAdam DoupéChitta BaralRuoyu Wanghttp://arxiv.org/abs/2509.23061v1Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification2025-09-27T02:33:08ZWe introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.2025-09-27T02:33:08ZXu XuXin LiXingwei QuJie FuBinhang Yuan