https://arxiv.org/api/bsn79oUMbN+DxRIbPhU1f3ygBtk 2026-06-13T15:35:44Z 9885 75 15 http://arxiv.org/abs/2510.05115v3 SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling 2026-05-29T07:07:01Z

Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.

2025-09-28T12:25:31Z ICML 2026 accepted Yansen Zhang Qingcan Kang Yujie Chen Yufei Wang Xiongwei Han Tao Zhong Mingxuan Yuan Chen Ma http://arxiv.org/abs/2502.04671v3 ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving 2026-05-29T04:28:35Z

Neural approaches to theorem proving require robust infrastructure for interfacing with interactive theorem provers (ITPs), extracting structured proof data, and executing proof search at scale. However, existing tooling is often assistant-specific and oriented toward file-level execution, making repository-scale analysis and parallel experimentation challenging. We present ProofWala, a multilingual proof engineering framework built around \texttt{itp-interface}, a reusable library for programmatic interaction with ITPs. For Lean 4, we implement a meta-programmed interaction layer executing inside the elaborator, enabling semantically faithful tactic-level tracing alongside declaration- and dependency-level extraction across entire repositories. This design extends beyond traditional REPL-style interaction by supporting project-wide analysis, environment cloning, and pooled execution of proof states. The same interface abstraction supports multiple versions of Rocq, yielding a unified cross-assistant pipeline. Built on this infrastructure, ProofWala provides standardized multilingual proof datasets, model training utilities, and parallel proof search algorithms. Using the framework, we demonstrate that multilingual training across Lean and Rocq enables cross-lingual and cross-domain transfer. We observe statistically significant improvements on Lean Mathlib and in domain adaptation (CategoryTheory), while other settings exhibit consistent upward trends. We open-source the full framework, parallel proof search module, datasets, and models across two repositories: ProofWala (https://github.com/trishullab/proof-wala) and the itp-interface library (https://github.com/trishullab/itp-interface).

2025-02-07T05:35:46Z Amitayush Thakur George Tsoukalas Greg Durrett Swarat Chaudhuri http://arxiv.org/abs/2606.00131v1 AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve 2026-05-28T20:43:55Z

Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult due to a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high-fidelity, we move away from approximate static cost models and the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks including large warehouse-scale applications and experiments show performance improvements of 0.23% to 1.6% optimized with state-of-the-art FDO and PLO which is significant for real-world binaries. This is the first time ever that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.

2026-05-28T20:43:55Z Chaitanya Mamatha Ananda Rajiv Gupta Mircea Trofin Aiden Grossman Sriraman Tallam Xinliang David Li Amir Yazdanbakhsh http://arxiv.org/abs/2605.30203v1 A Bayesian Approach to Membership Inference for Statistical Release 2026-05-28T16:42:04Z

The membership inference problem for publicly released statistics from a private dataset is well-studied. When developing and formally analyzing attack strategies, however, the focus has been on attacks that model the population using only its marginals. In practice, these attacks can perform well on various populations, however most formal analysis is for populations that follow a product distribution. These strategies may fail to leverage useful information about the population that is important for understanding a realistic privacy threat. In this work, we explore the impact of providing an attacker with additional information about the attribute dependency structure of the population, motivated by examples where multiple parties may have access to similarly structured data, for example the US Census and the IRS. To model this scenario, we re-frame the membership inference problem with respect to a population represented as a Bayesian network (BN). We develop a framework based on Bayesian decision-making which can incorporate prior information about the population to launch more effective, specialized attacks. To evaluate our framework, we introduce a specific attack instantiation which computes the Bayesian posterior using a probabilistic program, and prove its equivalence to an optimal variant of the likelihood ratio test attack for two populations with strong attribute dependency. We implement our program in the Roulette probabilistic programming language and show experimentally that it outperforms the likelihood ratio test and inner product attacks on five commonly used BNs, where the population dependency structure is too complex for the existing attacks to be manually adapted.

2026-05-28T16:42:04Z Lisa Oakley Sam Stites Cameron Moy Steven Holtzen Alina Oprea Marco Gaboardi http://arxiv.org/abs/2404.16077v4 CompilerDream: Learning a Compiler World Model for General Code Optimization 2026-05-28T09:38:13Z

Effective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code. While most compilers rely on a fixed sequence of optimization passes, current methods to find the optimal sequence either employ impractically slow search algorithms or learning methods that struggle to generalize to code unseen during training. We introduce CompilerDream, a model-based reinforcement learning approach to general code optimization. CompilerDream comprises a compiler world model that accurately simulates the intrinsic properties of optimization passes and an agent trained on this model to produce effective optimization strategies. By training on a large-scale program dataset, CompilerDream is equipped to serve as a general code optimizer across various application scenarios and source-code languages. Our extensive experiments first highlight CompilerDream's strong optimization capabilities for autotuning, where it leads the CompilerGym leaderboard. More importantly, the zero-shot generalization ability of large-scale trained compiler world model and agent, excels across diverse datasets, surpassing LLVM's built-in optimizations and other state-of-the-art methods in both settings of value prediction and end-to-end code optimization.

2024-04-24T09:20:33Z KDD 2025 camera-ready version with extended appendix. Code is available at https://github.com/thuml/CompilerDream. This update additionally fixes an issue in Table 6 where the dataset names in three rows were ordered incorrectly Chaoyi Deng Jialong Wu Ningya Feng Jianmin Wang Mingsheng Long http://arxiv.org/abs/2605.29357v1 PassNet: Scaling Large Language Models for Graph Compiler Pass Generation 2026-05-28T04:55:14Z

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

2026-05-28T04:55:14Z Code and data available at https://github.com/PaddlePaddle/PassNet Yiqun Liu Yingsheng Wu Ruqi Yang Enrong Zheng Honglei Qiu Sijun He Tai Liang Jingjing Wu Yuhan Zhou Yiwei Zhang Dongyan Chen Weihan Yi Xinqi Li Siqi Bao http://arxiv.org/abs/2601.17670v2 Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop 2026-05-28T00:44:24Z

Mathematical programming is widely employed across various sectors - such as logistics, energy, and workforce planning - to model and solve industrial optimisation problems, but its use requires substantial domain expertise. Large language models offer a promising way to translate natural-language problem descriptions into optimisation models, yet existing approaches are costly and generally produce models written in general-purpose computer code (e.g. Python), which can be difficult to inspect, validate, and reuse. In this work, we introduce SyntAGM, a system that generates optimisation models in a readable algebraic modelling language through an iterative generate-compile-assess-revise loop. SyntAGM leverages PyOPL, an OPL-like modelling language compiler designed to provide actionable feedback for iterative model repair. To obtain a valid PyOPL model that matches the problem description, SyntAGM mobilises compiler feedback and an LLM-based alignment judge. In addition, it combines in-context exposure to the target language grammar, and few-shot retrieval of modelling exemplars. Across multiple benchmarks, SyntAGM achieves a more favourable cost-quality trade-off compared to established prompting baselines.

2026-01-25T03:19:49Z 18 pages, 7 figures Roberto Rossi Steven D. Prestwich http://arxiv.org/abs/2606.00118v1 An Empirical Study on Logging Evolution On Stack Overflow: Trends, Topics, and Challenges 2026-05-27T19:42:01Z

Context: Logging is a crucial practice in software engineering, aiding developers in debugging applications when errors occur. While existing research has explored logging challenges from an academic perspective through literature reviews and source code analysis, a comprehensive study from the practitioners' perspective remains lacking. Objective: This paper aims to bridge this knowledge gap by presenting an in-depth analysis of trends, topics, and challenges in logging based on a dataset of 216,094 posts from Stack Overflow (SO), a popular Q\&A platform for developers. Method: We analyzed longitudinal trends by examining metadata related to users, questions, and tags associated with logging discussions. To identify prevalent discussion topics, we employed a Large Language Model (LLM)--based classification approach, based on a manually validated ground-truth sample. Topic popularity was assessed through average scores and views, while difficulty was measured using three community-driven metrics: the proportion of questions without accepted answers, the proportion of unanswered questions, and the median time to receive an accepted answer. Results: Our analysis identifies 11 distinct topics, with the top three (General Logging Practices, Error Handling and Debugging, and Logging Levels and Output) accounting for over 70\% of all logging-related discussions. Notably, Logging in Containerized Environments emerged as the most difficult topic: 64.9\% of its questions lack an accepted answer, and its median resolution time is among the highest. These findings highlight enduring practitioner struggles with logging in Docker or other containerized environments and the integration of logging pipelines into orchestrators such as Kubernetes and cloud environments. Conclusion: This study sheds light on the practical challenges of logging and provides actionable insights for developers, framework vendors, researchers, and educators.

2026-05-27T19:42:01Z Patrick Loic Foalem Andre Nguimbous Foutse Khomh Heng Li Ettore Merlo http://arxiv.org/abs/2605.28989v1 Generalized Software Product Line Extraction 2026-05-27T18:48:07Z

Software product line (SPL) engineering has been successfully applied to software development by obtaining software systems as compositions of modular features. Existing approaches to SPL engineering, however, are typically bound to a specific technological space (such as, a programming language and a composer) and integrated development environment (IDE), and rely on extraction mechanisms that make strong assumptions on the underlying technological space. This tight coupling hinders reuse, evolution, and adoption of heterogeneous development environments. We propose a general, workbench-agnostic protocol for extracting feature models from existing software artifacts and for configuring and deriving software products. The protocol follows a bottom-up approach based on lightweight dependency units called "atoms", and organizes the extraction and configuration process around an SPL server (workbench-independent) and an SPL client with a workbench-specific backend and a generic frontend. The protocol makes few assumptions on the underlying software artifacts and is therefore applicable to varied SPLs. The applicability of this approach is presented through a prototypical implementation of the architecture in which several subsystems interact and can be swapped freely without affecting the others. In particular, we focus on the application of such a protocol in the context of language product lines (LPLs), demonstrating its applicability to concrete scenarios while preserving workbench-agnosticism. From bottom to top, the implementation comprises: Neverlang language artifacts, a Java SPL client backend, an agnostic and reusable SPL server written in Go and Prolog, and a JavaScript SPL client frontend.

2026-05-27T18:48:07Z 17 pages, 3 figures, 4 listings Federico Bruzzone Walter Cazzola Luca Favalli http://arxiv.org/abs/2605.28694v1 E-Path: Equality Saturation for Control-Flow Graphs 2026-05-27T16:20:44Z

Modern equality saturation systems excel at expression-level rewrites by exploring large spaces of equivalent programs without suffering from the phase-ordering problem. How- ever, these systems struggle to represent equivalence directly over arbitrary control-flow graphs, often requiring normal- ization into structured or tree-like forms before optimization can occur. We present the E-Path data structure, a prototype frame- work for equality-saturation-style optimization over control- flow graphs. Instead of representing congruence between individual expressions, the E-Path records equivalence be- tween instruction sequences embedded within a compiler intermediate representation. In this prototype, E-Path is in- stantiated over a restricted ANF-based control-flow graph used in a compiler backend, but the model itself is intended to be IR-agnostic. By treating instruction sequences as the fundamental unit of congruence, the E-Path enables non-destructive optimiza- tion of loops and other control-flow structures while preserv- ing multiple equivalent program variants simultaneously. This allows classical CFG optimizations to be expressed as rewrite-driven transformations without destructive mutation of the underlying graph.

2026-05-27T16:20:44Z 4 pages Guillermo Garcia http://arxiv.org/abs/2605.28617v1 LACUNA: Safe Agents as Recursive Program Holes 2026-05-27T15:27:25Z

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call $\texttt{agent[T](task)}$ that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and $τ^2$-bench. On BrowseComp-Plus, $8.6\%$ of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches $27.1\%$ accuracy. On $τ^2$-bench, LACUNA solves $76.0\%$ of $392$ tasks across four domains with a capable model, on par with the baseline agent.

2026-05-27T15:27:25Z Yaoyu Zhao Yichen Xu Oliver Bračevac Cao Nguyen Pham Frank Zhengqing Wu Martin Odersky http://arxiv.org/abs/2605.27955v1 Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents 2026-05-27T04:48:40Z

Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-retrieve -> still confused" loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.

2026-05-27T04:48:40Z Preprint. Code: https://github.com/InternLM/Skill-as-Pseudocode Xinze Li Yuhang Zang Yixin Cao Aixin Sun http://arxiv.org/abs/2605.27849v1 FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation 2026-05-27T02:06:11Z

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.

2026-05-27T02:06:11Z Loc Pham Lang Hong Nguyet Anh Thanh Le-Cong http://arxiv.org/abs/2605.26527v2 A Formal Semantics of C with OpenMP Parallelism (Extended Version) 2026-05-27T02:00:04Z

OpenMP is a popular parallelization framework that lets users transform sequential code into parallel code with a few simple annotations. Unfortunately, it is also easy to inadvertently introduce errors by adding OpenMP pragmas into otherwise correct programs, including both logic errors and race conditions. We present a formal semantics for C code with OpenMP directives, building on the C semantics of the CompCert verified compiler and its extension to concurrency. Our semantics captures subtle interactions between OpenMP directives and variable state that have been obscured by previous OpenMP semantics, and provides a basis for detecting undesired behaviors introduced by incorrect annotations: in particular, any successful execution is guaranteed to be free of data races.

2026-05-26T04:19:20Z Ke Du Anshu Sharma Liyi Li William Mansky http://arxiv.org/abs/2605.27531v1 Agentic Separation Logic Specification Synthesis 2026-05-26T18:05:42Z

Specification synthesis, the task of automatically inferring formal specifications from program implementations and natural language, is important for refactoring, transpilation, optimization, and verification, yet remains an open challenge for large C++ repositories. Existing LLM-based approaches fail to simultaneously scale to such repositories, produce specifications expressive enough to capture systems-code features such as dynamic memory and heap-allocated data structures, and systematically validate those specifications to rule out incorrect candidates. We present Spec-Agent, an agentic system for synthesizing expressive, well-validated specifications across large C++ codebases. Spec-Agent targets a ladder of specification languages: propositional logic, first-order logic, propositional separation logic, and first-order separation logic. For each function, Spec-Agent uses static analysis and runtime heap tracing to select the appropriate target specification language, generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidates via counterexample-guided feedback. We evaluate Spec-Agent on open source C++ codebases comprising millions of lines of code. Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, outperforming Claude Code Opus 4.6 at 10x lower token cost.

2026-05-26T18:05:42Z 9 pages, 3 appendices Tarun Suresh David Korczynski Julien Vanegue