https://arxiv.org/api/bsn79oUMbN+DxRIbPhU1f3ygBtk2026-06-13T15:35:44Z98857515http://arxiv.org/abs/2510.05115v3SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling2026-05-29T07:07:01ZLarge language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.2025-09-28T12:25:31ZICML 2026 acceptedYansen ZhangQingcan KangYujie ChenYufei WangXiongwei HanTao ZhongMingxuan YuanChen Mahttp://arxiv.org/abs/2502.04671v3ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving2026-05-29T04:28:35ZNeural approaches to theorem proving require robust infrastructure for interfacing with interactive theorem provers (ITPs), extracting structured proof data, and executing proof search at scale. However, existing tooling is often assistant-specific and oriented toward file-level execution, making repository-scale analysis and parallel experimentation challenging. We present ProofWala, a multilingual proof engineering framework built around \texttt{itp-interface}, a reusable library for programmatic interaction with ITPs. For Lean 4, we implement a meta-programmed interaction layer executing inside the elaborator, enabling semantically faithful tactic-level tracing alongside declaration- and dependency-level extraction across entire repositories. This design extends beyond traditional REPL-style interaction by supporting project-wide analysis, environment cloning, and pooled execution of proof states. The same interface abstraction supports multiple versions of Rocq, yielding a unified cross-assistant pipeline.
Built on this infrastructure, ProofWala provides standardized multilingual proof datasets, model training utilities, and parallel proof search algorithms. Using the framework, we demonstrate that multilingual training across Lean and Rocq enables cross-lingual and cross-domain transfer. We observe statistically significant improvements on Lean Mathlib and in domain adaptation (CategoryTheory), while other settings exhibit consistent upward trends. We open-source the full framework, parallel proof search module, datasets, and models across two repositories: ProofWala (https://github.com/trishullab/proof-wala) and the itp-interface library (https://github.com/trishullab/itp-interface).2025-02-07T05:35:46ZAmitayush ThakurGeorge TsoukalasGreg DurrettSwarat Chaudhurihttp://arxiv.org/abs/2606.00131v1AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve2026-05-28T20:43:55ZPost-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult due to a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high-fidelity, we move away from approximate static cost models and the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks including large warehouse-scale applications and experiments show performance improvements of 0.23% to 1.6% optimized with state-of-the-art FDO and PLO which is significant for real-world binaries. This is the first time ever that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.2026-05-28T20:43:55ZChaitanya Mamatha AnandaRajiv GuptaMircea TrofinAiden GrossmanSriraman TallamXinliang David LiAmir Yazdanbakhshhttp://arxiv.org/abs/2605.30203v1A Bayesian Approach to Membership Inference for Statistical Release2026-05-28T16:42:04ZThe membership inference problem for publicly released statistics from a private dataset is well-studied. When developing and formally analyzing attack strategies, however, the focus has been on attacks that model the population using only its marginals. In practice, these attacks can perform well on various populations, however most formal analysis is for populations that follow a product distribution. These strategies may fail to leverage useful information about the population that is important for understanding a realistic privacy threat.
In this work, we explore the impact of providing an attacker with additional information about the attribute dependency structure of the population, motivated by examples where multiple parties may have access to similarly structured data, for example the US Census and the IRS. To model this scenario, we re-frame the membership inference problem with respect to a population represented as a Bayesian network (BN). We develop a framework based on Bayesian decision-making which can incorporate prior information about the population to launch more effective, specialized attacks.
To evaluate our framework, we introduce a specific attack instantiation which computes the Bayesian posterior using a probabilistic program, and prove its equivalence to an optimal variant of the likelihood ratio test attack for two populations with strong attribute dependency. We implement our program in the Roulette probabilistic programming language and show experimentally that it outperforms the likelihood ratio test and inner product attacks on five commonly used BNs, where the population dependency structure is too complex for the existing attacks to be manually adapted.2026-05-28T16:42:04ZLisa OakleySam StitesCameron MoySteven HoltzenAlina OpreaMarco Gaboardihttp://arxiv.org/abs/2404.16077v4CompilerDream: Learning a Compiler World Model for General Code Optimization2026-05-28T09:38:13ZEffective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code. While most compilers rely on a fixed sequence of optimization passes, current methods to find the optimal sequence either employ impractically slow search algorithms or learning methods that struggle to generalize to code unseen during training. We introduce CompilerDream, a model-based reinforcement learning approach to general code optimization. CompilerDream comprises a compiler world model that accurately simulates the intrinsic properties of optimization passes and an agent trained on this model to produce effective optimization strategies. By training on a large-scale program dataset, CompilerDream is equipped to serve as a general code optimizer across various application scenarios and source-code languages. Our extensive experiments first highlight CompilerDream's strong optimization capabilities for autotuning, where it leads the CompilerGym leaderboard. More importantly, the zero-shot generalization ability of large-scale trained compiler world model and agent, excels across diverse datasets, surpassing LLVM's built-in optimizations and other state-of-the-art methods in both settings of value prediction and end-to-end code optimization.2024-04-24T09:20:33ZKDD 2025 camera-ready version with extended appendix. Code is available at https://github.com/thuml/CompilerDream. This update additionally fixes an issue in Table 6 where the dataset names in three rows were ordered incorrectlyChaoyi DengJialong WuNingya FengJianmin WangMingsheng Longhttp://arxiv.org/abs/2605.29357v1PassNet: Scaling Large Language Models for Graph Compiler Pass Generation2026-05-28T04:55:14ZModern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.2026-05-28T04:55:14ZCode and data available at https://github.com/PaddlePaddle/PassNetYiqun LiuYingsheng WuRuqi YangEnrong ZhengHonglei QiuSijun HeTai LiangJingjing WuYuhan ZhouYiwei ZhangDongyan ChenWeihan YiXinqi LiSiqi Baohttp://arxiv.org/abs/2601.17670v2Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop2026-05-28T00:44:24ZMathematical programming is widely employed across various sectors - such as logistics, energy, and workforce planning - to model and solve industrial optimisation problems, but its use requires substantial domain expertise. Large language models offer a promising way to translate natural-language problem descriptions into optimisation models, yet existing approaches are costly and generally produce models written in general-purpose computer code (e.g. Python), which can be difficult to inspect, validate, and reuse. In this work, we introduce SyntAGM, a system that generates optimisation models in a readable algebraic modelling language through an iterative generate-compile-assess-revise loop. SyntAGM leverages PyOPL, an OPL-like modelling language compiler designed to provide actionable feedback for iterative model repair. To obtain a valid PyOPL model that matches the problem description, SyntAGM mobilises compiler feedback and an LLM-based alignment judge. In addition, it combines in-context exposure to the target language grammar, and few-shot retrieval of modelling exemplars. Across multiple benchmarks, SyntAGM achieves a more favourable cost-quality trade-off compared to established prompting baselines.2026-01-25T03:19:49Z18 pages, 7 figuresRoberto RossiSteven D. Prestwichhttp://arxiv.org/abs/2606.00118v1An Empirical Study on Logging Evolution On Stack Overflow: Trends, Topics, and Challenges2026-05-27T19:42:01ZContext: Logging is a crucial practice in software engineering, aiding developers in debugging applications when errors occur. While existing research has explored logging challenges from an academic perspective through literature reviews and source code analysis, a comprehensive study from the practitioners' perspective remains lacking.
Objective: This paper aims to bridge this knowledge gap by presenting an in-depth analysis of trends, topics, and challenges in logging based on a dataset of 216,094 posts from Stack Overflow (SO), a popular Q\&A platform for developers.
Method: We analyzed longitudinal trends by examining metadata related to users, questions, and tags associated with logging discussions. To identify prevalent discussion topics, we employed a Large Language Model (LLM)--based classification approach, based on a manually validated ground-truth sample. Topic popularity was assessed through average scores and views, while difficulty was measured using three community-driven metrics: the proportion of questions without accepted answers, the proportion of unanswered questions, and the median time to receive an accepted answer.
Results: Our analysis identifies 11 distinct topics, with the top three (General Logging Practices, Error Handling and Debugging, and Logging Levels and Output) accounting for over 70\% of all logging-related discussions. Notably, Logging in Containerized Environments emerged as the most difficult topic: 64.9\% of its questions lack an accepted answer, and its median resolution time is among the highest. These findings highlight enduring practitioner struggles with logging in Docker or other containerized environments and the integration of logging pipelines into orchestrators such as Kubernetes and cloud environments.
Conclusion: This study sheds light on the practical challenges of logging and provides actionable insights for developers, framework vendors, researchers, and educators.2026-05-27T19:42:01ZPatrick Loic FoalemAndre NguimbousFoutse KhomhHeng LiEttore Merlohttp://arxiv.org/abs/2605.28989v1Generalized Software Product Line Extraction2026-05-27T18:48:07ZSoftware product line (SPL) engineering has been successfully applied to software development by obtaining software systems as compositions of modular features. Existing approaches to SPL engineering, however, are typically bound to a specific technological space (such as, a programming language and a composer) and integrated development environment (IDE), and rely on extraction mechanisms that make strong assumptions on the underlying technological space. This tight coupling hinders reuse, evolution, and adoption of heterogeneous development environments. We propose a general, workbench-agnostic protocol for extracting feature models from existing software artifacts and for configuring and deriving software products. The protocol follows a bottom-up approach based on lightweight dependency units called "atoms", and organizes the extraction and configuration process around an SPL server (workbench-independent) and an SPL client with a workbench-specific backend and a generic frontend. The protocol makes few assumptions on the underlying software artifacts and is therefore applicable to varied SPLs. The applicability of this approach is presented through a prototypical implementation of the architecture in which several subsystems interact and can be swapped freely without affecting the others. In particular, we focus on the application of such a protocol in the context of language product lines (LPLs), demonstrating its applicability to concrete scenarios while preserving workbench-agnosticism. From bottom to top, the implementation comprises: Neverlang language artifacts, a Java SPL client backend, an agnostic and reusable SPL server written in Go and Prolog, and a JavaScript SPL client frontend.2026-05-27T18:48:07Z17 pages, 3 figures, 4 listingsFederico BruzzoneWalter CazzolaLuca Favallihttp://arxiv.org/abs/2605.28694v1E-Path: Equality Saturation for Control-Flow Graphs2026-05-27T16:20:44ZModern equality saturation systems excel at expression-level rewrites by exploring large spaces of equivalent programs without suffering from the phase-ordering problem. How- ever, these systems struggle to represent equivalence directly over arbitrary control-flow graphs, often requiring normal- ization into structured or tree-like forms before optimization can occur. We present the E-Path data structure, a prototype frame- work for equality-saturation-style optimization over control- flow graphs. Instead of representing congruence between individual expressions, the E-Path records equivalence be- tween instruction sequences embedded within a compiler intermediate representation. In this prototype, E-Path is in- stantiated over a restricted ANF-based control-flow graph used in a compiler backend, but the model itself is intended to be IR-agnostic. By treating instruction sequences as the fundamental unit of congruence, the E-Path enables non-destructive optimiza- tion of loops and other control-flow structures while preserv- ing multiple equivalent program variants simultaneously. This allows classical CFG optimizations to be expressed as rewrite-driven transformations without destructive mutation of the underlying graph.2026-05-27T16:20:44Z4 pagesGuillermo Garciahttp://arxiv.org/abs/2605.28617v1LACUNA: Safe Agents as Recursive Program Holes2026-05-27T15:27:25ZLLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call $\texttt{agent[T](task)}$ that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and $τ^2$-bench. On BrowseComp-Plus, $8.6\%$ of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches $27.1\%$ accuracy. On $τ^2$-bench, LACUNA solves $76.0\%$ of $392$ tasks across four domains with a capable model, on par with the baseline agent.2026-05-27T15:27:25ZYaoyu ZhaoYichen XuOliver BračevacCao Nguyen PhamFrank Zhengqing WuMartin Oderskyhttp://arxiv.org/abs/2605.27955v1Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents2026-05-27T04:48:40ZMarkdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-retrieve -> still confused" loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.2026-05-27T04:48:40ZPreprint. Code: https://github.com/InternLM/Skill-as-PseudocodeXinze LiYuhang ZangYixin CaoAixin Sunhttp://arxiv.org/abs/2605.27849v1FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation2026-05-27T02:06:11ZDespite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.2026-05-27T02:06:11ZLoc PhamLang Hong Nguyet AnhThanh Le-Conghttp://arxiv.org/abs/2605.26527v2A Formal Semantics of C with OpenMP Parallelism (Extended Version)2026-05-27T02:00:04ZOpenMP is a popular parallelization framework that lets users transform sequential code into parallel code with a few simple annotations. Unfortunately, it is also easy to inadvertently introduce errors by adding OpenMP pragmas into otherwise correct programs, including both logic errors and race conditions. We present a formal semantics for C code with OpenMP directives, building on the C semantics of the CompCert verified compiler and its extension to concurrency. Our semantics captures subtle interactions between OpenMP directives and variable state that have been obscured by previous OpenMP semantics, and provides a basis for detecting undesired behaviors introduced by incorrect annotations: in particular, any successful execution is guaranteed to be free of data races.2026-05-26T04:19:20ZKe DuAnshu SharmaLiyi LiWilliam Manskyhttp://arxiv.org/abs/2605.27531v1Agentic Separation Logic Specification Synthesis2026-05-26T18:05:42ZSpecification synthesis, the task of automatically inferring formal specifications from program implementations and natural language, is important for refactoring, transpilation, optimization, and verification, yet remains an open challenge for large C++ repositories. Existing LLM-based approaches fail to simultaneously scale to such repositories, produce specifications expressive enough to capture systems-code features such as dynamic memory and heap-allocated data structures, and systematically validate those specifications to rule out incorrect candidates. We present Spec-Agent, an agentic system for synthesizing expressive, well-validated specifications across large C++ codebases. Spec-Agent targets a ladder of specification languages: propositional logic, first-order logic, propositional separation logic, and first-order separation logic. For each function, Spec-Agent uses static analysis and runtime heap tracing to select the appropriate target specification language, generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidates via counterexample-guided feedback. We evaluate Spec-Agent on open source C++ codebases comprising millions of lines of code. Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, outperforming Claude Code Opus 4.6 at 10x lower token cost.2026-05-26T18:05:42Z9 pages, 3 appendicesTarun SureshDavid KorczynskiJulien Vanegue