https://arxiv.org/api/M/PC3bIKaZZEID2o/wDPPRsudLI 2026-06-23T12:47:54Z 9934 990 15 http://arxiv.org/abs/2511.04115v1 How Natural Language Proficiency Shapes GenAI Code for Software Engineering Tasks 2025-11-06T07:06:20Z

With the widespread adoption of Foundation Model (FM)-powered tools in software engineering, the natural language prompt has become a critical interface between developers and Large Language Models (LLMs). While much research has focused on prompt structure, the natural language proficiency is an underexplored factor that can influence the quality of generated code. This paper investigates whether the English language proficiency itself independent of the prompting technique affects the proficiency and correctness of code generated by LLMs. Using the HumanEval dataset, we systematically varied the English proficiency of prompts from basic to advanced for 164 programming tasks and measured the resulting code proficiency and correctness. Our findings show that LLMs default to an intermediate (B2) natural language level. While the effect on the resulting code proficiency was model-dependent, we found that higher-proficiency prompts consistently yielded more correct code across all models. These results demonstrate that natural language proficiency is a key lever for controlling code generation, helping developers tailor AI output and improve the reliability of solutions.

2025-11-06T07:06:20Z 7 pages, 4 tables, 1 figure Ruksit Rojpaisarnkit Youmei Fan Kenichi Matsumoto Raula Gaikovina Kula 10.1109/MS.2025.3622690 http://arxiv.org/abs/2505.18574v5 Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators 2025-11-05T23:37:18Z

Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages, such as specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.

2025-05-24T07:35:34Z 10 pages + appendices Charles Hong Sahil Bhatia Alvin Cheung Yakun Sophia Shao http://arxiv.org/abs/2511.02595v1 Nominal Algebraic-Coalgebraic Data Types, with Applications to Infinitary Lambda-Calculi 2025-11-04T14:18:32Z

Ten years ago, it was shown that nominal techniques can be used to design coalgebraic data types with variable binding, so that alpha-equivalence classes of infinitary terms are directly endowed with a corecursion principle. We introduce "mixed" binding signatures, as well as the corresponding type of mixed inductive-coinductive terms. We extend the aforementioned work to this setting. In particular, this allows for a nominal description of the sets Lambda_abc of abc-infinitary lambda-terms (for a, b, c in {0,1}) and of capture-avoiding substitution on alpha-equivalence classes of such terms.

2025-11-04T14:18:32Z In Proceedings FICS 2024, arXiv:2511.00626 EPTCS 435, 2025, pp. 59-70 Rémy Cerda Aix-Marseille Université, CNRS, I2M 10.4204/EPTCS.435.5 http://arxiv.org/abs/2508.09856v2 Invertible Syntax without the Tuples (Functional Pearl) 2025-11-04T12:38:57Z

In the seminal paper Functional unparsing, Olivier Danvy used continuation passing to reanalyse printf-like format strings as combinators. In the intervening decades, the conversation shifted towards a concurrent line of work -- applicative, monadic or arrow-based combinator libraries -- in an effort to find combinators for invertible syntax descriptions that simultaneously determine a parser as well as a printer, and with more expressive power, able to handle inductive structures such as lists and trees. Along the way, continuation passing got lost. This paper argues that Danvy's insight remains as relevant to the general setting as it was to the restricted setting of his original paper. Like him, we present three solutions that exploit continuation-passing style as an alternative to both dependent types and monoidal aggregation via nested pairs, in our case to parse and print structured data with increasing expressive power.

2025-08-13T14:49:16Z Proceedings of the Workshop Dedicated to Olivier Danvy on the Occasion of His 64th Birthday (OLIVIERFEST '25), October 12--18, 2025, Singapore, Singapore Mathieu Boespflug Arnaud Spiwack 10.1145/3759427.3760381 http://arxiv.org/abs/2511.02491v1 Oriented Metrics for Bottom-Up Enumerative Synthesis 2025-11-04T11:27:03Z

In syntax-guided synthesis, one of the challenges is to reduce the enormous size of the search space. We observe that most search spaces are not just flat sets of programs, but can be endowed with a structure that we call an oriented metric. Oriented metrics measure the distance between programs, like ordinary metrics do, but are designed for settings in which operations have an orientation. Our focus is on the string and the bitvector domains, where operations like concatenation and bitwise conjunction transform an input into an output in a way that is not symmetric. We develop several new oriented metrics for these domains. Oriented metrics are designed for search space reduction, and we present four techniques: (i) pruning the search space to a ball around the ground truth, (ii) factorizing the search space by an equivalence that is induced by the oriented metric, (iii) abstracting the oriented metric (and hence the equivalence) and refining it, and (iv) improving the enumeration order by learning from abstract information. We acknowledge that these techniques are inspired by developments in the literature. By understanding their roots in oriented metrics, we can substantially increase their applicability and efficiency. We have integrated these techniques into a new synthesis algorithm and implemented the algorithm in a new solver. Notably, our solver is generic in the oriented metric over which it computes. We conducted experiments in the string and the bitvector domains, and consistently improve the performance over the state-of-the-art by more than an order of magnitude.

2025-11-04T11:27:03Z 37 pages, 16 figures Roland Meyer Jakob Tepe 10.1145/3776717 http://arxiv.org/abs/2511.02285v1 VFocus: Better Verilog Generation from Large Language Model via Focused Reasoning 2025-11-04T05:54:31Z

Large Language Models (LLMs) have shown impressive potential in generating Verilog codes, but ensuring functional correctness remains a challenge. Existing approaches often rely on self-consistency or simulation feedback to select the best candidate, but they miss opportunities to focus LLM reasoning on the most informative parts of the design. We propose VFocus, a three-stage framework that enhances Verilog generation by sharpening the focus of LLM reasoning onto critical decision points in the code generation process. In the \textbf{pre-ranking stage}, VFocus generates multiple code candidates through LLM prompting, retries for syntactically valid outputs, and introduces a \textit{Density-guided Filtering} to retain candidates that fall within the "reasoning sweet spot" for functional correctness. In the \textbf{ranking stage}, we simulate each code candidate using an automatically generated testbench and apply self-consistency-based clustering to identify the most consistent outputs. Finally, in the \textbf{post-ranking refinement stage}, VFocus performs inconsistency mining on top-ranked candidates and invokes reasoning-augmented LLM prompts for candidate refinement. Experiments on the VerilogEval-Human benchmark show that VFocus significantly improves the pass@1 correctness across multiple reasoning LLMs, demonstrating its effectiveness in enhancing Verilog generation for complex hardware design tasks.

2025-11-04T05:54:31Z accepted by SOCC 2025 Zhuorui Zhao Bing Li Grace Li Zhang Ulf Schlichtmann http://arxiv.org/abs/2511.02164v1 ScenicProver: A Framework for Compositional Probabilistic Verification of Learning-Enabled Systems 2025-11-04T01:09:08Z

Full verification of learning-enabled cyber-physical systems (CPS) has long been intractable due to challenges including black-box components and complex real-world environments. Existing tools either provide formal guarantees for limited types of systems or test the system as a monolith, but no general framework exists for compositional analysis of learning-enabled CPS using varied verification techniques over complex real-world environments. This paper introduces ScenicProver, a verification framework that aims to fill this gap. Built upon the Scenic probabilistic programming language, the framework supports: (1) compositional system description with clear component interfaces, ranging from interpretable code to black boxes; (2) assume-guarantee contracts over those components using an extension of Linear Temporal Logic containing arbitrary Scenic expressions; (3) evidence generation through testing, formal proofs via Lean 4 integration, and importing external assumptions; (4) systematic combination of generated evidence using contract operators; and (5) automatic generation of assurance cases tracking the provenance of system-level guarantees. We demonstrate the framework's effectiveness through a case study on an autonomous vehicle's automatic emergency braking system with sensor fusion. By leveraging manufacturer guarantees for radar and laser sensors and focusing testing efforts on uncertain conditions, our approach enables stronger probabilistic guarantees than monolithic testing with the same computational budget.

2025-11-04T01:09:08Z 26 pages, 4 figures. Full version (including appendices) of a paper submitted to TACAS 2026 Eric Vin Kyle A. Miller Inigo Incer Sanjit A. Seshia Daniel J. Fremont http://arxiv.org/abs/2511.02869v1 Analysis of AdvFusion: Adapter-based Multilingual Learning for Code Large Language Models 2025-11-03T23:45:27Z

Programming languages can benefit from one another by utilizing a language model for software engineering tasks. Full fine-tuning and Parameter Efficient Fine-Tuning (PEFT) of Code Language Models (Code-LMs) has been explored for multilingual knowledge transfer. AdapterFusion is a PEFT architecture that aims to enhance task performance by leveraging information from multiple programming languages, but primarily focuses on the target programming language. In our previous work, we proposed AdvFusion, a novel PEFT-based approach that effectively learns from other programming languages before adapting to the target task. Though previous experiments showed that AdvFusion outperformed AdapterFusion and LoRA, it was applied on pre-trained Code-LMs and was limited to only two tasks, code summarization and method name prediction. In this study, we expanded our work and investigated AdvFusion on Code Large Language Models (Code-LLMs), considering three new tasks: code generation, code translation, and commit message generation. We observed that different Code-LLMs/tasks exhibit different characteristics. In code generation, AdvFusion outperformed AdapterFusion but not other PEFT methods (LoRA, Compacter, and TaskAdapter). In commit message generation, AdapterFusion performed better than AdvFusion, and contrary to code generation, we found that the other PEFT methods do not have better performance. In code translation, AdvFusion performed worse than AdapterFusion overall, with the performance gap marginally widening as the model size increases. However, consistent with code generation, other PEFT methods showed better performance.

2025-11-03T23:45:27Z Amirreza Esmaeili Fahd Seddik Yongyi Ji Fatemeh Fard Fuxiang Chen http://arxiv.org/abs/2504.04365v5 AutoPDL: Automatic Prompt Optimization for LLM Agents 2025-11-03T21:46:50Z

The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.

2025-04-06T05:30:10Z An earlier version of this paper was published in AutoML 2025 Methods Track. This version adds missing standard deviations in Table 1 Claudio Spiess Mandana Vaziri Louis Mandel Martin Hirzel http://arxiv.org/abs/2511.01753v1 SM-based Semantics for Answer Set Programs Containing Conditional Literals and Arithmetic 2025-11-03T17:03:29Z

Modern answer set programming solvers such as CLINGO support advanced language constructs that improve the expressivity and conciseness of logic programs. Conditional literals are one such construct. They form "subformulas" that behave as nested implications within the bodies of logic rules. Their inclusion brings the form of rules closer to the less restrictive syntax of first-order logic. These qualities make conditional literals useful tools for knowledge representation. In this paper, we propose a semantics for logic programs with conditional literals and arithmetic based on the SM operator. These semantics do not require grounding, unlike the established semantics for such programs that relies on a translation to infinitary propositional logic. The main result of this paper establishes the precise correspondence between the proposed and existing semantics.

2025-11-03T17:03:29Z This version corrects the review of tau for negated atoms, and clarifies the distinction between global and local variables in conditional literals (the supporting proofs are also updated accordingly) In Practical Aspects of Declarative Languages: 27th International Symposium, PADL 2025, Denver, CO, USA, January 20-21, 2025, Proceedings. Springer-Verlag, Berlin, Heidelberg, 71-87 Zachary Hansen Yuliya Lierler 10.1007/978-3-031-84924-4_5 http://arxiv.org/abs/2511.01529v1 Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical Debt 2025-11-03T12:47:19Z

Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for proactive software maintenance. Previous research has primarily targeted detecting and prioritizing SATD, with little focus on the source code afflicted with SATD. Our goal in this work is to connect the SATD comments with source code constructs that surround them. Method. We leverage the extensive SATD dataset PENTACET, containing code comments from over 9000 Java Open Source Software (OSS) repositories. We quantitatively infer where SATD most commonly occurs and which code constructs/statements it most frequently affects. Results and Conclusions. Our large-scale study links over 225,000 SATD comments to their surrounding code, showing that SATD mainly arises in inline code near definitions, conditionals, and exception handling, where developers face uncertainty and trade-offs, revealing it as an intentional signal of awareness during change rather than mere neglect.

2025-11-03T12:47:19Z Murali Sridharan Mikel Robredo Leevi Rantala Matteo Esposito Valentina Lenarduzzi Mika Mantyla http://arxiv.org/abs/2511.01183v1 QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code 2025-11-03T03:20:26Z

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared to baseline prompts, the functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64, respectively. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang-O3 performance.

2025-11-03T03:20:26Z Accepted at NeurIPS 2025 Hainan Fang Yuanbo Wen Jun Bi Yihan Wang Tonghui He Yanlin Tang Di Huang Jiaming Guo Rui Zhang Qi Guo Yunji Chen http://arxiv.org/abs/2511.00740v1 Typed Embedding of miniKanren for Functional Conversion 2025-11-01T23:50:05Z

Relational programming enables program synthesis through a verifier-to-solver approach. An earlier paper introduced a functional conversion that mitigated some of the inherent performance overhead. However, the conversion was inelegant: it was oblivious to types, demanded determinism annotations, and implicit generator threading. In this paper, we address these issues by providing a typed tagless-final embedding of miniKanren into Haskell. This improvement significantly reduces boilerplate while preserving, and sometimes enhancing, earlier speedups.

2025-11-01T23:50:05Z Igor Engel Ekaterina Verbitskaia http://arxiv.org/abs/2511.00626v1 Proceedings Twelfth Workshop on Fixed Points in Computer Science 2025-11-01T17:08:22Z

This EPTCS volume contains the post-proceedings of the Twelfth International Workshop on Fixed Points in Computer Science, presenting a selection of the works presented during the workshop that took place in Naples (Italy) on the 19th and 20th of February 2024 as a satellite of the International Conference on Computer Science Logic (CSL 2024).

2025-11-01T17:08:22Z EPTCS 435, 2025 Alexis Saurin 10.4204/EPTCS.435 http://arxiv.org/abs/2511.00488v1 \texttt{ReMind}: Understanding Deductive Code Reasoning in LLMs 2025-11-01T10:42:40Z

Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite their advancement, empirical evidence reveals that they still struggle with \emph{deductive code reasoning}, the ability to reason about the program execution process. While prior studies have recognized this limitation, the underlying causes remain largely underexplored. In this paper, we begin by presenting a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero-shot generalization on complex benchmarks. In light of these challenges, we propose \texttt{ReMind}, a multi-agent framework composed of \texttt{Mutator}, \texttt{Executor}, and \texttt{Inspector}. The \texttt{Mutator} generates code variants to mitigate bias towards code sources, the \texttt{Executor} traces variable states step-by-step to expose inconsistency, and the \texttt{Inspector} identifies problematic reasoning steps and provides control-flow refinement to bridge the intrinsic reasoning gap. Through their coordinated collaboration, \texttt{ReMind} systematically identifies and refines reasoning flaws, achieving outstanding performance and enabling robust zero-shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the superior advantages of \texttt{ReMind} compared to baseline approaches in deductive code reasoning.

2025-11-01T10:42:40Z Jun Gao Yun Peng Xiaoxue Ren