https://arxiv.org/api/k8t8TiHFyheAAHC61uUyNX/oXHs2026-03-22T08:50:27Z253021515http://arxiv.org/abs/2603.15159v3To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation2026-03-19T05:08:18ZLarge Language Models (LLMs) have shown strong potential for code generation, yet they remain limited in private-library-oriented code generation, where the goal is to generate code using APIs from private libraries. Existing approaches mainly rely on retrieving private-library API documentation and injecting relevant knowledge into the context at inference time. However, our study shows that this is insufficient: even given accurate required knowledge, LLMs still struggle to invoke private-library APIs effectively.
To address this limitation, we propose PriCoder, an approach that teaches LLMs to invoke private-library APIs through automatically synthesized data. Specifically, PriCoder models private-library data synthesis as the construction of a graph, and alternates between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing more diverse training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. To support rigorous evaluation, we construct two new benchmarks based on recently released libraries that are unfamiliar to the tested models. Experiments on three mainstream LLMs show that PriCoder substantially improves private-library-oriented code generation, yielding gains of over 20% in pass@1 in many settings, while causing negligible impact on general code generation capability. Our code and benchmarks are publicly available at https://github.com/eniacode/PriCoder.2026-03-16T11:53:39Z12 pagesYitong ZhangChengze LiRuize ChenGuowei YangXiaoran JiaYijie RenJia Lihttp://arxiv.org/abs/2603.17821v2CodeT5-RNN: Reinforcing Contextual Embeddings for Enhanced Code Comprehension2026-03-19T04:48:28ZContextual embeddings generated by LLMs exhibit strong positional inductive biases, which can limit their ability to fully capture long-range, order-sensitive dependencies in highly structured source code. Consequently, how to further refine and enhance LLM embeddings for improved code understanding remains an open research question. To address this gap, we propose a hybrid LLM-RNN framework that reinforces LLM-generated contextual embeddings with a sequential RNN architecture. The embeddings reprocessing step aims to reinforce sequential semantics and strengthen order-aware dependencies inherent in source code. We evaluate the proposed hybrid models on both benchmark and real-world coding datasets. The experimental results show that the RoBERTa-BiGRU and CodeBERT-GRU models achieved accuracies of 66.40% and 66.03%, respectively, on the defect detection benchmark dataset, representing improvements of approximately 5.35% and 3.95% over the standalone RoBERTa and CodeBERT models. Furthermore, the CodeT5-GRU and CodeT5+-BiGRU models achieved accuracies of 67.90% and 67.79%, respectively, surpassing their base models and outperforming RoBERTa-BiGRU and CodeBERT-GRU by a notable margin. In addition, CodeT5-GRU model attains weighted and macro F1-scores of 67.18% and 67.00%, respectively, on the same dataset. Extensive experiments across three real-world datasets further demonstrate consistent and statistically significant improvements over standalone LLMs. Overall, our findings indicate that reprocessing contextual embeddings with RNN architectures enhances code understanding performance in LLM-based models.2026-03-18T15:12:33ZMd Mostafizer RahmanAriful Islam ShipluYutaka WatanobeMd Faizul Ibne AminSyed Rameez NaqviFang Liuhttp://arxiv.org/abs/2508.11468v2TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation2026-03-19T03:49:51ZWhile Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of \textit{execution efficiency} remains overlooked. We present \textbf{\textsc{trace}}, the first benchmark to explicitly assess efficiency in LLM-translated code. \textsc{trace} includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using \textsc{trace}, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader \textit{Claude-4-think} achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as \textit{Qwen2.5-Coder-14B-Instruct}. 2) Inefficiency is both prevalent and patterned: 23.5\% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9\%), language construct mismatches (66.4\%), and resource mismanagement (21.7\%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position \textsc{trace} as a principled foundation for efficiency-oriented evaluation.2025-08-15T13:33:52ZZhihao GongZeyu SunDong HuangQingyuan LiangJie M. ZhangDan Haohttp://arxiv.org/abs/2603.18449v1CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer2026-03-19T03:21:56ZThe widespread deployment of large language models (LLMs) calls for post-hoc methods that can flexibly adapt models to evolving safety requirements. Meanwhile, the rapidly expanding open-source LLM ecosystem has produced a diverse collection of models that already exhibit various safety-related functionalities. This motivates a shift from constructing safety functionality from scratch to reusing existing functionality from external models, thereby avoiding costly data collection and training procedures.
In this paper, we present Cross-Model Neuron Transfer (CNT), a post-hoc method that reuses safety-oriented functionality by transferring a minimal subset of neurons from an open-source donor LLM to a target LLM. By operating at the neuron level, CNT enables modular function-level adaptation, supporting both function addition andfunction deletion. We evaluate CNT on seven popular LLMs across three representative applications: safety disalignment, alignment enhancement, and bias removal. Experimental results show that CNT achieves targeted safety-oriented functionality transfer with minimal performance degradation (less than 1% for most models), consistently outperforming five baselines, demonstrating its generality and practical effectiveness.2026-03-19T03:21:56ZYue ZhaoYujia GongRuigang LiangShenchen ZhuKai ChenXuejing YuanWangjun Zhanghttp://arxiv.org/abs/2603.16479v2TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation2026-03-19T03:05:38ZWhile Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of \textit{execution efficiency} remains overlooked. We present \textbf{\textsc{trace}}, the first benchmark to explicitly assess efficiency in LLM-translated code. \textsc{trace} includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using \textsc{trace}, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader \textit{Claude-4-think} achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as \textit{Qwen2.5-Coder-14B-Instruct}. 2) Inefficiency is both prevalent and patterned: 23.5\% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9\%), language construct mismatches (66.4\%), and resource mismanagement (21.7\%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position \textsc{trace} as a principled foundation for efficiency-oriented evaluation.2026-03-17T13:05:54ZSubmitted in error as a new submission instead of a replacement for arXiv:2508.11468Zhihao GongZeyu SunDong HuangQingyuan LiangJie M. ZhangDan Haohttp://arxiv.org/abs/2603.17060v2LLM Use, Cheating, and Academic Integrity in Software Engineering Education2026-03-19T02:03:40ZBackground: Cheating in university education is commonly described as context dependent and influenced by assessment design, institutional norms, and student interpretation. In software engineering education, programming oriented coursework has historically involved ambiguity around collaboration, reuse, and external assistance. Recently, large language models (LLMs) have introduced additional mediation in the production of code and related artifacts. Aims: This study investigates how software engineering students describe experiences of using LLMs in ways they perceived as inappropriate, disallowed, or misaligned with course expectations. Method: A cross sectional survey was conducted with 116 undergraduate software engineering students from multiple countries, combining quantitative summaries with qualitative data. Results: Reported LLM cheating practices occurred primarily in programming assignments, routine coursework, and documentation tasks, often in contexts of time pressure and unclear guidance. Use during quizzes and exams was less frequent and more consistently identified as a violation. Students reported awareness of academic and professional consequences regarding LLM cheating, while formal sanctions were perceived as limited. Conclusions: Our study indicates that reported LLM misuse in software engineering is associated with assessment and instructional conditions, suggesting a need for clearer alignment between assessment design, learning objectives, and expectations for LLM use.2026-03-17T18:48:33ZRonnie de Souza SantosItalo SantosMariana BentoGiuseppe DestefanisCleyton MagalhãesMairieli Wesselhttp://arxiv.org/abs/2603.18393v1Where are the Hidden Gems? Applying Transformer Models for Design Discussion Detection2026-03-19T01:27:51ZDesign decisions are at the core of software engineering and appear in Q\&A forums, mailing lists, pull requests, issue trackers, and commit messages. Design discussions spanning a project's history provide valuable information for informed decision-making, such as refactoring and software modernization. Machine learning techniques have been used to detect design decisions in natural language discussions; however, their effectiveness is limited by the scarcity of labeled data and the high cost of annotation. Prior work adopted cross-domain strategies with traditional classifiers, training on one domain and testing on another. Despite their success, transformer-based models, which often outperform traditional methods, remain largely unexplored in this setting. The goal of this work is to investigate the performance of transformer-based models (i.e., BERT, RoBERTa, XLNet, LaMini-Flan-T5-77M, and ChatGPT-4o-mini) for detecting design-related discussions. To this end, we conduct a conceptual replication of prior cross-domain studies while extending them with modern transformer architectures and addressing methodological issues in earlier work. The models were fine-tuned on Stack Overflow and evaluated on GitHub artifacts (i.e., pull requests, issues, and commits). BERT and RoBERTa show strong recall across domains, while XLNet achieves higher precision but lower recall. ChatGPT-4o-mini yields the highest recall and competitive overall performance, whereas LaMini-Flan-T5-77M provides a lightweight alternative with stronger precision but less balanced performance. We also evaluated similar-word injection for data augmentation, but unlike prior findings, it did not yield meaningful improvements. Overall, these results highlight both the opportunities and trade-offs of using modern language models for detecting design discussion.2026-03-19T01:27:51ZLawrence ArkohDaniel FeitosaWesley K. G. Assunçãohttp://arxiv.org/abs/2603.18372v1TENSURE: Fuzzing Sparse Tensor Compilers (Registered Report)2026-03-19T00:13:14ZSparse Tensor Compilers (STCs) have emerged as critical infrastructure for optimizing high-dimensional data analytics and machine learning workloads. The STCs must synthesize complex, irregular control flow for various compressed storage formats directly from high-level declarative specifications, thereby making them highly susceptible to subtle correctness defects. Existing testing frameworks, which rely on mutating computation graphs restricted to a standard vocabulary of operators, fail to exercise the arbitrary loop synthesis capabilities of these compilers. Furthermore, generic grammar-based fuzzers struggle to generate valid inputs due to the strict rules governing how indices are reused across multiple tensors.
In this paper, we present TENSURE, the first extensible black-box fuzzing framework specifically designed for the testing of STCs. TENSURE leverages Einstein Summation (Einsum) notation as a general input abstraction, enabling the generation of complex, unconventional tensor contractions that expose corner cases in the code-generation phases of STCs. We propose a novel constraint-based generation algorithm that guarantees 100% semantic validity of synthesized kernels, significantly outperforming the ~3.3% validity rate of baseline grammar fuzzers. To enable metamorphic testing without a trusted reference, we introduce a set of semantic-preserving mutation operators that exploit algebraic commutativity and heterogeneity in storage formats. Our evaluation on two state-of-the-art systems, TACO and Finch, reveals widespread fragility, particularly in TACO, where TENSURE exposed crashes or silent miscompilations in a majority of generated test cases. These findings underscore the critical need for specialized testing tools in the sparse compilation ecosystem.2026-03-19T00:13:14ZKabilan MahathevanYining ZhangMuhammad Ali GulzarKirshanthan Sundararajah10.14722/fuzzing.2026.23006http://arxiv.org/abs/2602.24055v3CIRCLE: A Framework for Evaluating AI from a Real-World Lens2026-03-18T23:34:32ZThis paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI's materialized outcomes in deployment. Current approaches such as MLOps frameworks and AI model benchmarks offer detailed insights into system stability and model capabilities, but they do not provide decision-makers outside the AI stack with systematic evidence of how these systems actually behave in real-world contexts or affect their organizations over time. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This, in turn, can enable governance based on materialized downstream effects rather than theoretical capabilities.2026-02-27T14:43:23ZAccepted at Intelligent Systems Conference (IntelliSys) 2026Reva SchwartzCarina WestlingMorgan BriggsMarzieh FadaeeIsar NejadgholiMatthew HolmesFariza RashidMaya CarlyleAfaf TaïkKyra WilsonPeter DouglasTheodora SkeadasGabriella WatersRumman ChowdhuryThiago Lacerdahttp://arxiv.org/abs/2603.18334v1Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought2026-03-18T22:42:20ZAs Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.2026-03-18T22:42:20ZZichen XieWenxi Wanghttp://arxiv.org/abs/2603.00601v4Theory of Code Space: Do Code Agents Understand Software Architecture?2026-03-18T22:36:55ZAI code agents excel at isolated tasks yet struggle with multi-file software engineering requiring architectural understanding. We introduce Theory of Code Space (ToCS), a benchmark that evaluates whether agents can construct, maintain, and update coherent architectural beliefs during codebase exploration. Agents explore procedurally generated codebases under partial observability -- opening files under a budget -- and periodically externalize their belief state as structured JSON, producing a time-series of architectural understanding. Three findings emerge from experiments with four baselines and six frontier LLMs. First, the Active-Passive Gap is model-dependent: one model builds better maps through active exploration than from seeing all files at once, while another shows the opposite -- revealing that active exploration is itself a non-trivial capability absent from some models. Second, retaining structured belief maps in context acts as self-scaffolding for some models but not others, showing that the mechanism is model-dependent. Third, belief state maintenance varies dramatically: a smaller model maintains perfectly stable beliefs across probes while its larger sibling suffers catastrophic belief collapse -- forgetting previously-discovered components between probes. We release ToCS as open-source software. Code: https://github.com/che-shr-cat/tocs2026-02-28T11:40:17Zupdated figures and numbersGrigory Sapunovhttp://arxiv.org/abs/2603.18245v1Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety2026-03-18T20:06:47ZLarge Language Model (LLM) agents increasingly act through external tools, making their safety contingent on tool-call workflows rather than text generation alone. While recent benchmarks evaluate agents across diverse environments and risk categories, a fundamental question remains unanswered: how complete are existing test suites, and what unsafe interaction patterns persist even after an agent passes the benchmark? We propose SafeAudit, a meta-audit framework that addresses this gap through two contributions. First, an LLM-based enumerator that systematically generates test cases by enumerating valid tool-call workflows and diverse user scenarios. Second, we introduce rule-resistance, a non-semantic, quantitative metric that distills compact safety rules from existing benchmarks and identifies unsafe interaction patterns that remain uncovered under those rules. Across 3 benchmarks and 12 environments, SafeAudit uncovers more than 20% residual unsafe behaviors that existing benchmarks fail to expose, with coverage growing monotonically as the testing budget increases. Our results highlight significant completeness gaps in current safety evaluation and motivate meta-auditing as a necessary complement to benchmark-based agent safety testing.2026-03-18T20:06:47ZXuan ChenLu YanRuqi ZhangXiangyu Zhanghttp://arxiv.org/abs/2603.00331v2Linting Style and Substance in READMEs2026-03-18T17:50:48ZREADMEs shape first impressions of software projects, yet what constitutes a good README varies across audiences and contexts. Research software needs reproducibility details, while open-source libraries might prioritize quick-start guides. Through a design probe, LintMe, we explore how linting can be used to improve READMEs given these diverse contexts, aiding style and content issues while preserving authorial agency. Users create context-specific checks using a lightweight DSL that uses a novel combination of programmatic operations (e.g., for broken links) with LLM-based content evaluation (e.g., for detecting jargon), yielding checks that would be challenging for prior linters. Through a user study (N=11), comparison with naive LLM usage, and an extensibility case study, we find that our design is approachable, flexible, and well matched with the needs of this domain. This work opens the door for linting more complex documentation and other culturally mediated text-based documents.2026-02-27T22:00:56ZAccepted to CHI2026Hima MynampatyNathania JosephineKatherine E. IsaacsAndrew M. McNutthttp://arxiv.org/abs/2603.17974v1Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection2026-03-18T17:38:35ZSoftware vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.2026-03-18T17:38:35ZSupervisor: Prof. Massih-Reza AminiAmine Lbathhttp://arxiv.org/abs/2603.17924v1CodeGreen: Towards Improving Precision and Portability in Software Energy Measurement2026-03-18T17:01:12ZAccurate software energy measurement is critical for optimizing energy, yet existing profilers force a trade-off between measurement accuracy and overhead due to tight coupling with supported specific hardware or languages. We present CodeGreen, a modular energy measurement platform that decouples instrumentation from measurement via an asynchronous producer-consumer architecture. We implement a Native Energy Measurement Backend (NEMB) that polls hardware sensors (Intel RAPL, NVIDIA NVML, AMD ROCm) independently, while lightweight timestamp markers enable tunable granularity. CodeGreen leverages Tree-sitter AST queries for automated instrumentation across Python, C++, C, and Java, with straightforward extension to any Tree-sitter-supported grammar, enabling developers to target specific scopes (loops, methods, classes) without manual intervention. Validation against "Computer Language Benchmarks Game" demonstrates $R^2 = 0.9934$ correlation with RAPL ground truth and $R^2 = 0.9997$ energy-workload linearity. By bridging fine-grained measurement precision with cross-platform portability, CodeGreen enables practical algorithmic energy optimization across heterogeneous environments. Source code, video demonstration, and documentation for the tool are publicly available at: https://smart-dal.github.io/codegreen/.2026-03-18T17:01:12ZSaurabhsingh RajputTushar Sharma