http://arxiv.org/abs/2503.22760v1 Malicious and Unintentional Disclosure Risks in Large Language Models for Code Generation 2025-03-27T16:09:23Z

This paper explores the risk that a large language model (LLM) trained for code generation on data mined from software repositories will generate content that discloses sensitive information included in its training data. We decompose this risk, known in the literature as "unintended memorization," into two components: unintentional disclosure (where an LLM presents secrets to users without the user seeking them out) and malicious disclosure (where an LLM presents secrets to an attacker equipped with partial knowledge of the training data). We observe that while existing work mostly anticipates malicious disclosure, unintentional disclosure is also a concern. We describe methods to assess unintentional and malicious disclosure risks side-by-side across different releases of training datasets and models. We demonstrate these methods through an independent assessment of the Open Language Model (OLMo) family of models and its Dolma training datasets. Our results show, first, that changes in data source and processing are associated with substantial changes in unintended memorization risk; second, that the same set of operational changes may increase one risk while mitigating another; and, third, that the risk of disclosing sensitive information varies not only by prompt strategies or test datasets but also by the types of sensitive information. These contributions rely on data mining to enable greater privacy and security testing required for the LLM training data supply chain.

2025-03-27T16:09:23Z The 3rd International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S), co-located with SANER 2025 Rafiqul Rabin Sean McGregor Nick Judd 10.1109/SANER-C66551.2025.00016 http://arxiv.org/abs/2503.21557v1 debug-gym: A Text-Based Environment for Interactive Debugging 2025-03-27T14:43:28Z

Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.

2025-03-27T14:43:28Z Xingdi Yuan Morgane M Moss Charbel El Feghali Chinmay Singh Darya Moldavskaya Drew MacPhee Lucas Caccia Matheus Pereira Minseon Kim Alessandro Sordoni Marc-Alexandre Côté http://arxiv.org/abs/2503.20868v1 Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle 2025-03-26T18:00:01Z

Currently, the most energy-efficient hardware platforms for floating point-intensive calculations (also known as High Performance Computing, or HPC) are graphical processing units (GPUs). However, porting existing scientific codes to GPUs can be far from trivial. This article summarizes our recent advances in enabling machine-assisted, HPC-oriented refactorings with reference to existing APIs and programming idioms available in C and C++. The tool we are extending and using for the purpose is called Coccinelle. An important workflow we aim to support is that of writing and maintaining tersely written application code, while deferring circumstantial, ad-hoc, performance-related changes to specific, separate rules called semantic patches. GPUs currently offer very limited debugging facilities. The approach we are developing aims at preserving intelligibility, longevity, and relatedly, debuggability of existing code on CPUs, while at the same time enabling HPC-oriented code evolutions such as introducing support for GPUs, in a scriptable and possibly parametric manner. This article sketches a number of self-contained use cases, including further HPC-oriented cases which are independent from GPUs.

2025-03-26T18:00:01Z Michele Martone Julia Lawall http://arxiv.org/abs/2503.20469v1 Pedagogy of Teaching Pointers in the C Programming Language using Graph Transformations 2025-03-26T11:52:19Z

Visual learners think in pictures rather than words and learn best when they utilize representations based on graphs, tables, charts, maps, colors and diagrams. We propose a new pedagogy for teaching pointers in the C programming language using graph transformation systems to visually simulate pointer manipulation. In an Introduction to C course, the topic of pointers is often the most difficult one for students to understand; therefore, we experiment with graph-based representations of dynamic pointer structures to reinforce the learning. Groove, a graph transformation tool, is used to illustrate the behaviour of pointers through modelling and simulation. A study is presented to evaluate the effectiveness of the approach. This paper will also provide a comparison to other teaching methods in this area.

2025-03-26T11:52:19Z In Proceedings GCM 2023 and 2024, arXiv:2503.19632 EPTCS 417, 2025, pp. 117-133 Adwoa Donyina University of New Haven Reiko Heckel University of Leicester 10.4204/EPTCS.417.7 http://arxiv.org/abs/2503.20465v1 Linear-Time Graph Programs without Preconditions 2025-03-26T11:51:12Z

We report on a recent breakthrough in rule-based graph programming, which allows us to reach the time complexity of imperative linear-time algorithms. In general, achieving the complexity of graph algorithms in conventional languages using graph transformation rules is challenging due to the cost of graph matching. Previous work demonstrated that with rooted rules, certain algorithms can be executed in linear time using the graph programming language GP 2. However, for non-destructive algorithms that retain the structure of input graphs, achieving linear runtime required input graphs to be connected and of bounded node degree. In this paper, we overcome these preconditions by enhancing the graph data structure generated by the GP 2 compiler and exploiting the new structure in programs. We present three case studies: a cycle detection program, a program for numbering the connected components of a graph, and a breadth-first search program. Each of these programs runs in linear time on both connected and disconnected input graphs with arbitrary node degrees. We give empirical evidence for the linear time complexity by using timings for various classes of input graphs.
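
For orientation, the imperative linear-time baseline that the abstract refers to can be sketched as follows. This is a conventional Python illustration of numbering connected components in O(V + E), not GP 2 code and not the authors' program; the function name `number_components` and the edge-list input format are assumptions made for this sketch.

```python
from collections import deque


def number_components(num_nodes, edges):
    """Label every node with the index of its connected component.

    Each node and each edge is visited a constant number of times,
    so the running time is O(V + E) for any node degree, and the
    outer loop handles disconnected inputs.
    """
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    label = [None] * num_nodes
    next_label = 0
    for start in range(num_nodes):
        if label[start] is not None:
            continue  # already reached from an earlier start node
        label[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first search inside one component
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if label[neighbour] is None:
                    label[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1
    return label


# A disconnected graph: {0, 1}, {2}, {3, 4}
print(number_components(5, [(0, 1), (3, 4)]))
```

The difficulty the abstract describes is reaching this bound when every step is a rule match rather than an array lookup.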

2025-03-26T11:51:12Z In Proceedings GCM 2023 and 2024, arXiv:2503.19632. arXiv admin note: substantial text overlap with arXiv:2501.09144 EPTCS 417, 2025, pp. 39-54 Ziad Ismaili Alaoui Department of Computer Science, University of Liverpool, Liverpool, United Kingdom Detlef Plump Department of Computer Science, University of York, York, United Kingdom 10.4204/EPTCS.417.3 http://arxiv.org/abs/2503.20463v1 An Encoding of Interaction Nets in OCaml 2025-03-26T11:50:39Z

Interaction nets constitute a visual programming language grounded in graph transformation. Owing to their distinctive properties, they inherently facilitate parallelism in the rewriting step. This paper showcases a simple and concise approach to encoding interaction nets within the programming language OCaml, emphasising correctness guarantees. To achieve this objective, we encode not only the interaction net primitives, but also Lafont's original type system.

2025-03-26T11:50:39Z In Proceedings GCM 2023 and 2024, arXiv:2503.19632 EPTCS 417, 2025, pp. 1-16 Nikolaus Huber Wang Yi 10.4204/EPTCS.417.1 http://arxiv.org/abs/2503.20413v1 Zippy -- Generic White-Box Proof Search with Zippers 2025-03-26T10:41:23Z

We present a framework for tree-based proof search, called Zippy. Unlike existing proof search tools, Zippy is largely independent of concrete search tree representations, search-algorithms, states and effects. It is designed to create analysable and navigable proof searches that are open to customisation and extensions by users. Zippy is founded on concepts from functional programming theory, particularly zippers, arrows, monads, and lenses. We implemented the framework in Isabelle's metaprogramming language Isabelle/ML.
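
As background on the zipper concept named above, here is a minimal Python sketch of a zipper over a search tree: a focused node plus the path needed to rebuild the rest of the tree. It is not Zippy's Isabelle/ML interface; the `Node` and `Zipper` classes and their method names are hypothetical and serve only to illustrate navigation and local replacement without mutation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


@dataclass(frozen=True)
class Zipper:
    focus: Node
    # Path to the root: (parent label, left siblings, right siblings, parent path)
    path: Optional[tuple] = None

    def down(self, index):
        """Move the focus to the child at the given index."""
        kids = self.focus.children
        step = (self.focus.label, kids[:index], kids[index + 1:], self.path)
        return Zipper(kids[index], step)

    def up(self):
        """Rebuild the parent around the (possibly replaced) focus."""
        label, left, right, parent_path = self.path
        return Zipper(Node(label, left + (self.focus,) + right), parent_path)

    def replace(self, node):
        """Swap the focused subtree; the original tree is left untouched."""
        return Zipper(node, self.path)

    def root(self):
        """Walk back up and return the whole rebuilt tree."""
        zipper = self
        while zipper.path is not None:
            zipper = zipper.up()
        return zipper.focus


tree = Node("goal", (Node("a"), Node("b")))
updated = Zipper(tree).down(1).replace(Node("b2")).root()
print(updated)
```

Because every operation returns a new value, earlier positions in a search remain available for inspection or backtracking.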

2025-03-26T10:41:23Z Kevin Kappelmann http://arxiv.org/abs/2503.04408v2 Linearization via Rewriting (Long Version) 2025-03-25T14:45:51Z

We introduce the structural resource lambda-calculus, a new formalism in which strongly normalizing terms of the lambda-calculus can naturally be represented, and at the same time any type derivation can be internally rewritten to its linearization. The calculus is shown to be normalizing and confluent. Notably, every strongly normalizable lambda-term can be represented by a type derivation. This is the first example of a system where the linearization process takes place internally, while remaining purely finitary and rewrite-based.

2025-03-06T13:04:44Z Ugo Dal Lago Federico Olimpieri http://arxiv.org/abs/2503.19632v1 Proceedings of the Fourteenth and Fifteenth International Workshop on Graph Computation Models 2025-03-25T13:19:26Z

This volume contains the post-proceedings of the Fourteenth and the Fifteenth International Workshops on Graph Computation Models (GCM 2023 and 2024). The workshops took place in Leicester, UK on 18th July 2023 and Enschede, the Netherlands on 9th July 2024, in each case as part of STAF (Software Technologies: Applications and Foundations). Graphs are common mathematical structures that are visual and intuitive. They constitute a natural and seamless way for system modeling in science, engineering, and beyond, including computer science, biology, and business process modeling. Graph computation models constitute a class of very high-level models where graphs are first-class citizens. The aim of the International GCM Workshop series is to bring together researchers interested in all aspects of computation models based on graphs and graph transformation. It promotes the cross-fertilizing exchange of ideas and experiences among senior and young researchers from the different communities interested in the foundations, applications, and implementations of graph computation models and related areas.

2025-03-25T13:19:26Z EPTCS 417, 2025 Jörg Endrullis Dominik Grzelak Tobias Heindel Jens Kosiol 10.4204/EPTCS.417 http://arxiv.org/abs/2503.17004v1 Text2Model: Generating dynamic chemical reactor models using large language models (LLMs) 2025-03-21T10:09:34Z

As large language models have shown remarkable capabilities in conversing via natural language, the question arises as to how LLMs could potentially assist chemical engineers in research and industry with domain-specific tasks. We generate dynamic chemical reactor models in Modelica code format from textual descriptions as user input. We fine-tune Llama 3.1 8B Instruct on synthetically generated Modelica code for different reactor scenarios. We compare the performance of our fine-tuned model to the baseline Llama 3.1 8B Instruct model and GPT4o. We manually assess the models' predictions regarding the syntactic and semantic accuracy of the generated dynamic models. We find that considerable improvements are achieved by the fine-tuned model with respect to both the semantic and the syntactic accuracy of the Modelica models. However, the fine-tuned model lacks a satisfactory ability to generalize to unseen scenarios compared to GPT4o.

2025-03-21T10:09:34Z Sophia Rupprecht Yassine Hounat Monisha Kumar Giacomo Lastrucci Artur M. Schweidtmann http://arxiv.org/abs/2503.16971v1 Nofl: A Precise Immix 2025-03-21T09:38:16Z

Can a memory manager be built with fast bump-pointer allocation, single-pass heap tracing, and a low upper bound on memory overhead? The Immix collector answered in the affirmative for the first two, but the granularity at which it reclaims memory means that in the worst case a tiny object can keep two 128-byte lines of memory from being re-used for allocation. This paper takes Immix to an extreme of precision, allowing all free space between objects to be reclaimed, down to the limit of the allocator's minimum alignment. We present the design of this Nofl layout, build a collector library around it, and build a new Scheme-to-C compiler as a workbench. We make a first evaluation of the Nofl-based mostly-marking collector when compared to standard copying and mark-sweep collectors and run against a limited set of microbenchmarks, finding that Nofl outperforms the others for tight-to-adequate heap sizes.
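
The granularity argument can be made concrete with a little arithmetic. The sketch below assumes one way a tiny object can pin two lines, namely by straddling a line boundary; it illustrates the byte counts only and is not the marking logic of Immix or Nofl. The 128-byte line size comes from the abstract, while `GRANULE_BYTES = 16` is a hypothetical minimum alignment chosen for illustration.

```python
LINE_BYTES = 128     # line size stated in the abstract
GRANULE_BYTES = 16   # hypothetical minimum alignment (an assumption)


def retained_bytes(start, size, granularity):
    """Bytes kept from reuse when liveness is tracked in units of `granularity`."""
    first_unit = start // granularity
    last_unit = (start + size - 1) // granularity
    return (last_unit - first_unit + 1) * granularity


# A 32-byte object at offset 112 crosses the boundary between line 0 and line 1.
print(retained_bytes(112, 32, LINE_BYTES))     # line granularity
print(retained_bytes(112, 32, GRANULE_BYTES))  # alignment granularity
```

Under these assumptions the object holds 256 bytes at line granularity but only its own 32 bytes at alignment granularity, which is the gap the abstract sets out to close.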

2025-03-21T09:38:16Z 10 pages, 5 figures; submitted to ISMM'25 Andy Wingo http://arxiv.org/abs/2503.16686v1 Spatial Data Science Languages: commonalities and needs 2025-03-20T20:06:10Z

Recent workshops brought together several developers, educators and users of software packages extending popular languages for spatial data handling, with a primary focus on R, Python and Julia. Common challenges discussed included handling of spatial or spatio-temporal support, geodetic coordinates, in-memory vector data formats, data cubes, inter-package dependencies, packaging upstream libraries, differences in habits or conventions between the GIS and physical modelling communities, and statistical models. The following set of insights has been formulated: (i) considering software problems across data science language silos helps to understand and standardise analysis approaches, also outside the domain of formal standardisation bodies; (ii) whether attribute variables have block or point support, and whether they are spatially intensive or extensive, has consequences for permitted operations, and hence for software implementing those; (iii) handling geometries on the sphere rather than on the flat plane requires modifications to the logic of simple features; (iv) managing communities and fostering diversity is a necessary, on-going effort; and (v) tools for cross-language development need more attention and support.

2025-03-20T20:06:10Z Edzer Pebesma Martin Fleischmann Josiah Parry Jakub Nowosad Anita Graser Dewey Dunnington Maarten Pronk Rafael Schouten Robin Lovelace Marius Appel Lorena Abad http://arxiv.org/abs/2503.08738v3 Shedding Light in Task Decomposition in Program Synthesis: The Driving Force of the Synthesizer Model 2025-03-20T08:23:40Z

Task decomposition is a fundamental mechanism in program synthesis, enabling complex problems to be broken down into manageable subtasks. ExeDec, a state-of-the-art program synthesis framework, employs this approach by combining a Subgoal Model for decomposition and a Synthesizer Model for program generation to facilitate compositional generalization. In this work, we develop REGISM, an adaptation of ExeDec that removes decomposition guidance and relies solely on iterative execution-driven synthesis. By comparing these two exemplary approaches (ExeDec, which leverages task decomposition, and REGISM, which does not), we investigate the interplay between task decomposition and program generation. Our findings indicate that ExeDec exhibits significant advantages in length generalization and concept composition tasks, likely due to its explicit decomposition strategies. At the same time, REGISM frequently matches or surpasses ExeDec's performance across various scenarios, with its solutions often aligning more closely with ground truth decompositions. These observations highlight the importance of repeated execution-guided synthesis in driving task-solving performance, even within frameworks that incorporate explicit decomposition strategies. Our analysis suggests that task decomposition approaches like ExeDec hold significant potential for advancing program synthesis, though further work is needed to clarify when and why these strategies are most effective.

2025-03-11T06:30:49Z Accepted at ICLR 2025 Workshop Deep Learning for Code Janis Zenkner Tobias Sesterhenn Christian Bartelt http://arxiv.org/abs/2411.04905v3 OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models 2025-03-20T03:28:56Z

Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpus related to code and (3) high-quality synthetic data in both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research, and enable reproducible advancements in code AI.

2024-11-07T17:47:25Z Siming Huang Tianhao Cheng J. K. Liu Jiaran Hao Liuyihan Song Yang Xu J. Yang Jiaheng Liu Chenchen Zhang Linzheng Chai Ruifeng Yuan Zhaoxiang Zhang Jie Fu Qian Liu Ge Zhang Zili Wang Yuan Qi Yinghui Xu Wei Chu http://arxiv.org/abs/2101.08939v5 Hoare meets Heisenberg: A Lightweight Logic for Quantum Programs 2025-03-20T02:17:35Z

We show that Gottesman's (1998) semantics for Clifford circuits based on the Heisenberg representation gives rise to a lightweight Hoare-like logic for efficiently characterizing a common subset of quantum programs. Our applications include (i) certifying whether auxiliary qubits can be safely disposed of, (ii) determining if a system is separable across a given bipartition, (iii) checking the transversality of a gate with respect to a given stabilizer code, and (iv) computing post-measurement states for computational basis measurements. Further, this logic is extended to accommodate universal quantum computing by deriving Hoare triples for the $T$-gate, multiply-controlled unitaries such as the Toffoli gate, and some gate injection circuits that use associated magic states. A number of interesting results emerge from this logic, including a lower bound on the number of $T$ gates necessary to perform a multiply-controlled $Z$ gate.
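
The Heisenberg-representation idea underlying the logic can be sketched as tracking how Pauli operators are transformed by Clifford gates instead of tracking state vectors. The Python below is a minimal illustration using the standard binary (x, z) encoding of Pauli strings; it ignores signs and phases and is not the paper's logic or notation. The function names `apply_h`, `apply_cnot` and `to_string` are assumptions made for this sketch.

```python
def apply_h(pauli, qubit):
    """Hadamard exchanges X and Z on one qubit (signs ignored)."""
    x, z = list(pauli[0]), list(pauli[1])
    x[qubit], z[qubit] = z[qubit], x[qubit]
    return tuple(x), tuple(z)


def apply_cnot(pauli, control, target):
    """CNOT copies X from control to target and Z from target to control."""
    x, z = list(pauli[0]), list(pauli[1])
    x[target] ^= x[control]
    z[control] ^= z[target]
    return tuple(x), tuple(z)


def to_string(pauli):
    """Render an (x bits, z bits) pair as a Pauli string such as 'XX'."""
    names = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
    return "".join(names[(xb, zb)] for xb, zb in zip(*pauli))


# Z on qubit 0 and Z on qubit 1: the stabilizer generators of |00>.
z0 = ((0, 0), (1, 0))
z1 = ((0, 0), (0, 1))

# Push both generators through H on qubit 0, then CNOT from qubit 0 to qubit 1.
for generator in (z0, z1):
    out = apply_cnot(apply_h(generator, 0), 0, 1)
    print(to_string(generator), "->", to_string(out))
```

The two generators become XX and ZZ, the stabilizers of a Bell state, so a property of the output state is read off from two update rules per gate rather than from amplitudes.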

2021-01-22T04:07:12Z 52 pages, 3 figures Aarthi Sundaram Robert Rand Kartik Singhal Brad Lackey