http://arxiv.org/abs/2509.11145v2 Text2Mem: A Unified Memory Operation Language for Memory Operating System 2025-10-23T17:53:03Z

Large language model agents increasingly depend on memory to sustain long-horizon interaction, but existing frameworks remain limited. Most expose only a few basic primitives such as encode, retrieve, and delete, while higher-order operations like merge, promote, demote, split, lock, and expire are missing or inconsistently supported. Moreover, there is no formal and executable specification for memory commands, leaving scope and lifecycle rules implicit and causing unpredictable behavior across systems. We introduce Text2Mem, a unified memory operation language that provides a standardized pathway from natural language to reliable execution. Text2Mem defines a compact yet expressive operation set aligned with encoding, storage, and retrieval. Each instruction is represented as a JSON-based schema instance with required fields and semantic invariants, which a parser transforms into typed operation objects with normalized parameters. A validator ensures correctness before execution, while adapters map typed objects either to a SQL prototype backend or to real memory frameworks. Model-based services such as embeddings or summarization are integrated when required. All results are returned through a unified execution contract. This design ensures safety, determinism, and portability across heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark that separates schema generation from backend execution to enable systematic evaluation. Together, these components establish the first standardized foundation for memory control in agents.
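The parse, validate, and execute pipeline described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from Text2Mem itself: the `op`/`args` field names, the `REQUIRED_ARGS` table, the `Operation` and `DictAdapter` classes, and the `ok`/`errors`/`data` result shape are all hypothetical, and a plain dictionary stands in for a real backend.

```python
import json
from dataclasses import dataclass

# Hypothetical required arguments per operation (not the Text2Mem schema).
REQUIRED_ARGS = {"encode": {"content"}, "retrieve": {"query"}, "delete": {"id"}}


@dataclass(frozen=True)
class Operation:
    """Typed operation object produced by the parser."""
    op: str
    args: dict


def parse(raw: str) -> Operation:
    """Turn a JSON instruction into a typed object with a normalized op name."""
    doc = json.loads(raw)
    return Operation(op=str(doc.get("op", "")).strip().lower(),
                     args=dict(doc.get("args", {})))


def validate(operation: Operation) -> list:
    """Return a list of error strings; an empty list means the operation is valid."""
    if operation.op not in REQUIRED_ARGS:
        return [f"unknown operation: {operation.op!r}"]
    missing = sorted(REQUIRED_ARGS[operation.op] - operation.args.keys())
    return [f"missing argument: {name}" for name in missing]


class DictAdapter:
    """Toy backend: maps typed operations onto an in-memory dictionary."""

    def __init__(self):
        self.items = {}
        self.next_id = 1

    def execute(self, operation: Operation) -> dict:
        # Every call returns the same result shape, valid or not.
        errors = validate(operation)
        if errors:
            return {"ok": False, "errors": errors, "data": None}
        if operation.op == "encode":
            item_id = self.next_id
            self.next_id += 1
            self.items[item_id] = str(operation.args["content"])
            data = {"id": item_id}
        elif operation.op == "retrieve":
            query = str(operation.args["query"]).lower()
            data = [{"id": i, "content": c}
                    for i, c in self.items.items() if query in c.lower()]
        else:  # delete
            removed = self.items.pop(operation.args["id"], None)
            data = {"deleted": removed is not None}
        return {"ok": True, "errors": [], "data": data}
```

Calling `DictAdapter().execute(parse('{"op": "merge", "args": {}}'))` returns a result with `ok` set to `False` rather than raising, which is one way to keep rejected and accepted instructions under a single result contract.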

2025-09-14T07:30:09Z 12 pages, 3 figures, 2 tables Yi Wang Lihai Yang Boyu Chen Gongyi Zou Kerun Xu Bo Tang Feiyu Xiong Siheng Chen Zhiyu Li http://arxiv.org/abs/2505.13938v4 CLEVER: A Curated Benchmark for Formally Verified Code Generation 2025-10-23T16:29:07Z

We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).

2025-05-20T05:15:47Z Amitayush Thakur Jasper Lee George Tsoukalas Meghana Sistla Matthew Zhao Stefan Zetzsche Greg Durrett Yisong Yue Swarat Chaudhuri http://arxiv.org/abs/2510.20688v1 SafeFFI: Efficient Sanitization at the Boundary Between Safe and Unsafe Code in Rust and Mixed-Language Applications 2025-10-23T16:02:45Z

Unsafe Rust code is necessary for interoperability with C/C++ libraries and implementing low-level data structures, but it can cause memory safety violations in otherwise memory-safe Rust programs. Sanitizers can catch such memory errors at runtime, but introduce many unnecessary checks even for memory accesses guaranteed safe by the Rust type system. We introduce SafeFFI, a system for optimizing memory safety instrumentation in Rust binaries such that checks occur at the boundary between unsafe and safe code, handing over the enforcement of memory safety from the sanitizer to the Rust type system. Unlike previous approaches, our design avoids expensive whole-program analysis and adds much less compile-time overhead (2.64x compared to over 8.83x). On a collection of popular Rust crates and known vulnerable Rust code, SafeFFI achieves superior performance compared to state-of-the-art systems, reducing sanitizer checks by up to 98%, while maintaining correctness and flagging all spatial and temporal memory safety violations.

2025-10-23T16:02:45Z Oliver Braunsdorf Tim Lange Konrad Hohentanner Julian Horsch Johannes Kinder http://arxiv.org/abs/2510.20547v1 Compiling the Mimosa programming language to RTOS tasks 2025-10-23T13:33:17Z

This paper introduces a compilation scheme for programs written in the Mimosa programming language, which builds upon the MIMOS model of computation. Mimosa describes embedded systems software as a collection of time-triggered processes which communicate through FIFO queues. We formally describe an adaptation of the Lustre compilation scheme to the semantics of Mimosa and show how the coordination layer can be mapped to real-time operating system primitives.

2025-10-23T13:33:17Z Nikolaus Huber Susanne Graf Philipp Rümmer Wang Yi http://arxiv.org/abs/2510.20532v1 Deciding not to Decide: Sound and Complete Effect Inference in the Presence of Higher-Rank Polymorphism 2025-10-23T13:16:17Z

Type-and-effect systems help the programmer to organize data and computational effects in a program. While expressive variants of traditional type systems with sophisticated inference algorithms have been developed and widely used in programming languages, type-and-effect systems have not yet gained widespread adoption. One reason for this is that type-and-effect systems are more complex and the existing inference algorithms make compromises between expressiveness, intuitiveness, and decidability. In this work, we present an effect inference algorithm for a type-and-effect system with subtyping, expressive higher-rank polymorphism, and intuitive set-like semantics of effects. In order to deal with scoping issues of higher-rank polymorphism, we delay solving of effect constraints by transforming them into formulae of propositional logic. We prove soundness and completeness of our algorithm with respect to a declarative type-and-effect system. All the presented results have been formalized in the Rocq proof assistant, and the algorithm has been successfully implemented in a realistic programming language.

2025-10-23T13:16:17Z Patrycja Balik Szymon Jędras Piotr Polesiuk http://arxiv.org/abs/2507.19271v2 Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects 2025-10-23T12:17:37Z

Code review is essential for maintaining software quality but is often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents an empirical evaluation of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C#-specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.

2025-07-25T13:49:24Z Igli Begolli Meltem Aksoy Daniel Neider http://arxiv.org/abs/2510.20018v1 Deconstructed Proto-Quipper: A Rational Reconstruction 2025-10-22T20:40:42Z

The Proto-Quipper family of programming languages aims to provide a formal foundation for the Quipper quantum programming language. Unfortunately, Proto-Quipper languages have complex operational semantics: they are inherently effectful, and they rely on set-theoretic operations and fresh name generation to manipulate quantum circuits. This makes them difficult to reason about using standard programming language techniques and, ultimately, to mechanize. We introduce Proto-Quipper-A, a rational reconstruction of Proto-Quipper languages for static circuit generation. It uses a linear $λ$-calculus to describe quantum circuits with normal forms that closely correspond to box-and-wire circuit diagrams. Adjoint-logical foundations integrate this circuit language with a linear/non-linear functional language and let us reconstruct Proto-Quipper's circuit programming abstractions using more primitive adjoint-logical operations. Proto-Quipper-A enjoys a simple call-by-value reduction semantics, and to illustrate its tractability as a foundation for Proto-Quipper languages, we show that it is normalizing. We show how to use standard logical relations to prove normalization of linear and substructural systems, thereby avoiding the inherent complexity of existing linear logical relations.

2025-10-22T20:40:42Z Submitted to the 35th European Symposium on Programming (ESOP 2026) Ryan Kavanagh Chuta Sano Brigitte Pientka http://arxiv.org/abs/2510.19765v1 Tidying Up the Address Space 2025-10-22T16:50:49Z

Memory tiering in datacenters does not achieve its full potential due to hotness fragmentation -- the intermingling of hot and cold objects within memory pages. This fragmentation prevents page-based reclamation systems from distinguishing truly hot pages from pages containing mostly cold objects, fundamentally limiting memory efficiency despite highly skewed accesses. We introduce address-space engineering: dynamically reorganizing application virtual address spaces to create uniformly hot and cold regions that any page-level tiering backend can manage effectively. HADES demonstrates this frontend/backend approach through a compiler-runtime system that tracks and migrates objects based on access patterns, requiring minimal developer intervention. Evaluations across ten data structures achieve up to 70% memory reduction with 3% performance overhead, showing that address space engineering enables existing reclamation systems to reclaim memory aggressively without performance degradation.

2025-10-22T16:50:49Z Vinay Banakar Suli Yang Kan Wu Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau Kimberly Keeton 10.1145/3764862.3768179 http://arxiv.org/abs/2407.01742v3 The Continuous Tensor Abstraction: Where Indices are Real 2025-10-22T12:24:23Z

This paper introduces the continuous tensor abstraction, allowing indices to take real-number values (for example, A[3.14]). It also presents continuous tensor algebra expressions, such as C(x,y) = A(x,y) * B(x,y), where indices are defined over a continuous domain. This work expands the traditional tensor model to include continuous tensors. Our implementation supports piecewise-constant tensors, enabling infinite domains to be processed in finite time. We also introduce a new tensor format for efficient storage and a code generation technique for automatic kernel generation. For the first time, our abstraction expresses domains like computational geometry and computer graphics in the language of tensor programming. Our approach demonstrates competitive or better performance than hand-optimized kernels in leading libraries across diverse applications. Compared to hand-implemented libraries on a CPU, our compiler-based implementation achieves an average speedup of 9.20x on 2D radius search with approximately 60x fewer lines of code (LoC), 1.22x on genomic interval overlapping queries (with approximately 18x LoC saving), and 1.69x on trilinear interpolation in Neural Radiance Field (with approximately 6x LoC saving).
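To make the idea of real-valued indices concrete, here is a minimal one-dimensional sketch of a piecewise-constant tensor with an element-wise product. It is an illustration under stated assumptions, not the paper's storage format or compiler: the `PiecewiseConstant` class, its breakpoint representation, and the convention that the tensor is zero outside its breakpoints are all hypothetical choices made for this example.

```python
from bisect import bisect_right


class PiecewiseConstant:
    """1D tensor over the reals: values[i] holds on the half-open interval
    [breakpoints[i], breakpoints[i + 1]); the tensor is 0.0 everywhere else."""

    def __init__(self, breakpoints, values):
        assert len(breakpoints) == len(values) + 1
        assert all(a < b for a, b in zip(breakpoints, breakpoints[1:]))
        self.breakpoints = list(breakpoints)
        self.values = list(values)

    def __getitem__(self, x):
        # Locate the interval whose left endpoint is the largest one <= x.
        i = bisect_right(self.breakpoints, x) - 1
        if 0 <= i < len(self.values):
            return self.values[i]
        return 0.0

    def __mul__(self, other):
        # Both operands are constant between consecutive merged breakpoints,
        # so one evaluation at each left endpoint covers the whole interval.
        cuts = sorted(set(self.breakpoints) | set(other.breakpoints))
        values = [self[left] * other[left] for left in cuts[:-1]]
        return PiecewiseConstant(cuts, values)


A = PiecewiseConstant([0.0, 2.0, 5.0], [1.0, 3.0])
B = PiecewiseConstant([1.0, 4.0], [2.0])
C = A * B          # C(x) = A(x) * B(x) over the whole real line
print(A[3.14])     # 3.0
print(C[3.14])     # 6.0
print(C[4.5])      # 0.0, since B is zero outside [1.0, 4.0)
```

The product touches only the five merged breakpoints even though the index domain is infinite, which is the sense in which a piecewise-constant representation lets an infinite domain be processed in finite time.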

2024-07-01T19:19:30Z Proceedings of the ACM on Programming Languages, Volume 9, Issue OOPSLA2, 2025 Jaeyeon Won Willow Ahrens Teodoro Fields Collin Joel S. Emer Saman Amarasinghe 10.1145/3763146 http://arxiv.org/abs/2510.19873v1 From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph 2025-10-22T08:33:44Z

Despite significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific tasks. While SLMs can match LLMs on domain-specific tasks, their limited reasoning abilities lead to suboptimal performance in complex CUDA generation according to our experiments. To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling the combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration. We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity to evaluate models more comprehensively. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.

2025-10-22T08:33:44Z Junfeng Gong Zhiyi Wei Junying Chen Cheng Liu Huawei Li http://arxiv.org/abs/2510.19279v1 Code Sharing in Healthcare Research: A Practical Guide and Recommendations for Good Practice 2025-10-22T06:29:08Z

As computational analysis becomes increasingly complex in health research, transparent sharing of analytical code is vital for reproducibility and trust. This practical guide, aligned to open science practices, outlines actionable recommendations for code sharing in healthcare research. Emphasising the FAIR (Findable, Accessible, Interoperable, Reusable) principles, the authors address common barriers and provide clear guidance to help make code more robust, reusable, and scrutinised as part of the scientific record. This supports better science and more reliable evidence for computationally driven practice and helps to adhere to new standards and guidelines on code sharing mandated by publishers and funding bodies.

2025-10-22T06:29:08Z Lukas Hughes-Noehrer Matthew J Parkes Andrew Stewart Anthony J Wilson Gary S Collins Richard D Riley Maya Mathur Matthew P Fox Nazrul Islam Paul N Zivich Timothy J Feeney http://arxiv.org/abs/2508.15157v3 Big-Stop Semantics: Small-Step Semantics in a Big-Step Judgment 2025-10-22T02:54:31Z

As is evident in the programming language literature, many practitioners favor specifying dynamic program behavior using big-step over small-step semantics. Unlike small-step semantics, which must dwell on every intermediate program state, big-step semantics conveniently jumps directly to the ever-important result of the computation. Big-step semantics also typically involves fewer inference rules than its small-step counterpart. However, in exchange for ergonomics, big-step semantics gives up power: Small-step semantics describes program behaviors that are outside the grasp of big-step semantics, notably divergence. This work presents a little-known extension of big-step semantics with inductive definitions that captures diverging computations without introducing error states. This big-stop semantics is illustrated for typed, untyped, and effectful variants of PCF, as well as a while-loop-based imperative language. Big-stop semantics extends the standard big-step inference rules with a few additional rules to define an evaluation judgment that is equivalent to the reflexive-transitive closure of small-step transitions. This simple extension contrasts with other solutions in the literature that sacrifice ergonomics by introducing many additional inference rules, global state, and/or less-commonly-understood reasoning principles like coinduction. The ergonomics of big-stop semantics is exemplified via concise Agda proofs for some key results and compilation theorems.

2025-08-21T01:33:50Z David M Kahn Jan Hoffmann Runming Li http://arxiv.org/abs/2511.01872v1 Learned Cost Model for Placement on Reconfigurable Dataflow Hardware 2025-10-21T22:45:45Z

Mapping a dataflow graph of an ML model onto a reconfigurable system is difficult, as different mappings have different throughputs and consume resource constraints differently. To solve this, a model to evaluate the throughput of mappings is necessary, as measuring throughput completely is expensive. Many use a hand-designed analytical model, relying on proxy features or intuition, which introduces error. We provide a learned approach that predicts throughput 31%-52% more accurately over a variety of graphs. In addition, our approach shows no accuracy degradation after removing performance annotations. We show that using this approach results in 5.6% faster compiled graphs.

2025-10-21T22:45:45Z 7 pages, 2 figures, 2 tables, DAC Conference style (2022) Etash Guha Tianxiao Jiang Andrew Deng Jian Zhang Muthu Annamalai http://arxiv.org/abs/2510.19012v1 Comparative analysis of large data processing in Apache Spark using Java, Python and Scala 2025-10-21T18:54:21Z

This study presents a comparative analysis of handling large datasets on the Apache Spark platform in the Java, Python, and Scala programming languages. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations, including downloading data from CSV files, transforming it, and loading it into an Apache Iceberg analytical table. It was found that the performance of the Spark algorithm varies significantly depending on the amount of data and the programming language used. When processing a 5-megabyte CSV file, the best result was achieved in Python: 6.71 seconds, compared with 9.13 seconds for Scala and 9.62 seconds for Java. For processing a large CSV file of 1.6 gigabytes, all programming languages demonstrated similar results: the fastest performance was shown in Python: 46.34 seconds, while Scala and Java showed results of 47.72 and 50.56 seconds, respectively. When performing a more complex operation that involved combining two CSV files into a single dataset for further loading into an Apache Iceberg table, Scala demonstrated the highest performance, at 374.42 seconds. Java processing was completed in 379.8 seconds, while Python was the least efficient, with a runtime of 398.32 seconds. It follows that the programming language significantly affects the efficiency of data processing by the Apache Spark algorithm, with Scala and Java being more productive for processing large amounts of data and complex operations, while Python demonstrates an advantage in working with small amounts of data. The results obtained can be useful for optimizing data handling processes depending on specific performance requirements and the amount of information being processed.
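The timings quoted in this abstract are easier to compare as relative differences. The short sketch below uses only the numbers reported above; the `timings` layout, the workload labels, and the `relative_slowdowns` helper are hypothetical names introduced for this example.

```python
# Runtimes in seconds, copied from the abstract above.
timings = {
    "5 MB CSV": {"Python": 6.71, "Scala": 9.13, "Java": 9.62},
    "1.6 GB CSV": {"Python": 46.34, "Scala": 47.72, "Java": 50.56},
    "two-file join": {"Scala": 374.42, "Java": 379.8, "Python": 398.32},
}


def relative_slowdowns(seconds_by_language):
    """Percent slower than the fastest language for one workload."""
    best = min(seconds_by_language.values())
    return {lang: round(100.0 * (t - best) / best, 1)
            for lang, t in seconds_by_language.items()}


for workload, seconds in timings.items():
    print(workload, relative_slowdowns(seconds))
# 5 MB CSV {'Python': 0.0, 'Scala': 36.1, 'Java': 43.4}
# 1.6 GB CSV {'Python': 0.0, 'Scala': 3.0, 'Java': 9.1}
# two-file join {'Scala': 0.0, 'Java': 1.4, 'Python': 6.4}
```

On these figures the gap between the fastest and slowest language is about 43% for the 5-megabyte file and under 10% for both the 1.6-gigabyte file and the two-file join.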

2025-10-21T18:54:21Z CITI 2025, 3rd International Workshop on Computer Information Technologies in Industry 4.0, June 11-12, 2025, Ternopil, Ukraine. The article includes 10 pages, 5 figures, 9 tables Ivan Borodii Illia Fedorovych Halyna Osukhivska Diana Velychko Roman Butsii http://arxiv.org/abs/2510.19853v1 A Specification's Realm: Characterizing the Knowledge Required for Executing a Given Algorithm Specification 2025-10-21T18:25:54Z

An algorithm specification in natural language or pseudocode is expected to be clear and explicit enough to enable mechanical execution. In this position paper we contribute an initial characterization of the knowledge that an executing agent, human or machine, should possess in order to be able to carry out the instructions of a given algorithm specification as a stand-alone entity, independent of any system implementation. We argue that, for that algorithm specification, such prerequisite knowledge, whether unique or shared with other specifications, can be summarized in a document of practical size. We term this document the realm of the algorithm specification. The generation of such a realm is itself a systematic analytical process, significant parts of which can be automated with the help of large language models and the reuse of existing documents. The algorithm-specification's realm would consist of specification language syntax and semantics, domain knowledge restricted to the referenced entities, inter-entity relationships, relevant underlying cause-and-effect rules, and detailed instructions and means for carrying out certain operations. Such characterization of the realm can contribute to methodological implementation of the algorithm specification in diverse systems and to its formalization for mechanical verification. The paper also touches upon the question of assessing execution faithfulness, which is distinct from correctness: in the absence of a reference interpretation of natural language or pseudocode specification with a given vocabulary, how can we determine if an observed agent's execution indeed complies with the input specification.

2025-10-21T18:25:54Z Assaf Marron David Harel