https://arxiv.org/api/u57LU5x7CU1/dX0Y3r6tvM/sT7s 2026-06-21T21:49:17Z 9918 825 15 http://arxiv.org/abs/2512.19250v1 Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems 2025-12-22T10:34:45Z

Traditional auto-parallelizing compilers, reliant on rigid heuristics, struggle with the complexity of modern heterogeneous systems. This paper presents a comprehensive evaluation of small (approximately 1B parameter) language-model-driven compiler auto-parallelization. We evaluate three models: gemma3, llama3.2, and qwen2.5, using six reasoning strategies across 11 real-world kernels drawn from scientific computing, graph algorithms, and machine learning. Our system is benchmarked against strong compiler baselines, including LLVM Polly, TVM, and Triton. Across 376 total evaluations, the proposed approach achieves an average speedup of 6.81x and a peak performance of 43.25x on convolution operations. We analyze scalability, verify correctness using multiple sanitizers, and confirm robustness across diverse compilers and hardware platforms. Our results demonstrate that small, efficient language models can serve as powerful reasoning engines for complex compiler optimization tasks.

2025-12-22T10:34:45Z Accepted at NeurIPS 2025 ML for Systems Workshop Prathamesh Devadiga http://arxiv.org/abs/2512.19769v1 A Declarative Language for Building And Orchestrating LLM-Powered Agent Workflows 2025-12-22T05:03:37Z

Building deployment-ready LLM agents requires complex orchestration of tools, data sources, and control flow logic, yet existing systems tightly couple agent logic to specific programming languages and deployment models. We present a declarative system that separates agent workflow specification from implementation, enabling the same pipeline definition to execute across multiple backend languages (Java, Python, Go) and deployment environments (cloud-native, on-premises). Our key insight is that most agent workflows consist of common patterns -- data serialization, filtering, RAG retrieval, API orchestration -- that can be expressed through a unified DSL rather than imperative code. This approach transforms agent development from application programming to configuration, where adding new tools or fine-tuning agent behaviors requires only pipeline specification changes, not code deployment. Our system natively supports A/B testing of agent strategies, allowing multiple pipeline variants to run on the same backend infrastructure with automatic metric collection and comparison. We evaluate our approach on real-world e-commerce workflows at PayPal, processing millions of daily interactions. Our results demonstrate 60% reduction in development time, and 3x improvement in deployment velocity compared to imperative implementations. The language's declarative approach enables non-engineers to modify agent behaviors safely, while maintaining sub-100ms orchestration overhead. We show that complex workflows involving product search, personalization, and cart management can be expressed in under 50 lines of DSL compared to 500+ lines of imperative code.

2025-12-22T05:03:37Z Ivan Daunis http://arxiv.org/abs/2512.18842v1 DafnyMPI: A Dafny Library for Verifying Message-Passing Concurrent Programs 2025-12-21T18:16:47Z

The Message Passing Interface (MPI) is widely used in parallel, high-performance programming, yet writing bug-free software that uses MPI remains difficult. We introduce DafnyMPI, a novel, scalable approach to formally verifying MPI software. DafnyMPI allows proving deadlock freedom, termination, and functional equivalence with simpler sequential implementations. In contrast to existing specialized frameworks, DafnyMPI avoids custom concurrency logics and instead relies on Dafny, a verification-ready programming language used for sequential programs, extending it with concurrent reasoning abilities. DafnyMPI is implemented as a library that enables safe MPI programming by requiring users to specify the communication topology upfront and to verify that calls to communication primitives such as MPI_ISEND and MPI_WAIT meet their preconditions. We formalize DafnyMPI using a core calculus and prove that the preconditions suffice to guarantee deadlock freedom. Functional equivalence is proved via rely-guarantee reasoning over message payloads and a system that guarantees safe use of read and write buffers. Termination and the absence of runtime errors are proved using standard Dafny techniques. To further demonstrate the applicability of DafnyMPI, we verify numerical solutions to three canonical partial differential equations. We believe DafnyMPI demonstrates how to make formal verification viable for a broader class of programs and provides proof engineers with additional tools for software verification of parallel and concurrent systems.

2025-12-21T18:16:47Z To appear in Proceedings of the ACM on Programming Languages (POPL) Aleksandr Fedchin Antero Mejr Hari Sundar Jeffrey S. Foster 10.1145/3776705 http://arxiv.org/abs/2502.00138v2 JustAct+: A Framework for Auditable Multi-Agent Systems Regulated by Inter-Organisational Policies 2025-12-20T02:03:05Z

In open multi-agent agent systems that cross organisational boundaries, agent actions must be regulated by complex policies. Consider medical data processing systems, which must observe generic laws (e.g., EU data protection regulations) and also specific participants' resource conditions (e.g., Bob consents to sharing his X-Rays with EU hospitals). Presently, we address the implementation of these systems as distributed software. Solutions to key sub-problems are available: existing policy languages capture the necessary normative concepts and formalise the computational representation and reasoning about policies, and existing distributed algorithms and protocols coordinate agents' changing actions and policies. But which policies and protocols are useful in application? With the JustAct framework, we characterise a class of multi-agent systems where actors justify their actions with sufficient policy information collected from dynamic policy statements and agreements. We prove key properties of these systems, e.g., any decision that an action is permitted now cannot be refuted later, regardless of any added statements or updated agreements. We study a particular instance of the framework by specifying (in Rocq) and implementing (in Rust) a particular policy language and runtime system for mediating agent communications. We demonstrate and assess JustAct via a case study of this implementation: we reproduce the usage scenarios of Brane, an existing policy-regulated, inter-domain, medical data processing system.

2025-01-31T19:43:48Z Christopher A. Esterhuyse Tim Müller L. Thomas van Binsbergen http://arxiv.org/abs/2512.18134v1 Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs 2025-12-19T23:34:56Z

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.

2025-12-19T23:34:56Z Rupanshu Soi Rohan Yadav Fredrik Kjolstad Alex Aiken Maryam Mehri Dehnavi Michael Garland Michael Bauer http://arxiv.org/abs/2512.17616v1 Multi-Language Benchmark Generation via L-Systems 2025-12-19T14:19:13Z

L-systems are a mathematical formalism proposed by biologist Aristid Lindenmayer with the aim of simulating organic structures such as trees, snowflakes, flowers, and other branching phenomena. They are implemented as a formal language that defines how patterns can be iteratively rewritten. This paper describes how such a formalism can be used to create artificial programs written in programming languages such as C, C++, Julia and Go. These programs, being large and complex, can be used to test the performance of compilers, operating systems, and computer architectures. This paper demonstrates the usefulness of these benchmarks through multiple case studies. These case studies include a comparison between clang and gcc; a comparison between C, C++, Julia and Go; a study of the historical evolution of gcc in terms of code quality; a look into the effects of profile guided optimizations in gcc; an analysis of the asymptotic behavior of the different phases of clang's compilation pipeline; and a comparison between the many data structures available in the Gnome Library (GLib). These case studies demonstrate the benefits of the L-System approach to create benchmarks, when compared with fuzzers such as CSmith, which were designed to uncover bugs in compilers, rather than evaluating their performance.

2025-12-19T14:19:13Z 29 pages, 19 figures, 1 table and 46 references Vinícius Francisco da Silva Heitor Leite Fernando Magno Quintão Pereira http://arxiv.org/abs/2504.11654v2 Pathological Cases for a Class of Reachability-Based Garbage Collectors 2025-12-17T22:18:23Z

Although existing garbage collectors (GCs) perform extremely well on typical programs, there still exist pathological programs for which modern GCs significantly degrade performance. This observation begs the question: might there exist a 'holy grail' GC algorithm, as yet undiscovered, guaranteeing both constant-length pause times and that memory is collected promptly upon becoming unreachable? For decades, researchers have understood that such a GC is not always possible, i.e., some pathological behavior is unavoidable when the program can make heap cycles and operates near the memory limit, regardless of the GC algorithm used. However, this understanding has until now been only informal, lacking a rigorous formal proof. This paper complements that informal understanding with a rigorous proof, showing with mathematical certainty that every GC that can implement a realistic mutator-observer interface has some pathological program that forces it to either introduce a long GC pause into program execution or reject an allocation even though there is available space. Hence, language designers must either accept these pathological scenarios and design heuristic approaches that minimize their impact (e.g., generational collection), or restrict programs and environments to a strict subset of the behaviors allowed by our mutator-observer-style interface (e.g., by enforcing a type system that disallows cycles or overprovisioning memory).

2025-04-15T22:42:03Z OOPSLA 2025. This version corrects misuse of connectedness vs. reachability terminology Matthew Sotoudeh http://arxiv.org/abs/2512.15679v1 A High-level Synthesis Toolchain for the Julia Language 2025-12-17T18:32:06Z

With the push towards Exascale computing and data-driven methods, problem sizes have increased dramatically, increasing the computational requirements of the underlying algorithms. This has led to a push to offload computations to general purpose hardware accelerators such as GPUs and TPUs, and a renewed interest in designing problem-specific accelerators using FPGAs. However, the development process of these problem-specific accelerators currently suffers from the "two-language problem": algorithms are developed in one (usually higher-level) language, but the kernels are implemented in another language at a completely different level of abstraction and requiring fundamentally different expertise. To address this problem, we propose a new MLIR-based compiler toolchain that unifies the development process by automatically compiling kernels written in the Julia programming language into SystemVerilog without the need for any additional directives or language customisations. Our toolchain supports both dynamic and static scheduling, directly integrates with the AXI4-Stream protocol to interface with subsystems like on- and off-chip memory, and generates vendor-agnostic RTL. This prototype toolchain is able to synthesize a set of signal processing/mathematical benchmarks that can operate at 100MHz on real FPGA devices, achieving between 59.71% and 82.6% of the throughput of designs generated by state-of-the-art toolchains that only compile from low-level languages like C or C++. Overall, this toolchain allows domain experts to write compute kernels in Julia as they normally would, and then retarget them to an FPGA without additional pragmas or modifications.

2025-12-17T18:32:06Z Extended version of poster abstract accepted for presentation at ISFPGA'26 Benedict Short Ian McInerney John Wickerson http://arxiv.org/abs/2512.15834v1 Optimizing Agentic Language Model Inference via Speculative Tool Calls 2025-12-17T18:22:44Z

Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms to provide insights into speculation configurations that will yield the best performance. Further, we recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.

2025-12-17T18:22:44Z Daniel Nichols Prajwal Singhania Charles Jekel Abhinav Bhatele Harshitha Menon http://arxiv.org/abs/2512.15816v1 A Neurosymbolic Approach to Loop Invariant Generation via Weakest Precondition Reasoning 2025-12-17T14:16:59Z

Loop invariant generation remains a critical bottleneck in automated program verification. Recent work has begun to explore the use of Large Language Models (LLMs) in this area, yet these approaches tend to lack a reliable and structured methodology, with little reference to existing program verification theory. This paper presents NeuroInv, a neurosymbolic approach to loop invariant generation. NeuroInv comprises two key modules: (1) a neural reasoning module that leverages LLMs and Hoare logic to derive and refine candidate invariants via backward-chaining weakest precondition reasoning, and (2) a verification-guided symbolic module that iteratively repairs invariants using counterexamples from OpenJML. We evaluate NeuroInv on a comprehensive benchmark of 150 Java programs, encompassing single and multiple (sequential) loops, multiple arrays, random branching, and noisy code segments. NeuroInv achieves a $99.5\%$ success rate, substantially outperforming the other evaluated approaches. Additionally, we introduce a hard benchmark of $10$ larger multi-loop programs (with an average of $7$ loops each); NeuroInv's performance in this setting demonstrates that it can scale to more complex verification scenarios.

2025-12-17T14:16:59Z Daragh King Vasileios Koutavas Laura Kovacs http://arxiv.org/abs/2601.15294v1 KnowTeX: Visualizing Mathematical Dependencies 2025-12-16T18:24:28Z

Mathematical knowledge exists in many forms, ranging from informal textbooks and lecture notes to large formal proof libraries, yet moving between these representations remains difficult. Informal texts hide dependencies, while formal systems expose every detail in ways that are not always human-readable. Dependency graphs offer a middle ground by making visible the structure of results, definitions, and proofs. We present KnowTeX, a standalone, user-friendly tool that extends the ideas of Lean's Blueprints, enabling the visualization of conceptual dependencies directly from LaTeX sources. Using a simple "uses" command, KnowTeX extracts relationships among statements and generates previewable graphs in DOT and TikZ formats. Applied to mathematical texts, such graphs clarify core results, support education and formalization, and provide a resource for aligning informal and formal mathematical representations. We argue that dependency graphs should become a standard feature of mathematical writing, benefiting both human readers and automated systems.

2025-12-16T18:24:28Z Elif Uskuplu Lawrence S. Moss Valeria de Paiva http://arxiv.org/abs/2512.14045v1 A Deep Dive into Function Inlining and its Security Implications for ML-based Binary Analysis 2025-12-16T03:21:59Z

A function inlining optimization is a widely used transformation in modern compilers, which replaces a call site with the callee's body in need. While this transformation improves performance, it significantly impacts static features such as machine instructions and control flow graphs, which are crucial to binary analysis. Yet, despite its broad impact, the security impact of function inlining remains underexplored to date. In this paper, we present the first comprehensive study of function inlining through the lens of machine learning-based binary analysis. To this end, we dissect the inlining decision pipeline within the LLVM's cost model and explore the combinations of the compiler options that aggressively promote the function inlining ratio beyond standard optimization levels, which we term extreme inlining. We focus on five ML-assisted binary analysis tasks for security, using 20 unique models to systematically evaluate their robustness under extreme inlining scenarios. Our extensive experiments reveal several significant findings: i) function inlining, though a benign transformation in intent, can (in)directly affect ML model behaviors, being potentially exploited by evading discriminative or generative ML models; ii) ML models relying on static features can be highly sensitive to inlining; iii) subtle compiler settings can be leveraged to deliberately craft evasive binary variants; and iv) inlining ratios vary substantially across applications and build configurations, undermining assumptions of consistency in training and evaluation of ML models.

2025-12-16T03:21:59Z Omar Abusabha Jiyong Uhm Tamer Abuhmed Hyungjoon Koo http://arxiv.org/abs/2512.15788v1 Automated Formalization of Probabilistic Requirements from Structured Natural Language 2025-12-15T20:20:27Z

Integrating autonomous and adaptive behavior into software-intensive systems presents significant challenges for software development, as uncertainties in the environment or decision-making processes must be explicitly captured. These challenges are amplified in safety- and mission-critical systems, which must undergo rigorous scrutiny during design and development. Key among these challenges is the difficulty of specifying requirements that use probabilistic constructs to capture the uncertainty affecting these systems. To enable formal analysis, such requirements must be expressed in precise mathematical notations such as probabilistic logics. However, expecting developers to write requirements directly in complex formalisms is unrealistic and highly error-prone. We extend the structured natural language used by NASA's Formal Requirement Elicitation Tool (FRET) with support for the specification of unambiguous and correct probabilistic requirements, and develop an automated approach for translating these requirements into logical formulas. We propose and develop a formal, compositional, and automated approach for translating structured natural-language requirements into formulas in probabilistic temporal logic. To increase trust in our formalizations, we provide assurance that the generated formulas are well-formed and conform to the intended semantics through an automated validation framework and a formal proof. The extended FRET tool enables developers to specify probabilistic requirements in structured natural language, and to automatically translate them into probabilistic temporal logic, making the formal analysis of autonomous and adaptive systems more practical and less error-prone.

2025-12-15T20:20:27Z Official website https://github.com/NASA-SW-VnV/fret/releases/tag/v3.0.0 Anastasia Mavridou Marie Farrell Gricel Vázquez Tom Pressburger Timothy E. Wang Radu Calinescu Michael Fisher http://arxiv.org/abs/2507.10301v2 Rows and Capabilities as Modal Effects 2025-12-15T18:49:57Z

Effect handlers allow programmers to model and compose computational effects modularly. Effect systems statically guarantee that all effects are handled. Several recent practical effect systems are based on either row polymorphism or capabilities. However, there remains a gap in understanding the precise relationship between effect systems with such disparate foundations. The main difficulty is that in both row-based and capability-based systems, effect tracking is typically entangled with other features such as functions. We propose a uniform framework for encoding, analysing, and comparing effect systems. Our framework exploits and generalises modal effect types, a recent novel effect system which decouples effect tracking from functions via modalities. Modalities offer fine-grained control over when and how effects are tracked, enabling us to express different strategies for effect tracking. We give encodings as macro translations from existing row-based and capability-based effect systems into our framework and show that these encodings preserve types and semantics. Our encodings reveal the essence of effect tracking mechanisms in different effect systems, enable a direct analysis on their differences, and provide practical insights on language design.

2025-07-14T14:04:15Z Wenhao Tang Sam Lindley http://arxiv.org/abs/2505.08091v2 LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping 2025-12-15T16:10:37Z

We describe LEGO, a new approach to optimizing data movement whereby code is expressed as a layout-independent computation and composed with layouts for data and computation. This code generator organization derives complex indexing expressions associated with hierarchical parallel code and data movement for GPUs. LEGO maps from layout specification to indexing expressions, and can be integrated into existing compilers and code templates. It facilitates the exploration of data layouts in combination with other optimizations. We demonstrate LEGO's integration with the Triton and MLIR compilers, and with CUDA templates. We show that LEGO is capable of deriving performance competitive with Triton, and shows broad applicability for data and thread layout mapping optimizations in its integration with CUDA and MLIR.

2025-05-12T21:53:09Z Amir Mohammad Tavakkoli Cosmin Oancea Mary Hall