Closure Conversion, Flat Environments, and the Complexity of Abstract Machines

2025-07-21T17:52:29Z

Closure conversion is a program transformation at work in compilers for functional languages to turn inner functions into global ones, by building closures pairing the transformed functions with the environment of their free variables. Abstract machines rely on similar and yet different concepts of closures and environments. In this paper, we study the relationship between the two approaches. We adopt a very simple λ-calculus with tuples as source language and study abstract machines for both the source language and the target of closure conversion. Moreover, we focus on the simple case of flat closures/environments, that is, with no sharing of environments. We provide three contributions. Firstly, a new simple proof technique for the correctness of closure conversion, inspired by abstract machines. Secondly, we show how the closure invariants of the target language allow us to design a new way of handling environments in abstract machines, not suffering the shortcomings of other styles. Thirdly, we study the machines from the point of view of time complexity, adapting analyses by Accattoli and co-authors. We show that closure conversion decreases various dynamic costs while increasing the size of the initial code. Despite these changes, the overall complexity of the machines before and after closure conversion turns out to be the same.
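
As a rough illustration of the transformation described above, the following Python sketch converts one inner function by hand into a global function paired with a flat environment tuple. The paper itself works on a λ-calculus with tuples, not Python, and the names `make_adder`, `adder_code`, `make_adder_converted`, and `apply_closure` are hypothetical.

```python
# Source program: the inner function refers to the free variable `n`.
def make_adder(n):
    return lambda x: x + n


# After closure conversion: the body becomes a global, closed function
# that receives its environment as an explicit extra argument.
def adder_code(env, x):
    (n,) = env  # flat environment: one tuple slot per free variable
    return x + n


def make_adder_converted(n):
    # A closure is a pair of closed code and its flat environment.
    return (adder_code, (n,))


def apply_closure(closure, arg):
    # Application unpacks the pair and passes the environment explicitly.
    code, env = closure
    return code(env, arg)


add5 = make_adder_converted(5)
result = apply_closure(add5, 10)
```

The environment is flat in the sense that each closure carries its own tuple of free-variable values, with no sharing between closures.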

Formal Analysis of Networked PLC Controllers Interacting with Physical Environments

2025-07-21T13:18:50Z

Programmable Logic Controllers (PLCs) are widely used in industrial automation to control physical systems. As PLC applications become increasingly complex, ensuring their correctness is crucial. Existing formal verification techniques focus on individual PLC programs in isolation, often neglecting interactions with physical environments and network communication between controllers. This limitation poses significant challenges in analyzing real-world industrial systems, where continuous dynamics and communication delays play a critical role. In this paper, we present a unified formal framework that integrates discrete PLC semantics, networked communication, and continuous physical behaviors. To mitigate state explosion, we apply partial order reduction, significantly reducing the number of explored states while maintaining correctness. Our framework enables precise analysis of PLC-driven systems with continuous dynamics and networked communication.

From Provable Correctness to Probabilistic Generation: A Comparative Review of Program Synthesis Paradigms

2025-07-21T11:33:57Z

Program synthesis--the automated generation of executable code from high-level specifications--has been a central goal of computer science for over fifty years. This thesis provides a comparative literature review of the main paradigms that have shaped the field, tracing its evolution from formal logic-based methods to recent advances using large-scale neural models. We examine five key approaches: logic-based (deductive) synthesis, inductive (example-based) synthesis, sketch/schema-based synthesis, large language model-based synthesis, and neuro-symbolic hybrids. For each, we analyze foundational principles, notable systems, and practical applications, highlighting trade-offs between correctness guarantees, specification requirements, search complexity, and expressive power. By reviewing developments from formally verified synthesis tools such as KIDS and Coq to data-driven models like Codex that generate code probabilistically from natural language, we present a comprehensive narrative of progress and ongoing challenges. This work emphasizes the transition from symbolic to hybrid neuro-symbolic methods and outlines future directions for reliable and scalable program synthesis.

Quantum Programming in Polylogarithmic Time

2025-07-21T09:12:38Z

Polylogarithmic time delineates a relevant notion of feasibility on several classical computational models such as Boolean circuits or parallel random access machines. As far as the quantum paradigm is concerned, this notion yields the complexity class FBQPOLYLOG of functions approximable in polylogarithmic time with a quantum random-access Turing machine. We introduce a quantum programming language with first-order recursive procedures, which provides the first programming-language-based characterization of FBQPOLYLOG. Each program computes a function in FBQPOLYLOG (soundness) and, conversely, each function of this complexity class is computed by a program (completeness). We also provide a compilation strategy from programs to uniform families of quantum circuits of polylogarithmic depth and polynomial size, whose set of computed functions is known as QNC, and recover the well-known separation result FBQPOLYLOG $\subsetneq$ QNC.

A Few Fit Most: Improving Performance Portability of SGEMM on GPUs using Multi-Versioning

2025-07-21T06:29:28Z

Hand-optimizing linear algebra kernels for different GPU devices and applications is complex and labor-intensive. Instead, many developers use automatic performance tuning (autotuning) to achieve high performance on a variety of devices. However, autotuning "overfits", and must be redone if any part of the environment changes, such as if the device or input characteristics change. In most non-trivial cases, a single compute kernel cannot maintain near-optimal performance across all environments. Changing the kernel to specialize it to the current execution environment is possible, but on GPUs, runtime tuning and compilation can be expensive. In this work, we use multi-versioning -- producing several variants of the same code -- as a way to generate performance portable code. We describe a framework called portability tuning that can automatically generate multi-versioned code whose performance is portable, requiring no retuning. We evaluate our framework on a dataset of execution times for GEMM kernels from the CLBlast linear algebra library. We find our portability tuning techniques outperform CLBlast's default kernels -- often approaching within 10% of the theoretical maximum performance -- despite CLBlast using autotuning techniques. Further, we find that our generated programs generalize well to new and unseen devices, matching the performance of autotuning without ever portability tuning for those devices.
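
To make the idea of picking a few variants that fit most environments concrete, here is a minimal sketch. It assumes a greedy selection over a table of measured execution times, which is not necessarily how the paper's portability tuning framework works. The timing table, the variant names, and the `portability` metric (mean ratio of the best possible time to the best time among the chosen variants) are all made up for illustration.

```python
# Hypothetical execution times (ms) of three kernel variants on four
# environments. These numbers are invented for illustration only.
TIMES = {
    "v_small_tiles": [1.0, 4.0, 2.0, 5.0],
    "v_large_tiles": [3.0, 1.0, 4.0, 1.2],
    "v_balanced": [1.5, 1.5, 2.5, 2.0],
}


def portability(chosen, times):
    """Mean over environments of (optimal time / best time among chosen)."""
    n_envs = len(next(iter(times.values())))
    total = 0.0
    for env in range(n_envs):
        optimum = min(t[env] for t in times.values())
        achieved = min(times[v][env] for v in chosen)
        total += optimum / achieved
    return total / n_envs


def greedy_select(times, k):
    """Greedily add the variant that most improves the portability score."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in times if v not in chosen]
        best = max(candidates, key=lambda v: portability(chosen + [v], times))
        chosen.append(best)
    return chosen


selected = greedy_select(TIMES, 2)
score = portability(selected, TIMES)
```

On this toy table no single variant is best everywhere, but two variants together cover every environment, which is the intuition behind shipping a small multi-versioned set instead of retuning.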

Timetide: A programming model for logically synchronous distributed systems

2025-07-19T04:04:33Z

Massive strides in deterministic models have been made using synchronous languages. They are mainly focused on centralised applications, as the traditional approach is to compile away the concurrency. Time-triggered languages such as Giotto and Lingua Franca are suitable for distribution, but they rely on physical clock synchronisation, which is expensive and may suffer from scalability issues. Hence, deterministic programming of distributed systems remains challenging. We address the challenges of deterministic distribution by developing a novel multiclock semantics of synchronous programs. The developed semantics is amenable to seamless distribution. Moreover, our programming model, Timetide, alleviates the need for physical clock synchronisation by building on the recently proposed logical synchrony model for distributed systems. We discuss the important aspects of distributing computation, such as network communication delays, and explore the formal verification of Timetide programs. To the best of our knowledge, Timetide is the first multiclock synchronous language that is both amenable to distribution and formal verification without the need for physical clock synchronisation or clock gating.

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers

2025-07-18T23:21:52Z

Neural processing units (NPUs) are gaining prominence in power-sensitive client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI workloads efficiently on these devices requires libraries of optimized kernels. Creating efficient kernels demands expertise in domain-specific C++ with vector intrinsics and in-depth knowledge of the target architecture. Unlike GPU programming, which has had years to mature, NPU programming is new, with smaller and more fragmented developer communities across hardware platforms. This fragmentation poses a challenge when utilizing LLMs to assist in writing NPU kernels, as domain-specific optimized code examples are underrepresented in LLM pre-training data. In this paper we introduce NPUEval -- a benchmark for writing and evaluating NPU kernels, consisting of 102 common operators for machine learning workloads. We evaluate LLM-generated code on actual hardware based on both functional correctness and vectorization efficiency using open source compiler tools targeting the AMD NPU. We evaluate a range of state-of-the-art LLMs with a mix of proprietary and open-weight models. The latest reasoning models, such as DeepSeek R1, show promising results, achieving 50%+ vectorization out of the box on select kernels. However, the average score across the entire dataset remains roughly 10% even with compiler feedback and vectorized kernel examples -- showing that this is a challenging dataset even for frontier models. The dataset and evaluation code will be released with a permissive open source license, providing an essential benchmark for advancing research in code generation and NPU kernel optimization.

Towards Regulated Deep Learning

2025-07-18T12:07:30Z

Regulation of Multi-Agent Systems (MAS) and Declarative Electronic Institutions (DEIs) was a multidisciplinary research topic of the past decade, involving (Physical and Software) Agents and Law from the beginning, and since 2016 it has evolved towards the so-called Robot Lawyer claimed in the news. One of the first proposals for restricting the behaviour of Software Agents was Electronic Institutions. However, with the recent reformulation of Artificial Neural Networks (ANNs) as Deep Learning (DL), Security, Privacy, Ethical and Legal issues regarding the use of DL have raised concerns in the Artificial Intelligence (AI) Community. Now that the Regulation of MAS is almost correctly addressed, we propose the Regulation of Artificial Neural Networks as Agent-based Training of a special type of regulated Artificial Neural Network that we call Institutional Neural Network (INN). The main purpose of this paper is to bring attention to Artificial Teaching (AT) and to give a tentative answer showing a proof-of-concept implementation of Regulated Deep Learning (RDL). This paper introduces the former concept and provides $I^*$, a language previously used to model declaratively and extend Electronic Institutions, as a means to regulate the execution of Artificial Neural Networks and their interactions with Artificial Teachers (ATs).

Frex: dependently-typed algebraic simplification

2025-07-18T07:57:21Z

We present a new design for an algebraic simplification library structured around concepts from universal algebra: theories, models, homomorphisms, and universal properties of free algebras and free extensions of algebras. The library's dependently typed interface guarantees that both built-in and user-defined simplification modules are terminating, sound, and complete with respect to a well-specified class of equations. We have implemented the design in the Idris 2 and Agda dependently typed programming languages and shown that it supports modular extension to new theories, proof extraction and certification, goal extraction via reflection, and interactive development.
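
The role of free algebras in simplification can be illustrated with the simplest case, monoids: interpreting a term in the free monoid on its variables (lists under concatenation) yields a normal form, and two terms are equal in every monoid exactly when their normal forms coincide. The Python sketch below shows only this normalization step. It is not Frex's interface, it produces no proofs or certificates, and the names `Var`, `Unit`, `Op`, `normalize`, and `equal_in_all_monoids` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Unit:
    pass


@dataclass(frozen=True)
class Op:
    left: object
    right: object


def normalize(term):
    """Interpret a monoid term in the free monoid: lists of variable names."""
    if isinstance(term, Var):
        return [term.name]
    if isinstance(term, Unit):
        return []  # the unit maps to the empty list
    if isinstance(term, Op):
        # the monoid operation maps to list concatenation
        return normalize(term.left) + normalize(term.right)
    raise TypeError(f"not a monoid term: {term!r}")


def equal_in_all_monoids(s, t):
    """Decide the equation s = t in the theory of monoids."""
    return normalize(s) == normalize(t)


x, y, z, e = Var("x"), Var("y"), Var("z"), Unit()
lhs = Op(Op(x, Op(e, y)), z)  # (x * (e * y)) * z
rhs = Op(x, Op(y, z))         # x * (y * z)
```

Associativity and unit laws disappear in the list representation, so `lhs` and `rhs` normalize to the same list, while `x * y` and `y * x` do not because monoids need not be commutative.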

Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

2025-07-18T06:48:39Z

When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a "Model Synthesis Architecture" (MSA) -- using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset -- built around a "Model Olympics" domain of sports vignettes -- tests models' capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.

Increasing the Expressiveness of a Gradual Verifier

2025-07-17T20:50:52Z

Static verification provides strong correctness guarantees for code; however, fully specifying programs for static verification is a complex, burdensome process for users. Gradual verification was introduced to make this process easier by supporting the verification of partially specified programs. The only currently working gradual verifier, Gradual C0, successfully verifies heap-manipulating programs, but lacks expressiveness in its specification language. This paper describes the design and implementation of an extension to Gradual C0 that supports unfolding expressions, which allow more intuitive specifications of recursive heap data structures.

AI-Assisted Fixes to Code Review Comments at Scale

2025-07-17T19:11:00Z

Aim. There are tens of thousands of code review comments each week at Meta. We developed Metamate for Code Review (MetaMateCR) that provides AI-assisted fixes for reviewer comments in production at scale. Method. We developed an internal benchmark of 64k data points to fine-tune Llama models. Once our models achieve reasonable offline results, we roll them into production. To ensure that our AI-assisted fixes do not negatively impact the time it takes to do code reviews, we conduct randomized controlled safety trials as well as full production experiments. Offline Results. As a baseline, we compare GPT-4o to our small and large Llama models. In offline results, our LargeLSFT model creates an exact match patch 68% of the time, outperforming GPT-4o by 9 percentage points (pp). The internal models also use more modern Hack functions when compared to the PHP functions suggested by GPT-4o. Safety Trial. When we roll MetaMateCR into production in a safety trial that compares no AI patches with AI patch suggestions, we see a large regression with reviewers taking over 5% longer to conduct reviews. After investigation, we modify the UX to only show authors the AI patches, and see no regressions in the time for reviews. Production. When we roll LargeLSFT into production, we see an ActionableToApplied rate of 19.7%, which is a 9.2pp improvement over GPT-4o. Our results illustrate the importance of safety trials in ensuring that AI does not inadvertently slow down engineers, and demonstrate a successful review-comment-to-AI-patch product running at scale.

Random Variate Generation with Formal Guarantees

2025-07-17T19:05:07Z

This article introduces a new approach to principled and practical random variate generation with formal guarantees. The key idea is to first specify the desired probability distribution in terms of a finite-precision numerical program that defines its cumulative distribution function (CDF), and then generate exact random variates according to this CDF. We present a universal and fully automated method to synthesize exact random variate generators given any numerical CDF implemented in any binary number format, such as floating-point, fixed-point, and posits. The method is guaranteed to operate with the same precision used to specify the CDF, does not overflow, avoids expensive arbitrary-precision arithmetic, and exposes a consistent API. The method rests on a novel space-time optimal implementation for the class of generators that attain the information-theoretically optimal Knuth and Yao entropy rate, consuming the least possible number of input random bits per output variate. We develop a random variate generation library using our method in C and evaluate it on a diverse set of "continuous" and "discrete" distributions, showing competitive runtime with the state-of-the-art GNU Scientific Library while delivering higher accuracy, entropy efficiency, and automation.
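
To give a feel for generating exact variates from a finite-precision CDF, the sketch below draws from a discrete distribution whose CDF values are multiples of 2^-k, reading random bits lazily and stopping as soon as the bits seen so far determine the outcome. This is plain lazy inversion written in Python under those assumptions. It is not the paper's generator or its C library, and it makes no claim to the optimality properties described above. The names `sample`, `cdf_numerators`, and `next_bit` are hypothetical.

```python
import random
from bisect import bisect_right


def sample(cdf_numerators, k, next_bit):
    """Draw an outcome i with P(X <= i) = cdf_numerators[i] / 2**k.

    cdf_numerators must be nondecreasing and end with 2**k.
    next_bit() must return one fair random bit (0 or 1) per call.
    """
    def bucket(m):
        # Smallest i such that m < cdf_numerators[i].
        return bisect_right(cdf_numerators, m)

    # The uniform value lies in [lo, lo + width), in units of 2**-k.
    lo, width = 0, 1 << k
    while True:
        first, last = bucket(lo), bucket(lo + width - 1)
        if first == last:
            return first  # every value in the interval gives this outcome
        width >>= 1       # otherwise refine the interval with one more bit
        if next_bit():
            lo += width


# Probabilities 1/2, 1/4, 1/4 expressed with k = 2 bits of precision.
draw = sample([2, 3, 4], 2, lambda: random.getrandbits(1))
```

Because the loop only uses integer comparisons at the precision of the CDF, the output distribution matches the specified CDF exactly, provided the bit source is fair.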

GPU Performance Portability needs Autotuning

2025-07-17T17:31:44Z

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

Secure Parsing and Serializing with Separation Logic Applied to CBOR, CDDL, and COSE

2025-07-17T15:50:49Z

Incorrect handling of security-critical data formats, particularly in low-level languages, is the root cause of many security vulnerabilities. Provably correct parsing and serialization tools that target languages like C can help. Towards this end, we present PulseParse, a library of verified parser and serializer combinators for non-malleable binary formats. Specifications and proofs in PulseParse are in separation logic, offering a more abstract and compositional interface, with full support for data validation, parsing, and serialization. PulseParse also supports a class of recursive formats -- with a focus on security and handling adversarial inputs, we show how to parse such formats with only a constant amount of stack space. We use PulseParse at scale by providing the first formalization of CBOR, a recursive, binary data format standard, with growing adoption in various industrial standards. We prove that the deterministic fragment of CBOR is non-malleable and provide EverCBOR, a verified library in both C and Rust to validate, parse, and serialize CBOR objects implemented using PulseParse. Next, we provide the first formalization of CDDL, a schema definition language for CBOR. We identify well-formedness conditions on CDDL definitions that ensure that they yield unambiguous, non-malleable formats, and implement EverCDDL, a tool that checks that a CDDL definition is well-formed, and then produces verified parsers and serializers for it. To evaluate our work, we use EverCDDL to generate verified parsers and serializers for various security-critical applications. Notably, we build a formally verified implementation of COSE signing, a standard for cryptographically signed objects. We also use our toolchain to generate verified code for other standards specified in CDDL, including DICE Protection Environment, a secure boot protocol standard.
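
Non-malleability can be seen in miniature on CBOR unsigned integers (major type 0): if a decoder accepts only the shortest encoding, each value has exactly one accepted byte string. The unverified Python sketch below covers just that one case. It is not the API of PulseParse, EverCBOR, or EverCDDL, and the names `encode_uint` and `decode_uint_strict` are hypothetical.

```python
def encode_uint(n):
    """Shortest-form CBOR encoding of an unsigned integer (major type 0)."""
    if n < 0 or n >= 1 << 64:
        raise ValueError("out of range for a CBOR unsigned integer")
    if n < 24:
        return bytes([n])                           # value in the initial byte
    if n < 1 << 8:
        return bytes([24, n])                       # 1-byte argument
    if n < 1 << 16:
        return bytes([25]) + n.to_bytes(2, "big")   # 2-byte argument
    if n < 1 << 32:
        return bytes([26]) + n.to_bytes(4, "big")   # 4-byte argument
    return bytes([27]) + n.to_bytes(8, "big")       # 8-byte argument


def decode_uint_strict(data):
    """Decode one unsigned integer, rejecting non-shortest encodings.

    Returns (value, number_of_bytes_consumed).
    """
    if not data:
        raise ValueError("empty input")
    major, info = data[0] >> 5, data[0] & 0x1F
    if major != 0:
        raise ValueError("not an unsigned integer")
    if info < 24:
        value, size = info, 1
    elif info <= 27:
        nbytes = 1 << (info - 24)
        if len(data) < 1 + nbytes:
            raise ValueError("truncated input")
        value, size = int.from_bytes(data[1:1 + nbytes], "big"), 1 + nbytes
    else:
        raise ValueError("unsupported additional information")
    # Strictness check: the bytes read must equal the canonical encoding.
    if encode_uint(value) != data[:size]:
        raise ValueError("non-shortest encoding")
    return value, size


encoded = encode_uint(500)
decoded = decode_uint_strict(encoded)
```

With the strictness check in place, the two-byte string `18 0a` is rejected even though a lenient decoder would read it as 10, so an attacker cannot produce a second valid encoding of the same value.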