https://arxiv.org/api/dAkNGxs1aSLtkRK7yzAIUg4CxhY 2026-06-23T10:22:40Z 9934 960 15 http://arxiv.org/abs/2511.00865v4 FlowLog: Efficient and Extensible Datalog via Incrementality 2025-11-17T03:56:03Z

Datalog-based languages are regaining popularity as a powerful abstraction for expressing recursive computations in domains such as program analysis and graph processing. However, existing systems often face a trade-off between efficiency and extensibility. Engines like Souffle achieve high efficiency through domain-specific designs, but lack general-purpose flexibility. Others, like RecStep, offer modularity by layering Datalog on traditional databases, but struggle to integrate Datalog-specific optimizations. This paper bridges this gap by presenting FlowLog, a new Datalog engine that uses an explicit relational IR per-rule to cleanly separate recursive control (e.g., semi-naive execution) from each rule's logical plan. This boundary lets us retain fine-grained, Datalog-aware optimizations at the logical layer, but also reuse off-the-shelf database primitives at execution. At the logical level (i.e. IR), we apply proven SQL optimizations, such as logic fusion and subplan reuse. To address high volatility in recursive workloads, we adopt a robustness-first approach that pairs a structural optimizer (avoiding worst-case joins) with sideways information passing (early filtering). Built atop Differential Dataflow--a mature framework for streaming analytics--FlowLog supports both batch and incremental Datalog and adds novel recursion-aware optimizations called Boolean (or algebraic) specialization. Our evaluation shows that FlowLog outperforms state-of-the-art Datalog engines and modern databases across a broad range of recursive workloads, achieving superior scalability while preserving a simple and extensible architecture.

2025-11-02T09:11:17Z Accepted to VLDB 2026 Hangdong Zhao Zhenghong Yu Srinag Rao Simon Frisk Zhiwei Fan Paraschos Koutris http://arxiv.org/abs/2511.12823v1 Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter 2025-11-16T23:05:11Z

Over the past few years, improving LLM code generation capabilities has been a key focus in NLP research. Despite Bengali having 242 million native speakers worldwide, it receives little attention when it comes to training LLMs. More recently, various fine-tuning and augmented generation techniques have been employed to significantly enhance code generation performance. However, they require considerable expertise and resources to utilize effectively as an end user. The goal of our work is to democratize access to powerful code generation tools in resource-constrained emerging markets, enabling users to leverage them in their native language. We introduce a novel approach that combines Test-Driven Development (TDD) and Code Interpreter (CI), utilizing open-weight models, which improves the baseline accuracy for code generation with Bengali prompts and achieves an overall accuracy of 85%. Our approach requires no finetuning and proves that even the smallest models in the same family can attain up to 98% accuracy compared to the largest models. All of our results are publicly shared in GitHub for validation and reproducibility.

2025-11-16T23:05:11Z AACL-IJCNLP 2025 Workshop BLP Shared Task 2, 6 pages, 7 figures, 3 tables Sajed Jalil Shuvo Saha Hossain Mohammad Seym http://arxiv.org/abs/2508.14394v2 Tuning Random Generators: Property-Based Testing as Probabilistic Programming 2025-11-16T03:12:55Z

Property-based testing validates software against an executable specification by evaluating it on randomly generated inputs. The standard way that PBT users generate test inputs is via generators that describe how to sample test inputs through random choices. To achieve a good distribution over test inputs, users must tune their generators, i.e., decide on the weights of these individual random choices. Unfortunately, it is very difficult to understand how to choose individual generator weights in order to achieve a desired distribution, so today this process is tedious and limits the distributions that can be practically achieved. In this paper, we develop techniques for the automatic and offline tuning of generators. Given a generator with undetermined symbolic weights and an objective function, our approach automatically learns values for these weights that optimize for the objective. We describe useful objective functions that allow users to (1) target desired distributions and (2) improve the diversity and validity of their test cases. We have implemented our approach in a novel discrete probabilistic programming system, Loaded Dice, that supports differentiation and parameter learning, and use it as a language for generators. We empirically demonstrate that our approach is effective at optimizing generator distributions according to the specified objective functions. We also perform a thorough evaluation on PBT benchmarks, demonstrating that, when automatically tuned for diversity and validity, the generators exhibit a 3.1-7.4x speedup in bug finding.

2025-08-20T03:45:13Z Extended version of OOPSLA '25 paper Ryan Tjoa Poorva Garg Harrison Goldstein Todd Millstein Benjamin Pierce Guy Van den Broeck 10.1145/3763082 http://arxiv.org/abs/2511.11939v1 Modular GPU Programming with Typed Perspectives 2025-11-14T23:28:15Z

To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to perform collective operations like Tensor Core instructions. The tension between these two mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite being called per-thread, must be executed cooperatively by groups of threads. In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers the low-level control over collective operations necessary for high performance. Our core idea is typed perspectives, which materialize, at the type level, the granularity at which the programmer is controlling the behavior of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers programmers the safety guarantees needed to confidently write modular code without sacrificing performance.

2025-11-14T23:28:15Z Manya Bansal Daniel Sainati Joseph W. Cutler Saman Amarasinghe Jonathan Ragan-Kelley http://arxiv.org/abs/2511.13769v1 Compiling to linear neurons 2025-11-14T22:19:50Z

We don't program neural networks directly. Instead, we rely on an indirect style where learning algorithms, like gradient descent, determine a neural network's function by learning from data. This indirect style is often a virtue; it empowers us to solve problems that were previously impossible. But it lacks discrete structure. We can't compile most algorithms into a neural network -- even if these algorithms could help the network learn. This limitation occurs because discrete algorithms are not obviously differentiable, making them incompatible with the gradient-based learning algorithms that determine a neural network's function. To address this, we introduce $\textsf{Cajal}$: a typed, higher-order and linear programming language intended to be a minimal vehicle for exploring a direct style of programming neural networks. We prove $\textsf{Cajal}$ programs compile to linear neurons, allowing discrete algorithms to be expressed in a differentiable form compatible with gradient-based learning. With our implementation of $\textsf{Cajal}$, we conduct several experiments where we link these linear neurons against other neural networks to determine part of their function prior to learning. Linking with these neurons allows networks to learn faster, with greater data-efficiency, and in a way that's easier to debug. A key lesson is that linear programming languages provide a path towards directly programming neural networks, enabling a rich interplay between learning and the discrete structures of ordinary programming.

2025-11-14T22:19:50Z Joey Velez-Ginorio Nada Amin Konrad Kording Steve Zdancewic http://arxiv.org/abs/2511.11445v1 AI as a component in the action research tradition of learning-by-doing 2025-11-14T16:14:57Z

We consider learning mathematics through action research, hacking, discovery, inquiry, learning-by-doing as opposed to the instruct and perform, industrial model of the 19th century. A learning model based on self-awareness, types, functions, structured drawing and formal diagrams addresses the weaknesses of drill and practice and the pitfalls of statistical prediction with Large Language Models. In other words, we build mathematics/informatics education on the activity of a professional mathematician in mathematical modelling and designing programs. This tradition emphasises the role of dialogue and doing mathematics. In the Language/Action approach the teacher designs mathematising situations that scaffold previously encountered, or not-known-how-to-solve problems for the learner while teachers and teacher/interlocutors supervise the process. A critical feature is the written-oral dialogue between the learner and the teacher. As a rule, this is 1 to 1 communication. The role of the teacher/interlocutor, a more knowledgeable other, is mostly performed by a more senior student, 1 per 5 to 7 pupils. After Doug Engelbart we propose the metaphor of human intellect augmented by digital technologies such as interactive development environments or AI. Every human has their bio and digital parts. The bio part of the learner reacts to their work through dialogue in the mind. The digital part poses questions, interprets code and proposes not necessarily sound ideas.

2025-11-14T16:14:57Z 14 pages, 2 figures Ian Benson Alexei Semenov http://arxiv.org/abs/2511.13764v1 Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels 2025-11-14T14:32:28Z

The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise-experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS)-both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging the MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile based CPU instructions. Experiments show that the generated nanokernels are of production-quality, and competitive with state-of-the-art microkernel libraries.

2025-11-14T14:32:28Z Arun Thangamani Md Asghar Ahmad Shahid Adam Siemieniuk Rolf Morel Renato Golin Alexander Heinecke http://arxiv.org/abs/2511.11292v1 The Jasmin Compiler Preserves Cryptographic Security 2025-11-14T13:26:36Z

Jasmin is a programming and verification framework for developing efficient, formally verified, cryptographic implementations. A main component of the framework is the Jasmin compiler, which empowers programmers to write efficient implementations of state-of-the-art cryptographic primitives, including post-quantum cryptographic standards. The Jasmin compiler is proven functionally correct in the Rocq prover. However, this functional correctness statement does not apply to nonterminating or probabilistic computations, which are essential features in cryptography. In this paper, we significantly enhance the guarantees of the compiler by showing, in the Rocq prover, that its front-end (25 out of 30 passes) preserves cryptographic security. To this end, we first define a Relational Hoare Logic tailored for compiler correctness proofs. We prove the soundness of our logic w.r.t. a new denotational semantics of Jasmin programs based on interaction trees. Secondly, we use our program logic to prove the functional correctness of the (unmodified) Jasmin compiler w.r.t. said semantics. Lastly, we formalize cryptographic security -- focusing on IND-CCA -- with interaction trees and prove that the Jasmin compiler preserves cryptographic security.

2025-11-14T13:26:36Z Santiago Arranz-Olmos Gilles Barthe Lionel Blatter Benjamin Grégoire Vincent Laporte Paolo Torrini http://arxiv.org/abs/2511.11264v1 Kleene Algebra 2025-11-14T12:57:16Z

This booklet serves as an introduction to Kleene Algebra (KA), a set of laws that can be used to study general equivalences between programs. It discusses how general programs can be modeled using regular expressions, how those expressions correspond to automata, and how this correspondence can be exploited to obtain the central result of KA, namely that an equivalence of regular expressions is true if and only if it can be proved using the laws of KA. Each chapter closes with a set of exercises to further build intuition and understanding, and there is an optional chapter that develops automata theory through the lens of coalgebra.

2025-11-14T12:57:16Z Tobias Kappé Alexandra Silva Jana Wagemaker http://arxiv.org/abs/2507.17087v2 Mapple: A Domain-Specific Language for Mapping Distributed Programs 2025-11-14T09:18:18Z

Optimizing parallel programs for distributed systems is a complex task, often requiring significant code modifications. Task-based programming systems improve modularity by separating performance decisions from application logic, but their mapping interfaces are low-level. We introduce Mapple, a high-level, declarative programming interface for mapping distributed applications. Mapple provides transformation primitives to resolve dimensionality mismatches between task and processor spaces, including a key primitive, decompose, that helps minimize communication volume. We implement Mapple on top of the Legion runtime by translating Mapple mappers into its low-level C++ interface. Across nine applications, including six matrix multiplication algorithms and three scientific computing workloads, Mapple reduces mapper code size by 14x and enables performance improvements of up to 1.34x over expert-written C++ mappers. In addition, the decompose primitive achieves up to 1.83x improvement over existing dimensionality-resolution heuristics.

2025-07-23T00:01:07Z Anjiang Wei Rohan Yadav Hang Song Wonchan Lee Ke Wang Alex Aiken http://arxiv.org/abs/2511.11070v1 Optimising Density Computations in Probabilistic Programs via Automatic Loop Vectorisation 2025-11-14T08:41:53Z

Probabilistic programming languages (PPLs) are a popular tool for high-level modelling across many fields. They provide a range of algorithms for probabilistic inference, which analyse models by learning their parameters from a dataset or estimating their posterior distributions. However, probabilistic inference is known to be very costly. One of the bottlenecks of probabilistic inference stems from the iteration over entries of a large dataset or a long series of random samples. Vectorisation can mitigate this cost, but manual vectorisation is error-prone, and existing automatic techniques are often ad-hoc and limited, unable to handle general repetition structures, such as nested loops and loops with data-dependent control flow, without significant user intervention. To address this bottleneck, we propose a sound and effective method for automatically vectorising loops in probabilistic programs. Our method achieves high throughput using speculative parallel execution of loop iterations, while preserving the semantics of the original loop through a fixed-point check. We formalise our method as a translation from an imperative PPL into a lower-level target language with primitives geared towards vectorisation. We implemented our method for the Pyro PPL and evaluated it on a range of probabilistic models. Our experiments show significant performance gains against an existing vectorisation baseline, achieving $1.1$--$6\times$ speedups and reducing GPU memory usage in many cases. Unlike the baseline, which is limited to a subset of models, our method effectively handled all the tested models.

2025-11-14T08:41:53Z 70 pages, 19 figures, the first two authors contributed equally to this work, accepted at POPL'26 Sangho Lim Hyoungjin Lim Wonyeol Lee Xavier Rival Hongseok Yang http://arxiv.org/abs/2511.13751v1 Inside VOLT: Designing an Open-Source GPU Compiler 2025-11-13T23:12:00Z

Recent efforts in open-source GPU research are opening new avenues in a domain that has long been tightly coupled with a few commercial vendors. Emerging open GPU architectures define SIMT functionality through their own ISAs, but executing existing GPU programs and optimizing performance on these ISAs relies on a compiler framework that is technically complex and often undercounted in open hardware development costs. To address this challenge, the Vortex-Optimized Lightweight Toolchain (VOLT) has been proposed. This paper presents its design principles, overall structure, and the key compiler transformations required to support SIMT execution on Vortex. VOLT enables SIMT code generation and optimization across multiple levels of abstraction through a hierarchical design that accommodates diverse front-end languages and open GPU hardware. To ensure extensibility as GPU architectures evolve, VOLT centralizes fundamental SIMT-related analyses and optimizations in the middle-end, allowing them to be reused across front-ends and easily adapted to emerging open-GPU variants. Through two case studies on ISA extensions and host-runtime API, this paper also demonstrates how VOLT can support extensions

2025-11-13T23:12:00Z 11 pages, 10 figures, two tables, two algorithms Shinnung Jeong Chihyo Ahn Huanzhi Pu Jisheng Zhao Hyesoon Kim Blaise Tine http://arxiv.org/abs/2511.10565v1 zkStruDul: Programming zkSNARKs with Structural Duality 2025-11-13T18:06:21Z

Non-Interactive Zero Knowledge (NIZK) proofs, such as zkSNARKS, let one prove knowledge of private data without revealing it or interacting with a verifier. While existing tooling focuses on specifying the predicate to be proven, real-world applications optimize predicate definitions to minimize proof generation overhead, but must correspondingly transform predicate inputs. Implementing these two steps separately duplicates logic that must precisely match to avoid catastrophic security flaws. We address this shortcoming with zkStruDul, a language that unifies input transformations and predicate definitions into a single combined abstraction from which a compiler can project both procedures, eliminating duplicate code and problematic mismatches. zkStruDul provides a high-level abstraction to layer on top of existing NIZK technology and supports important features like recursive proofs. We provide a source-level semantics and prove its behavior is identical to the projected semantics, allowing straightforward standard reasoning.

2025-11-13T18:06:21Z Rahul Krishnan Ashley Samuelson Emily Yao Ethan Cecchetti http://arxiv.org/abs/2511.10374v1 Modeling Layout Abstractions Using Integer Set Relations 2025-11-13T14:49:42Z

Modern deep learning compilers rely on layout abstractions to manage the complex mapping between logical tensor structures and physical memory arrangements. CuTe layouts and Triton linear layouts are widely adopted industry standards. However, these layout systems operate independently with distinct mathematical underpinnings, preventing unified formal analysis and cross-system reasoning. We bridge this gap by introducing a novel approach that leverages the Integer Set Library (ISL) to create a unified mathematical representation for both layout systems through integer set relations, thereby enabling rigorous formal analysis, correctness verification, and the foundation for future cross-system optimization strategies. Our approach models CuTe layouts through integer set relations that encode the transformation from multi-dimensional coordinates to linear indices using stride-based calculations, including sophisticated swizzle operations that perform bit-level manipulations for enhanced memory access patterns. For Triton linear layouts, we construct integer set relations that model the binary vector space transformations where arithmetic operations follow finite field F_2 rules. We implement a complete suite of layout manipulation algorithms for composition, inversion, complement using built-in operations in ISL to ensure mathematical correctness and preserve layout semantics. Experimental evaluation shows that the system handles the full spectrum of layout complexity, from elementary identity transformations to sophisticated multi-dimensional tensor arrangements with complex stride configurations and swizzle patterns, validating the mathematical modeling approach across different layout paradigms.

2025-11-13T14:49:42Z Somashekaracharya G Bhaskaracharya Aravind Acharya Bastian Hagedorn Vinod Grover http://arxiv.org/abs/2511.09987v1 Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures 2025-11-13T05:44:01Z

We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which then get portably compiled to distributed topologies of interlinked processors. Our framework provides an input language of recurrences over logical tensors, which then gets lowered into an intermediate language of recurrences over logical iteration spaces, and finally into programs of send, receive, and computation operations specific to each individual processor. In Cyclotron's IR, programs are optimized such that external memory interactions are confined to the boundaries of the iteration space. Within inner iteration spaces, all data accesses become local: data accesses target values residing in local fast memory or on neighboring processing units, avoiding costly memory movement. We provide a scheduling language allowing users to define how data gets streamed and broadcasted between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR to a reconfigurable simulator of systolic arrays and chiplet style distributed hardware, as well as to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use our compiler for hardware design space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how to use recurrences and our scheduling language to express various matrix multiplication routines (Cannon, SUMMA, PUMMA, weight stationary) and solvers (Triangular solve and Cholesky). For matrix multiplication and the triangular solve, we generate distributed implementations competitive with ScaLAPACK.

2025-11-13T05:44:01Z Shiv Sundram Akhilesh Balasingam Nathan Zhang Kunle Olukotun Fredrik Kjolstad