https://arxiv.org/api/dAkNGxs1aSLtkRK7yzAIUg4CxhY
2026-06-23T10:22:40Z
9934
960
15
http://arxiv.org/abs/2511.00865v4
FlowLog: Efficient and Extensible Datalog via Incrementality
2025-11-17T03:56:03Z
Datalog-based languages are regaining popularity as a powerful abstraction for expressing recursive computations in domains such as program analysis and graph processing. However, existing systems often face a trade-off between efficiency and extensibility. Engines like Souffle achieve high efficiency through domain-specific designs, but lack general-purpose flexibility. Others, like RecStep, offer modularity by layering Datalog on traditional databases, but struggle to integrate Datalog-specific optimizations.
This paper bridges this gap by presenting FlowLog, a new Datalog engine that uses an explicit relational IR per-rule to cleanly separate recursive control (e.g., semi-naive execution) from each rule's logical plan. This boundary lets us retain fine-grained, Datalog-aware optimizations at the logical layer, but also reuse off-the-shelf database primitives at execution. At the logical level (i.e. IR), we apply proven SQL optimizations, such as logic fusion and subplan reuse. To address high volatility in recursive workloads, we adopt a robustness-first approach that pairs a structural optimizer (avoiding worst-case joins) with sideways information passing (early filtering). Built atop Differential Dataflow--a mature framework for streaming analytics--FlowLog supports both batch and incremental Datalog and adds novel recursion-aware optimizations called Boolean (or algebraic) specialization. Our evaluation shows that FlowLog outperforms state-of-the-art Datalog engines and modern databases across a broad range of recursive workloads, achieving superior scalability while preserving a simple and extensible architecture.
2025-11-02T09:11:17Z
Accepted to VLDB 2026
Hangdong Zhao
Zhenghong Yu
Srinag Rao
Simon Frisk
Zhiwei Fan
Paraschos Koutris
http://arxiv.org/abs/2511.12823v1
Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter
2025-11-16T23:05:11Z
Over the past few years, improving LLM code generation capabilities has been a key focus in NLP research. Despite Bengali having 242 million native speakers worldwide, it receives little attention when it comes to training LLMs. More recently, various fine-tuning and augmented generation techniques have been employed to significantly enhance code generation performance. However, they require considerable expertise and resources to utilize effectively as an end user. The goal of our work is to democratize access to powerful code generation tools in resource-constrained emerging markets, enabling users to leverage them in their native language.
We introduce a novel approach that combines Test-Driven Development (TDD) and Code Interpreter (CI), utilizing open-weight models, which improves the baseline accuracy for code generation with Bengali prompts and achieves an overall accuracy of 85%. Our approach requires no finetuning and proves that even the smallest models in the same family can attain up to 98% accuracy compared to the largest models. All of our results are publicly shared in GitHub for validation and reproducibility.
2025-11-16T23:05:11Z
AACL-IJCNLP 2025 Workshop BLP Shared Task 2, 6 pages, 7 figures, 3 tables
Sajed Jalil
Shuvo Saha
Hossain Mohammad Seym
http://arxiv.org/abs/2508.14394v2
Tuning Random Generators: Property-Based Testing as Probabilistic Programming
2025-11-16T03:12:55Z
Property-based testing validates software against an executable specification by evaluating it on randomly generated inputs. The standard way that PBT users generate test inputs is via generators that describe how to sample test inputs through random choices. To achieve a good distribution over test inputs, users must tune their generators, i.e., decide on the weights of these individual random choices. Unfortunately, it is very difficult to understand how to choose individual generator weights in order to achieve a desired distribution, so today this process is tedious and limits the distributions that can be practically achieved.
In this paper, we develop techniques for the automatic and offline tuning of generators. Given a generator with undetermined symbolic weights and an objective function, our approach automatically learns values for these weights that optimize for the objective. We describe useful objective functions that allow users to (1) target desired distributions and (2) improve the diversity and validity of their test cases. We have implemented our approach in a novel discrete probabilistic programming system, Loaded Dice, that supports differentiation and parameter learning, and use it as a language for generators. We empirically demonstrate that our approach is effective at optimizing generator distributions according to the specified objective functions. We also perform a thorough evaluation on PBT benchmarks, demonstrating that, when automatically tuned for diversity and validity, the generators exhibit a 3.1-7.4x speedup in bug finding.
2025-08-20T03:45:13Z
Extended version of OOPSLA '25 paper
Ryan Tjoa
Poorva Garg
Harrison Goldstein
Todd Millstein
Benjamin Pierce
Guy Van den Broeck
10.1145/3763082
http://arxiv.org/abs/2511.11939v1
Modular GPU Programming with Typed Perspectives
2025-11-14T23:28:15Z
To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to perform collective operations like Tensor Core instructions. The tension between these two mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite being called per-thread, must be executed cooperatively by groups of threads.
In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers the low-level control over collective operations necessary for high performance. Our core idea is typed perspectives, which materialize, at the type level, the granularity at which the programmer is controlling the behavior of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers programmers the safety guarantees needed to confidently write modular code without sacrificing performance.
2025-11-14T23:28:15Z
Manya Bansal
Daniel Sainati
Joseph W. Cutler
Saman Amarasinghe
Jonathan Ragan-Kelley
http://arxiv.org/abs/2511.13769v1
Compiling to linear neurons
2025-11-14T22:19:50Z
We don't program neural networks directly. Instead, we rely on an indirect style where learning algorithms, like gradient descent, determine a neural network's function by learning from data. This indirect style is often a virtue; it empowers us to solve problems that were previously impossible. But it lacks discrete structure. We can't compile most algorithms into a neural network -- even if these algorithms could help the network learn. This limitation occurs because discrete algorithms are not obviously differentiable, making them incompatible with the gradient-based learning algorithms that determine a neural network's function. To address this, we introduce $\textsf{Cajal}$: a typed, higher-order and linear programming language intended to be a minimal vehicle for exploring a direct style of programming neural networks. We prove $\textsf{Cajal}$ programs compile to linear neurons, allowing discrete algorithms to be expressed in a differentiable form compatible with gradient-based learning. With our implementation of $\textsf{Cajal}$, we conduct several experiments where we link these linear neurons against other neural networks to determine part of their function prior to learning. Linking with these neurons allows networks to learn faster, with greater data-efficiency, and in a way that's easier to debug. A key lesson is that linear programming languages provide a path towards directly programming neural networks, enabling a rich interplay between learning and the discrete structures of ordinary programming.
2025-11-14T22:19:50Z
Joey Velez-Ginorio
Nada Amin
Konrad Kording
Steve Zdancewic
http://arxiv.org/abs/2511.11445v1
AI as a component in the action research tradition of learning-by-doing
2025-11-14T16:14:57Z
We consider learning mathematics through action research, hacking, discovery, inquiry, learning-by-doing as opposed to the instruct and perform, industrial model of the 19th century. A learning model based on self-awareness, types, functions, structured drawing and formal diagrams addresses the weaknesses of drill and practice and the pitfalls of statistical prediction with Large Language Models.
In other words, we build mathematics/informatics education on the activity of a professional mathematician in mathematical modelling and designing programs. This tradition emphasises the role of dialogue and doing mathematics. In the Language/Action approach the teacher designs mathematising situations that scaffold previously encountered, or not-known-how-to-solve problems for the learner while teachers and teacher/interlocutors supervise the process.
A critical feature is the written-oral dialogue between the learner and the teacher. As a rule, this is 1 to 1 communication. The role of the teacher/interlocutor, a more knowledgeable other, is mostly performed by a more senior student, 1 per 5 to 7 pupils. After Doug Engelbart we propose the metaphor of human intellect augmented by digital technologies such as interactive development environments or AI. Every human has their bio and digital parts. The bio part of the learner reacts to their work through dialogue in the mind. The digital part poses questions, interprets code and proposes not necessarily sound ideas.
2025-11-14T16:14:57Z
14 pages, 2 figures
Ian Benson
Alexei Semenov
http://arxiv.org/abs/2511.13764v1
Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels
2025-11-14T14:32:28Z
The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise-experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS)-both of which add complexity and limit scalability for most ML practitioners.
This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging the MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile based CPU instructions. Experiments show that the generated nanokernels are of production-quality, and competitive with state-of-the-art microkernel libraries.
2025-11-14T14:32:28Z
Arun Thangamani
Md Asghar Ahmad Shahid
Adam Siemieniuk
Rolf Morel
Renato Golin
Alexander Heinecke
http://arxiv.org/abs/2511.11292v1
The Jasmin Compiler Preserves Cryptographic Security
2025-11-14T13:26:36Z
Jasmin is a programming and verification framework for developing efficient, formally verified, cryptographic implementations. A main component of the framework is the Jasmin compiler, which empowers programmers to write efficient implementations of state-of-the-art cryptographic primitives, including post-quantum cryptographic standards. The Jasmin compiler is proven functionally correct in the Rocq prover. However, this functional correctness statement does not apply to nonterminating or probabilistic computations, which are essential features in cryptography.
In this paper, we significantly enhance the guarantees of the compiler by showing, in the Rocq prover, that its front-end (25 out of 30 passes) preserves cryptographic security. To this end, we first define a Relational Hoare Logic tailored for compiler correctness proofs. We prove the soundness of our logic w.r.t. a new denotational semantics of Jasmin programs based on interaction trees. Secondly, we use our program logic to prove the functional correctness of the (unmodified) Jasmin compiler w.r.t. said semantics. Lastly, we formalize cryptographic security -- focusing on IND-CCA -- with interaction trees and prove that the Jasmin compiler preserves cryptographic security.
2025-11-14T13:26:36Z
Santiago Arranz-Olmos
Gilles Barthe
Lionel Blatter
Benjamin Grégoire
Vincent Laporte
Paolo Torrini
http://arxiv.org/abs/2511.11264v1
Kleene Algebra
2025-11-14T12:57:16Z
This booklet serves as an introduction to Kleene Algebra (KA), a set of laws that can be used to study general equivalences between programs. It discusses how general programs can be modeled using regular expressions, how those expressions correspond to automata, and how this correspondence can be exploited to obtain the central result of KA, namely that an equivalence of regular expressions is true if and only if it can be proved using the laws of KA. Each chapter closes with a set of exercises to further build intuition and understanding, and there is an optional chapter that develops automata theory through the lens of coalgebra.
2025-11-14T12:57:16Z
Tobias Kappé
Alexandra Silva
Jana Wagemaker
http://arxiv.org/abs/2507.17087v2
Mapple: A Domain-Specific Language for Mapping Distributed Programs
2025-11-14T09:18:18Z
Optimizing parallel programs for distributed systems is a complex task, often requiring significant code modifications. Task-based programming systems improve modularity by separating performance decisions from application logic, but their mapping interfaces are low-level. We introduce Mapple, a high-level, declarative programming interface for mapping distributed applications. Mapple provides transformation primitives to resolve dimensionality mismatches between task and processor spaces, including a key primitive, decompose, that helps minimize communication volume. We implement Mapple on top of the Legion runtime by translating Mapple mappers into its low-level C++ interface. Across nine applications, including six matrix multiplication algorithms and three scientific computing workloads, Mapple reduces mapper code size by 14x and enables performance improvements of up to 1.34x over expert-written C++ mappers. In addition, the decompose primitive achieves up to 1.83x improvement over existing dimensionality-resolution heuristics.
2025-07-23T00:01:07Z
Anjiang Wei
Rohan Yadav
Hang Song
Wonchan Lee
Ke Wang
Alex Aiken
http://arxiv.org/abs/2511.11070v1
Optimising Density Computations in Probabilistic Programs via Automatic Loop Vectorisation
2025-11-14T08:41:53Z
Probabilistic programming languages (PPLs) are a popular tool for high-level modelling across many fields. They provide a range of algorithms for probabilistic inference, which analyse models by learning their parameters from a dataset or estimating their posterior distributions. However, probabilistic inference is known to be very costly. One of the bottlenecks of probabilistic inference stems from the iteration over entries of a large dataset or a long series of random samples. Vectorisation can mitigate this cost, but manual vectorisation is error-prone, and existing automatic techniques are often ad-hoc and limited, unable to handle general repetition structures, such as nested loops and loops with data-dependent control flow, without significant user intervention. To address this bottleneck, we propose a sound and effective method for automatically vectorising loops in probabilistic programs. Our method achieves high throughput using speculative parallel execution of loop iterations, while preserving the semantics of the original loop through a fixed-point check. We formalise our method as a translation from an imperative PPL into a lower-level target language with primitives geared towards vectorisation. We implemented our method for the Pyro PPL and evaluated it on a range of probabilistic models. Our experiments show significant performance gains against an existing vectorisation baseline, achieving $1.1$--$6\times$ speedups and reducing GPU memory usage in many cases. Unlike the baseline, which is limited to a subset of models, our method effectively handled all the tested models.
2025-11-14T08:41:53Z
70 pages, 19 figures, the first two authors contributed equally to this work, accepted at POPL'26
Sangho Lim
Hyoungjin Lim
Wonyeol Lee
Xavier Rival
Hongseok Yang
http://arxiv.org/abs/2511.13751v1
Inside VOLT: Designing an Open-Source GPU Compiler
2025-11-13T23:12:00Z
Recent efforts in open-source GPU research are opening new avenues in a domain that has long been tightly coupled with a few commercial vendors. Emerging open GPU architectures define SIMT functionality through their own ISAs, but executing existing GPU programs and optimizing performance on these ISAs relies on a compiler framework that is technically complex and often undercounted in open hardware development costs.
To address this challenge, the Vortex-Optimized Lightweight Toolchain (VOLT) has been proposed. This paper presents its design principles, overall structure, and the key compiler transformations required to support SIMT execution on Vortex. VOLT enables SIMT code generation and optimization across multiple levels of abstraction through a hierarchical design that accommodates diverse front-end languages and open GPU hardware. To ensure extensibility as GPU architectures evolve, VOLT centralizes fundamental SIMT-related analyses and optimizations in the middle-end, allowing them to be reused across front-ends and easily adapted to emerging open-GPU variants. Through two case studies on ISA extensions and host-runtime API, this paper also demonstrates how VOLT can support extensions
2025-11-13T23:12:00Z
11 pages, 10 figures, two tables, two algorithms
Shinnung Jeong
Chihyo Ahn
Huanzhi Pu
Jisheng Zhao
Hyesoon Kim
Blaise Tine
http://arxiv.org/abs/2511.10565v1
zkStruDul: Programming zkSNARKs with Structural Duality
2025-11-13T18:06:21Z
Non-Interactive Zero Knowledge (NIZK) proofs, such as zkSNARKS, let one prove knowledge of private data without revealing it or interacting with a verifier. While existing tooling focuses on specifying the predicate to be proven, real-world applications optimize predicate definitions to minimize proof generation overhead, but must correspondingly transform predicate inputs. Implementing these two steps separately duplicates logic that must precisely match to avoid catastrophic security flaws. We address this shortcoming with zkStruDul, a language that unifies input transformations and predicate definitions into a single combined abstraction from which a compiler can project both procedures, eliminating duplicate code and problematic mismatches. zkStruDul provides a high-level abstraction to layer on top of existing NIZK technology and supports important features like recursive proofs. We provide a source-level semantics and prove its behavior is identical to the projected semantics, allowing straightforward standard reasoning.
2025-11-13T18:06:21Z
Rahul Krishnan
Ashley Samuelson
Emily Yao
Ethan Cecchetti
http://arxiv.org/abs/2511.10374v1
Modeling Layout Abstractions Using Integer Set Relations
2025-11-13T14:49:42Z
Modern deep learning compilers rely on layout abstractions to manage the complex mapping between logical tensor structures and physical memory arrangements. CuTe layouts and Triton linear layouts are widely adopted industry standards. However, these layout systems operate independently with distinct mathematical underpinnings, preventing unified formal analysis and cross-system reasoning. We bridge this gap by introducing a novel approach that leverages the Integer Set Library (ISL) to create a unified mathematical representation for both layout systems through integer set relations, thereby enabling rigorous formal analysis, correctness verification, and the foundation for future cross-system optimization strategies. Our approach models CuTe layouts through integer set relations that encode the transformation from multi-dimensional coordinates to linear indices using stride-based calculations, including sophisticated swizzle operations that perform bit-level manipulations for enhanced memory access patterns. For Triton linear layouts, we construct integer set relations that model the binary vector space transformations where arithmetic operations follow finite field F_2 rules. We implement a complete suite of layout manipulation algorithms for composition, inversion, complement using built-in operations in ISL to ensure mathematical correctness and preserve layout semantics. Experimental evaluation shows that the system handles the full spectrum of layout complexity, from elementary identity transformations to sophisticated multi-dimensional tensor arrangements with complex stride configurations and swizzle patterns, validating the mathematical modeling approach across different layout paradigms.
2025-11-13T14:49:42Z
Somashekaracharya G Bhaskaracharya
Aravind Acharya
Bastian Hagedorn
Vinod Grover
http://arxiv.org/abs/2511.09987v1
Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures
2025-11-13T05:44:01Z
We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which then get portably compiled to distributed topologies of interlinked processors. Our framework provides an input language of recurrences over logical tensors, which then gets lowered into an intermediate language of recurrences over logical iteration spaces, and finally into programs of send, receive, and computation operations specific to each individual processor. In Cyclotron's IR, programs are optimized such that external memory interactions are confined to the boundaries of the iteration space. Within inner iteration spaces, all data accesses become local: data accesses target values residing in local fast memory or on neighboring processing units, avoiding costly memory movement. We provide a scheduling language allowing users to define how data gets streamed and broadcasted between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR to a reconfigurable simulator of systolic arrays and chiplet style distributed hardware, as well as to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use our compiler for hardware design space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how to use recurrences and our scheduling language to express various matrix multiplication routines (Cannon, SUMMA, PUMMA, weight stationary) and solvers (Triangular solve and Cholesky). For matrix multiplication and the triangular solve, we generate distributed implementations competitive with ScaLAPACK.
2025-11-13T05:44:01Z
Shiv Sundram
Akhilesh Balasingam
Nathan Zhang
Kunle Olukotun
Fredrik Kjolstad