Let's Take Esoteric Programming Languages Seriously

2025-09-01T13:20:49Z

Esoteric programming languages are challenging to learn, but their unusual features and constraints may serve to improve programming ability. From languages designed to be intentionally obtuse (e.g. INTERCAL) to others targeting artistic expression (e.g. Piet) or exploring the nature of computation (e.g. Fractan), there is rich variety in the realm of esoteric programming languages. This essay examines the counterintuitive appeal of esoteric languages and seeks to analyse reasons for this popularity. We will explore why people are attracted to esoteric languages in terms of (a) program comprehension and construction, as well as (b) language design and implementation. Our assertion is that esoteric languages can improve general PL awareness, at the same time as enabling the esoteric programmer to impress their peers with obscure knowledge. We will also consider pedagogic principles and the use of AI, in relation to esoteric languages. Emerging from the specific discussion, we identify a general set of 'good' reasons for designing new programming languages. It may not be possible to be exhaustive on this topic, and it is certain we have not achieved that goal here. However we believe our most important contribution is to draw attention to the varied and often implicit motivations involved in programming language design.

Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation

2025-08-31T22:12:47Z

Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance because of high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, that are essential for efficient low-precision computations. In this paper, we introduce Tilus, a domain-specific language designed for General-Purpose GPU (GPGPU) computing that supports low-precision data types with arbitrary bit widths from 1 to 8 while maintaining GPU programmability. Tilus features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. Tilus programs are compiled into highly efficient GPU programs through automatic vectorization and instruction selection. Extensive experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels. Compared to existing compilers such as Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, Tilus achieves performance improvements of: $1.75\times$, $2.61\times$, $1.29\times$ and $1.03\times$, respectively. We open-source Tilus at https://github.com/NVIDIA/tilus.

Formalizing Linear Motion G-code for Invariant Checking and Differential Testing of Fabrication Tools

2025-08-31T04:50:26Z

The computational fabrication pipeline for 3D printing is much like a compiler - users design models in Computer Aided Design (CAD) tools that are lowered to polygon meshes to be ultimately compiled to machine code by 3D slicers. For traditional compilers and programming languages, techniques for checking program invariants are well-established. Similarly, methods like differential testing are often used to uncover bugs in compilers themselves, which makes them more reliable. The fabrication pipeline would benefit from similar techniques but traditional approaches do not directly apply to the representations used in this domain. Unlike traditional programs, 3D models exist both as geometric objects as well as machine code that ultimately runs on the hardware. The machine code, like in traditional compiling, is affected by many factors like the model, the slicer being used, and numerous user-configurable parameters that control the slicing process. In this work, we propose a new algorithm for lifting G-code (a common language used in fabrication pipelines) by denoting a G-code program to a set of cuboids, and then defining an approximate point cloud representation for efficiently operating on these cuboids. Our algorithm opens up new opportunities: we show three use cases that demonstrate how it enables error localization in CAD models through invariant checking, quantitative comparisons between slicers, and evaluating the efficacy of mesh repair tools. We present a prototype implementation of our algorithm in a tool, GlitchFinder, and evaluate it on 58 real-world CAD models. Our results show that GlitchFinder is particularly effective in identifying slicing issues due to small features, can highlight differences in how popular slicers (Cura and PrusaSlicer) slice the same model, and can identify cases where mesh repair tools (MeshLab and Meshmixer) introduce new errors during repair.

Towards Repository-Level Program Verification with Large Language Models

2025-08-31T02:44:04Z

Recent advancements in large language models (LLMs) suggest great promises in code and proof generations. However, scaling automated formal verification to real-world projects requires resolving cross-module dependencies and global contexts, which are crucial challenges overlooked by existing LLM-based methods with a special focus on targeting isolated, function-level verification tasks. To systematically explore and address the significant challenges of verifying entire software repositories, we introduce RVBench, the first verification benchmark explicitly designed for repository-level evaluation, constructed from four diverse and complex open-source Verus projects. We further introduce RagVerus, an extensible framework that synergizes retrieval-augmented generation with context-aware prompting to automate proof synthesis for multi-module repositories. RagVerus triples proof pass rates on existing benchmarks under constrained model inference budgets, and achieves a 27% relative improvement on the more challenging RVBench benchmark, demonstrating a scalable and sample-efficient verification solution.

Abstract Interpretation of Temporal Safety Effects of Higher Order Programs

2025-08-30T19:22:53Z

This paper describes a new abstract interpretation-based approach to verify temporal safety properties of recursive, higher-order programs. While prior works have provided theoretical impact and some automation, they have had limited scalability. We begin with a new automata-based "abstract effect domain" for summarizing context-sensitive dependent effects, capable of abstracting relations between the program environment and the automaton control state. Our analysis includes a new transformer for abstracting event prefixes to automatically computed context-sensitive effect summaries, and is instantiated in a type-and-effect system grounded in abstract interpretation. Since the analysis is parametric on the automaton, we next instantiate it to a broader class of history/register (or "accumulator") automata, beyond finite state automata to express some context-free properties, input-dependency, event summation, resource usage, cost, equal event magnitude, etc. We implemented a prototype evDrift that computes dependent effect summaries (and validates assertions) for OCaml-like recursive higher-order programs. As a basis of comparison, we describe reductions to assertion checking for higher-order but effect-free programs, and demonstrate that our approach outperforms prior tools Drift, RCaml/Spacer, MoCHi, and ReTHFL. Overall, across a set of 23 benchmarks, Drift verified 12 benchmarks, RCaml/Spacer verified 6, MoCHi verified 11, ReTHFL verified 18, and evDrift verified 21; evDrift also achieved a 6.3x, 5.3x, 16.8x, and 6.4x speedup over Drift, RCaml/Spacer, MoCHi, and ReTHFL, respectively, on those benchmarks that both tools could solve.

APRIL: API Synthesis with Automatic Prompt Optimization and Reinforcement Learning

2025-08-29T19:48:09Z

APIs are central to modern software development, yet composing new APIs from large libraries is difficult due to the exponential search space; traditional component-based synthesis relies on costly exploration and hand-crafted specifications. While large language models (LLMs) can generate implementations from natural language, hallucinations and limited access to up-to-date contextual information often yield incorrect code. In this paper, we present APRIL, an approach that combines LLM-based synthesis with Automatic Prompt Optimization (APO) and Reinforcement Learning from Verifiable Rewards (RLVR): APO iteratively refines prompts for a frozen model, while RLVR fine-tunes the policy toward functional correctness, producing an efficient synthesis pipeline. Evaluated on 81 real-world APIs from widely used scientific Python libraries and benchmarked against instruction-tuned but unfine-tuned LLMs guided by expert prompts, APRIL achieves substantial improvements. These results indicate that integrating APO and RLVR provides a robust, scalable path for component-based API synthesis in large libraries.

Complete the Cycle: Reachability Types with Expressive Cyclic References (Extended Version)

2025-08-29T16:11:35Z

Local reasoning about programs that combine aliasing and mutable state is a longstanding challenge. Existing approaches -- ownership systems, linear and affine types, uniqueness types, and lexical effect tracking -- impose global restrictions such as uniqueness or linearity, or rely on shallow syntactic analyses. These designs fall short with higher-order functions and shared mutable state. Reachability Types (RT) track aliasing and separation in higher-order programs, ensuring runtime safety and non-interference. However, RT systems face three key limitations: (1) they prohibit cyclic references, ruling out non-terminating computations and fixed-point combinators; (2) they require deep tracking, where a qualifier must include all transitively reachable locations, reducing precision and hindering optimizations like fine-grained parallelism; and (3) referent qualifier invariance prevents referents from escaping their allocation contexts, making reference factories inexpressible. In this work, we address these limitations by extending RT with three mechanisms that enhance expressiveness. First, we introduce cyclic references, enabling recursive patterns to be encoded directly through the store. Second, we adopt shallow qualifier tracking, decoupling references from their transitively reachable values. Finally, we introduce an escaping rule with reference subtyping, allowing referent qualifiers to outlive their allocation context. These extensions are formalized in the $\mathsf{F}_{<:}^{\circ}$-calculus with a mechanized proof of type soundness, and case studies illustrate expressiveness through fixpoint combinators, non-interfering parallelism, and escaping read-only references.

QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

2025-08-29T07:06:23Z

Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, approximately generate similar results as the standard prompting, particularly in complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.

CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

2025-08-28T23:00:08Z

We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, and Mojo, with Slang support in development. Our system consists of: language-specific lexers/parsers converting source code to ASTs, bidirectional CrossGL translation modules implementing ToCrossGLConverter classes for importing code and CodeGen classes for target generation, and comprehensive backend implementations handling full translation pipelines. We demonstrate effectiveness through comprehensive evaluation across programming domains, achieving successful compilation and execution across all supported backends. The universal IR design enables adding new languages with minimal effort, requiring only language-specific frontend/backend components. Our contributions include: (1) a unified IR capturing semantics of multiple programming paradigms, (2) a modular architecture enabling extensibility, (3) a comprehensive framework supporting GPU compute, graphics programming, and systems languages, and (4) empirical validation demonstrating practical viability of universal code translation. CrossTL represents a significant step toward language-agnostic programming, enabling write-once, deploy-everywhere development.

The WASM Cloak: Evaluating Browser Fingerprinting Defenses Under WebAssembly based Obfuscation

2025-08-28T21:21:51Z

Browser fingerprinting defenses have historically focused on detecting JavaScript(JS)-based tracking techniques. However, the widespread adoption of WebAssembly (WASM) introduces a potential blind spot, as adversaries can convert JS to WASM's low-level binary format to obfuscate malicious logic. This paper presents the first systematic evaluation of how such WASM-based obfuscation impacts the robustness of modern fingerprinting defenses. We develop an automated pipeline that translates real-world JS fingerprinting scripts into functional WASM-obfuscated variants and test them against two classes of defenses: state-of-the-art detectors in research literature and commercial, in-browser tools. Our findings reveal a notable divergence: detectors proposed in the research literature that rely on feature-based analysis of source code show moderate vulnerability, stemming from outdated datasets or a lack of WASM compatibility. In contrast, defenses such as browser extensions and native browser features remained completely effective, as their API-level interception is agnostic to the script's underlying implementation. These results highlight a gap between academic and practical defense strategies and offer insights into strengthening detection approaches against WASM-based obfuscation, while also revealing opportunities for more evasive techniques in future attacks.

Active Learning for Neurosymbolic Program Synthesis

2025-08-28T19:15:55Z

The goal of active learning for program synthesis is to synthesize the desired program by asking targeted questions that minimize user interaction. While prior work has explored active learning in the purely symbolic setting, such techniques are inadequate for the increasingly popular paradigm of neurosymbolic program synthesis, where the synthesized program incorporates neural components. When applied to the neurosymbolic setting, such techniques can -- and, in practice, do -- return an unintended program due to mispredictions of neural components. This paper proposes a new active learning technique that can handle the unique challenges posed by neural network mispredictions. Our approach is based upon a new evaluation strategy called constrained conformal evaluation (CCE), which accounts for neural mispredictions while taking into account user-provided feedback. Our proposed method iteratively makes CCE more precise until all remaining programs are guaranteed to be observationally equivalent. We have implemented this method in a tool called SmartLabel and experimentally evaluated it on three neurosymbolic domains. Our results demonstrate that SmartLabel identifies the ground truth program for 98% of the benchmarks, requiring under 5 rounds of user interaction on average. In contrast, prior techniques for active learning are only able to converge to the ground truth program for at most 65% of the benchmarks.

Program Semantic Inequivalence Game with Large Language Models

2025-08-28T16:38:13Z

Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. Finding training examples to teach LLMs to solve these tasks can be challenging. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ: a generator agent creates program variants that are semantically distinct, derived from a dataset of real-world programming tasks, while an evaluator agent has to identify input examples that cause the original programs and the generated variants to diverge in their behaviour, with the agents training each other semi-adversarially. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources. We evaluated our approach on multiple code generation and understanding benchmarks, including cross-language vulnerability detection (Lu et al., 2021), where our method improves vulnerability detection in C/C++ code despite being trained exclusively on Python code, and the challenging Python builtin identifier swap benchmark (Miceli-Barone et al., 2023), showing that whereas modern LLMs still struggle with this benchmark, our approach yields substantial improvements. We release the code needed to replicate the experiments, as well as the generated synthetic data, which can be used to fine-tune LLMs.

Static Factorisation of Probabilistic Programs With User-Labelled Sample Statements and While Loops

2025-08-28T15:51:16Z

It is commonly known that any Bayesian network can be implemented as a probabilistic program, but the reverse direction is not so clear. In this work, we address the open question to what extent a probabilistic program with user-labelled sample statements and while loops - features found in languages like Gen, Turing, and Pyro - can be represented graphically. To this end, we extend existing operational semantics to support these language features. By translating a program to its control-flow graph, we define a sound static analysis that approximates the dependency structure of the random variables in the program. As a result, we obtain a static factorisation of the implicitly defined program density, which is equivalent to the known Bayesian network factorisation for programs without loops and constant labels, but constitutes a novel graphical representation for programs that define an unbounded number of random variables via loops or dynamic labels. We further develop a sound program slicing technique to leverage this structure to statically enable three well-known optimisations for the considered program class: we reduce the variance of gradient estimates in variational inference and we speed up both single-site Metropolis Hastings and sequential Monte Carlo. These optimisations are proven correct and empirically shown to match or outperform existing techniques.

Intersection Types for a Computational Lambda-Calculus with Global State

2025-08-27T16:27:14Z

We study the semantics of an untyped lambda-calculus equipped with operators representing read and write operations from and to a global store. We adopt the monadic approach to model side-effects and treat read and write as algebraic operations over a monad. We introduce operational and denotational semantics and a type assignment system of intersection types and prove that types are invariant under the reduction and expansion of term and state configurations. Finally, we characterize convergent terms via their typings.

Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM (Extended Version)

2025-08-27T13:11:55Z

Concurrency bugs are hard to discover and reproduce. Prior work has developed sophisticated algorithms to search for concurrency bugs, such as partial order sampling (POS); however, fundamental limitations with existing platforms for concurrency control hinder effective testing of real-world software. We observe that the design space for concurrency control on managed code involves complex trade-offs between expressibility, applicability, and maintainability on the one hand, and bug-finding efficiency on the other hand. This paper presents Fray, a platform for performing push-button concurrency testing of data-race-free JVM programs. The key insight behind Fray is that effective controlled concurrency testing requires orchestrating thread interleavings without replacing existing concurrency primitives, while encoding their semantics for faithfully expressing the set of all possible program behaviors. Fray incorporates a novel concurrency control mechanism called shadow locking, designed to make controlled concurrency testing practical and efficient for JVM programs. In an empirical evaluation on 53 benchmark programs with known bugs (SCTBench and JaConTeBe), Fray with random search finds 70% more bugs than JPF and 77% more bugs than RR's chaos mode. We also demonstrate Fray's push-button applicability on 2,655 tests from Apache Kafka, Lucene, and Google Guava. In these mature projects, Fray successfully discovered 18 real-world concurrency bugs that can cause 363 tests to fail reproducibly.