https://arxiv.org/api/2VruUF0vNPRfjNpWnKQaGMiAQnM2026-06-15T02:29:41Z988855515http://arxiv.org/abs/2602.24054v2Speak Now: Safe Actor Programming with Multiparty Session Types (Extended Version)2026-03-03T09:41:00ZActor languages such as Erlang and Elixir are widely used for implementing scalable and reliable distributed applications, but the informally-specified nature of actor communication patterns leaves systems vulnerable to costly errors such as communication mismatches and deadlocks. Multiparty session types (MPSTs) rule out communication errors early in the development process, but until now, the many-sender, single-receiver nature of actor communication has made it difficult for actor languages to benefit from session types.
This paper introduces Maty, the first actor language design supporting both static multiparty session typing and the full power of actors taking part in multiple sessions. Maty therefore combines the error prevention mechanism of session types with the scalability and fault tolerance of actor languages. Our main insight is to enforce session typing through a flow-sensitive effect system, combined with an event-driven programming style and first-class message handlers. Using MPSTs allows us to guarantee communication safety: a process will never send or receive an unexpected message, nor will a session get stuck because an actor is waiting for a message that will never be sent. We extend Maty to support Erlang-style supervision and cascading failure, and show that this preserves Maty's strong metatheory. We implement Maty in Scala using an API generation approach, and demonstrate the expressiveness of our model by implementing a representative sample of the widely-used Savina actor benchmark suite; an industry-supplied factory scenario; and a chat server.2026-02-27T14:43:01ZExtended version of paper accepted at OOPSLA'26Simon FowlerRaymond Huhttp://arxiv.org/abs/2603.02637v1StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning2026-03-03T06:04:49ZModern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment.
To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness check and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric reward and rule-based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cublas epilogue), and we also effectively prevent Coder's reward hacking (e.g., just copy PyTorch code or hardcoding output) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x than the RL model baselines.2026-03-03T06:04:49ZShiyang LiZijian ZhangWinson ChenYuebo LuoMingyi HongCaiwen Dinghttp://arxiv.org/abs/2511.20099v4QiMeng-CRUX: Narrowing the Gap Between Natural Language and Verilog via Core Refined Understanding eXpression for Circuit Design2026-03-03T03:43:23ZLarge language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.2025-11-25T09:17:32ZAccepted by the AAAI26 Conference Main TrackLei HuangRui ZhangJiaming GuoYang ZhangDi HuangShuyao ChengPengwei JinChongxiao LiZidong DuXing HuYunji ChenQi Guohttp://arxiv.org/abs/2603.02405v1Highly Incremental: A Simple Programmatic Approach for Many Objectives (Extended Version)2026-03-02T21:31:00ZWe present a one-fits-all programmatic approach to reason about a plethora of objectives on probabilistic programs. The first ingredient is to add a reward-statement to the language. We then define a program transformation applying a monotone function to the cumulative reward of the program. The key idea is that this transformation uses incremental differences in the reward. This simple, elegant approach enables to express e.g., higher moments, threshold probabilities of rewards, the expected excess over a budget, and moment-generating functions. All these objectives can now be analyzed using a single existing approach: probabilistic wp-reasoning. We automated verification using the Caesar deductive verifier and report on the application of the transformation to some examples.2026-03-02T21:31:00ZPhilipp SchröerJoost-Pieter Katoenhttp://arxiv.org/abs/2603.02298v1CuTe Layout Representation and Algebra2026-03-02T18:31:12ZModern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations.
In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions.
CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.2026-03-02T18:31:12ZCris Ceckahttp://arxiv.org/abs/2604.09589v1Complexity of Consistency Testing for the Release-Acquire Semantics2026-03-02T17:18:31ZIn a seminal work, Gibbons and Korach studied the complexity of deciding whether an observed sequence of reads and writes of a multi-threaded program admits a sequentially consistent interleaving. They showed the problem to be NP-hard even under strong syntactic restrictions. More recently, Chakraborty et al. considered the problem for weak memory models and proved that NP-hardness remains even when the number of threads, the number of memory locations, and the value domain are all bounded.
In this paper we revisit the problem for the release-acquire variants of the C11 memory model. Our main positive result is that consistency testing can be done in polynomial-time when each memory location is written by at most one thread (multiple readers are allowed). Notably, this restriction is already NP-hard for sequential consistency. We complement this upper bound with tight hardness results: the problem is NP-hard when two threads may write to the same location, and allowing three writers per location rules out 2^{o(k)}.n^{O(1)} algorithms under the Exponential Time Hypothesis, where k denotes the number of threads, and n the number of memory operations.2026-03-02T17:18:31ZA shorter version has been accepte at FM 2026 - the 27th International Symposium on Formal MethodsR. GovindS. KrishnaSanchari SilB. Srivathsanhttp://arxiv.org/abs/2603.01889v1Bentō: Optimizing Persistent Memory Programs2026-03-02T14:09:14ZPersistent Memory (PM) is a new storage technology thatbrings high performance, byte addressability, and persistency for a lesser cost than DRAM. Due to cache volatility and store reordering, developers must use explicit instructions (e.g.: flush and fence) to guarantee that the application state remains consistent upon crashes. This is difficult to get right and, in fact, several tools have been created to detect bugs in PM programs. To overcome this difficulty, programmers tend to be overly conservative, for instance, by enforcing unnecessary ordering constraints, which partially forfeits the performance benefits of using PM. In this paper, we study the impact that different combinations of persistency instructions have in several PM programs and found that a specific combination can lead to performance improvements while preserving the original crash-consistency semantics. Based on these results we developed Bentō an automatic and black-box binary rewriter that can boost the performance of existing PM programs by up to 15% with minimal programmer effort.2026-03-02T14:09:14ZSebastião AmaroJoão GonçalvesMiguel Matoshttp://arxiv.org/abs/2603.02260v1Handling Exceptions and Effects with Automatic Resource Analysis2026-02-28T05:05:57ZThere exist many techniques for automatically deriving parametric resource (or cost) bounds by analyzing the source code of a program. These techniques work effectively for a large class of programs and language features. However, non-local transfer of control as needed for exception or effect handlers has remained a challenge.
This paper presents the first automatic resource bound analysis that supports non-local control transfer between exceptions or effects and their handlers. The analysis is an extension of type-based automatic amortized resource analysis (AARA), which automates the potential method of amortized analysis. It is presented for a simple functional language with lists and linear potential functions. However, the ideas are directly applicable to richer settings and implemented for Standard ML and polynomial potential functions.
Apart from the new type system for exceptions and effects, a main contribution is a novel syntactic type-soundness theorem that establishes the correctness of the derived bounds with respect to a stack-based abstract machine. An experimental evaluation shows that the new analysis is capable of analyzing programs that cannot be analyzed by existing methods and that the efficiency overhead of supporting exception and effect handlers is low.2026-02-28T05:05:57ZEthan ChuYiyang GuoJan Hoffmann10.1145/3798207http://arxiv.org/abs/2603.00378v1OBASE: Object-Based Address-Space Engineering to Improve Memory Tiering2026-02-27T23:35:24ZHardware and OS mechanisms for memory tiering are widely deployed, yet datacenters still overprovision DRAM. The root cause is hotness fragmentation: allocators place objects by size rather than access pattern, so hot and cold objects become interleaved within the same pages. A single hot object marks its page as active, trapping surrounding cold data in expensive DRAM. Our analysis of Google production workloads shows that up to 97% of the bytes in active pages are cold and unreclaimable. We propose address-space engineering: dynamically reorganizing virtual memory so that hot objects cluster into uniformly hot pages and cold objects into uniformly cold pages. We present OBASE, a compiler-runtime system for unmanaged languages that serves as an object-aware frontend for page-aware OS backends. OBASE tracks accesses via lightweight pointer instrumentation and migrates objects at runtime using a lock-free protocol that is safe under concurrency. By reorganizing the address space, OBASE enables unmodified backends (kswapd, TMO, TPP, Memtis) to tier memory effectively. Across ten concurrent data structures, six backends, and production traces from Meta and Twitter, OBASE improves page utilization by 2-4x and reduces memory footprint by up to 70%, with only 2-5% overhead.2026-02-27T23:35:24ZVinay BanakarSuli YangKan WuAndrea C. Arpaci-DusseauRemzi H. Arpaci-DusseauKimberly Keetonhttp://arxiv.org/abs/2510.10066v2OBsmith: LLM-Powered JavaScript Obfuscator Testing2026-02-27T22:39:40ZJavaScript obfuscators are widely deployed to protect intellectual property and resist reverse engineering, yet their correctness has been largely overlooked compared to performance and resilience. Existing evaluations typically measure resistance to deobfuscation, leaving the critical question of whether obfuscators preserve program semantics unanswered. Incorrect transformations can silently alter functionality, compromise reliability, and erode security-undermining the very purpose of obfuscation. To address this gap, we present OBsmith, a novel framework to systematically test JavaScript obfuscators using large language models (LLMs). OBsmith leverages LLMs to generate program sketches abstract templates capturing diverse language constructs, idioms, and corner cases-which are instantiated into executable programs and subjected to obfuscation under different configurations. Besides LLM-powered sketching, OBsmith also employs a second source: automatic extraction of sketches from real programs. This extraction path enables more focused testing of project specific features and lets developers inject domain knowledge into the resulting test cases. OBsmith uncovers 11 previously unknown correctness bugs. Under an equal program budget, five general purpose state-of-the-art JavaScript fuzzers (FuzzJIT, Jsfunfuzz, Superion, DIE, Fuzzilli) failed to detect these issues, highlighting OBsmith's complementary focus on obfuscation induced misbehavior. An ablation shows that all components except our generic MRs contribute to at least one bug class; the negative MR result suggests the need for obfuscator-specific metamorphic relations. Our results also seed discussion on how to balance obfuscation presets and performance cost. We envision OBsmith as an important step towards automated testing and quality assurance of obfuscators and other semantic-preserving toolchains.2025-10-11T07:02:42ZAccepted, to appear in OOPSLA 2026Proc. ACM Program. Lang. 10, OOPSLA1, Article 96 (April 2026), 30 pagesShan JiangChenguang ZhuSarfraz Khurshid10.1145/3798204http://arxiv.org/abs/2604.16322v1Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution2026-02-27T21:31:41ZInterpreting and following human instructions is a critical capability of large language models (LLMs) in automatic programming. However, synthesizing large-scale instruction-paired coding data remains largely unexplored and is particularly challenging when ensuring logical compatibility among multiple constraints. In this study, we propose IFCodeEvolve, an actor-schema co-evolution framework for instruction following coding data generation. By representing instructions as parametric function schema, we construct a library that covers the vast instruction space via dynamic constraint instantiation. Building upon this, Monte Carlo Tree Search (MCTS) sampler is applied to efficiently navigate this space, utilizing actor model feedback as a dynamic termination signal. Furthermore, to progressively explore challenging problems, we introduce a co-evolving paradigm that iteratively advances both the actor model and the schema library, via schema composition and mutation, based on sampler statistics. Empirical results demonstrate that IFCodeEvolve significantly boosts base model performance, with our 32B model achieving parity with proprietary SOTA models. Additionally, we contribute IFCodeBench, a comprehensive human-verified benchmark equipped with solutions and robust AST-based verification.2026-02-27T21:31:41ZTinglin HuangBo ChenXiao ZhangKai ShenRex Yinghttp://arxiv.org/abs/2409.19176v6Polynomial Universes in Homotopy Type Theory2026-02-27T20:42:20ZAwodey, later with Newstead, showed how polynomial functors with extra structure (termed ``natural models'') hold within them the categorical semantics for dependent type theory. Their work presented these ideas clearly but ultimately led them outside of the usual category of polynomial functors to a particular \emph{tricategory} of polynomials in order to explain all of the structure possessed by such models. This paper builds off that work -- explicating the categorical semantics of dependent type theory by axiomatizing them entirely in terms of the usual category of polynomial functors. In order to handle the higher-categorical coherences required for such an explanation, we work with polynomial functors in the language of Homotopy Type Theory (HoTT), which allows for higher-dimensional structures to be expressed purely within this category. The move to HoTT moreover enables us to express a key additional condition on polynomial functors -- \emph{univalence} -- which is sufficient to guarantee that models of type theory expressed as univalent polynomials satisfy all higher coherences of their corresponding algebraic structures, purely in virtue of being closed under the usual constructors of dependent type theory. We call polynomial functors satisfying this condition \emph{polynomial universes}. As an example of the simplification to the theory of natural models this enables, we highlight the fact that a polynomial universe being closed under dependent product types implies the existence of a distributive law of monads, which witnesses the usual distributivity of dependent products over dependent sums.2024-09-27T23:12:54ZElectronic Notes in Theoretical Informatics and Computer Science, Volume 5 - Proceedings of MFPS XLI (December 20, 2025) entics:16885C. B. AberléDavid I. Spivak10.46298/entics.16885http://arxiv.org/abs/2504.14283v5Cyclic Proofs in Hoare Logic and its Reverse2026-02-27T20:29:36ZWe examine the relationships between axiomatic and cyclic proof systems for the partial and total versions of Hoare logic and those of its dual, known as reverse Hoare logic (or sometimes incorrectness logic). In the axiomatic proof systems for these logics, the proof rules for looping constructs involve an explicit loop invariant, which in the case of the total versions additionally require a well-founded termination measure. In the cyclic systems, these are replaced by rules that simply unroll the loops, together with a principle allowing the formation of cycles in the proof, subject to a global soundness condition that ensures the well-foundedness of the circular reasoning. Interestingly, the cyclic soundness conditions for partial Hoare logic and its reverse are similar and essentially coinductive in character, while those for the total versions are also similar and essentially inductive. We show that these cyclic systems are sound, by direct argument, and relatively complete, by translation from axiomatic to cyclic proofs.2025-04-19T12:51:18ZElectronic Notes in Theoretical Informatics and Computer Science, Volume 5 - Proceedings of MFPS XLI (December 20, 2025) entics:16696James BrotherstonQuang Loc LeGauri DesaiYukihiro Oda10.46298/entics.16696http://arxiv.org/abs/2510.07851v4The Functional Machine Calculus III: Control2026-02-27T20:17:00ZThe Functional Machine Calculus (Heijltjes 2022) is a new approach to unifying the imperative and functional programming paradigms. It extends the lambda-calculus, preserving the key features of confluent reduction and typed termination, to embed computational effects, evaluation strategies, and control flow operations. The first instalment modelled sequential higher-order computation with global store, input/output, probabilities, and non-determinism, and embedded both the call-by-name and call-by-value lambda-calculus, as well as Moggi's computational metalanguage and Levy's call-by-push-value. The present paper extends the calculus from sequential to branching and looping control flow. This allows the faithful embedding of a minimal but complete imperative language, including conditionals, exception handling, and iteration, as well as constants and algebraic data types.
The calculus is defined through a simple operational semantics, extending the (simplified) Krivine machine for the lambda-calculus with multiple operand stacks to model effects and a continuation stack to model sequential, branching, and looping computation. It features a confluent reduction relation and a system of simple types that guarantees termination of the machine and strong normalization of reduction (in the absence of iteration). These properties carry over to the embedded imperative language, providing a unified functional-imperative model of computation that supports simple types, a direct and intuitive operational semantics, and a confluent reduction semantics.2025-10-09T06:48:54ZElectronic Notes in Theoretical Informatics and Computer Science, Volume 5 - Proceedings of MFPS XLI (December 20, 2025) entics:16682Willem Heijltjes10.46298/entics.16682http://arxiv.org/abs/2510.06777v4Strong Dinatural Transformations and Generalised Codensity Monads2026-02-27T19:44:31ZWe introduce dicodensity monads: a generalisation of pointwise codensity monads generated by functors to monads generated by mixed-variant bifunctors. Our construction is based on the notion of strong dinaturality (also known as Barr dinaturality), and is inspired by denotational models of certain types in polymorphic lambda calculi - in particular, a form of continuation monads with universally quantified variables, such as the Church encoding of the list monad in System F. Extending some previous results on Cayley-style representations, we provide a set of sufficient conditions to establish an isomorphism between a monad and the dicodensity monad for a given bifunctor. Then, we focus on the class of monads obtained by instantiating our construction with hom-functors and, more generally, bifunctors given by objects of homomorphisms (that is, internalised hom-sets between Eilenberg--Moore algebras). This gives us, for example, novel presentations of monads generated by different kinds of semirings and other theories used to model ordered nondeterministic computations.2025-10-08T09:00:13Z15 pagesElectronic Notes in Theoretical Informatics and Computer Science, Volume 5 - Proceedings of MFPS XLI (December 20, 2025) entics:16689Maciej PirógFilip Sieczkowski10.46298/entics.16689