https://arxiv.org/api/D6j/zB2HpOqzuRnKzjVi4KDPv+U2026-06-25T02:16:01Z9941115515http://arxiv.org/abs/2509.20534v2Efficient Symbolic Computation via Hash Consing2025-09-26T01:51:02ZSymbolic computation systems suffer from memory inefficiencies due to redundant storage of structurally identical subexpressions, commonly known as expression swell, which degrades performance in both classical computer algebra and emerging AI-driven mathematical reasoning tools. In this paper, we present the first integration of hash consing into JuliaSymbolics, a high-performance symbolic toolkit in Julia, by employing a global weak-reference hash table that canonicalizes expressions and eliminates duplication. This approach reduces memory consumption and accelerates key operations such as differentiation, simplification, and code generation, while seamlessly integrating with Julia's metaprogramming and just-in-time compilation infrastructure. Benchmark evaluations across different computational domains reveal substantial improvements: symbolic computations are accelerated by up to 3.2 times, memory usage is reduced by up to 2 times, code generation is up to 5 times faster, function compilation up to 10 times faster, and numerical evaluation up to 100 times faster for larger models. While certain workloads with fewer duplicate unknown-variable expressions show more modest gains or even slight overhead in initial computation stages, downstream processing consistently benefits significantly. These findings underscore the importance of hash consing in scaling symbolic computation and pave the way for future work integrating hash consing with e-graphs for enhanced equivalence-aware expression sharing in AI-driven pipelines.2025-09-24T20:06:56ZBowen ZhuAayush SabharwalSongchen TanYingbo MaAlan EdelmanChristopher Rackauckashttp://arxiv.org/abs/2509.20380v2ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation2025-09-26T01:37:37ZThe increasing ubiquity of GPUs is accompanied by the increasing complexity of their hardware and parallel programming frameworks. Directive-based parallel programming standards like OpenACC simplify GPU programming to some extent by abstracting away low-level complexities, but a fair amount of expertise is still required in order to use those directives effectively.
We introduce ACCeLLiuM, two open weights Large Language Models specifically fine-tuned for generating expert OpenACC directives for data-parallel loops, along with the supervised fine-tuning dataset that was used to train them. The ACCeLLiuM SFT dataset contains 4,033 OpenACC pragma-loop pairs mined from public GitHub C/C++ repositories, with 3,223 pairs for training and 810 for testing. Experimental evaluations show a pronounced performance gap in generating correct OpenACC pragmas between base LLMs and our fine-tuned versions. On the held-out test set, base LLMs fail to consistently generate valid pragmas, whereas LLMs fine-tuned on the ACCeLLiuM dataset generate valid pragmas with the correct directive type for $87\%$ of the data-parallel loops, and exact pragmas--including directives, clauses, clause order, and clause variables--for $50\%$ of the cases. Even when not exact, generated pragmas frequently incorporate the correct clauses in a different order than the ground-truth label, or include additional clauses that enable finer control over parallel execution, data movement, and concurrency, offering practical value beyond strict string-matching. By publicly releasing the code, models, and dataset as ACCeLLiuM we hope to establish a reproducible benchmark for LLM-powered OpenACC pragma generation, and lower the barrier to automated GPU offloading of serially written programs.2025-09-20T20:41:32ZSamyak JhaveriVanessa KlotzmannCrista Lopeshttp://arxiv.org/abs/2509.21039v1Mojo: MLIR-Based Performance-Portable HPC Science Kernels on GPUs for the Python Ecosystem2025-09-25T11:45:29ZWe explore the performance and portability of the novel Mojo language for scientific computing workloads on GPUs. As the first language based on the LLVM's Multi-Level Intermediate Representation (MLIR) compiler infrastructure, Mojo aims to close performance and productivity gaps by combining Python's interoperability and CUDA-like syntax for compile-time portable GPU programming. We target four scientific workloads: a seven-point stencil (memory-bound), BabelStream (memory-bound), miniBUDE (compute-bound), and Hartree-Fock (compute-bound with atomic operations); and compare their performance against vendor baselines on NVIDIA H100 and AMD MI300A GPUs. We show that Mojo's performance is competitive with CUDA and HIP for memory-bound kernels, whereas gaps exist on AMD GPUs for atomic operations and for fast-math compute-bound kernels on both AMD and NVIDIA GPUs. Although the learning curve and programming requirements are still fairly low-level, Mojo can close significant gaps in the fragmented Python ecosystem in the convergence of scientific computing and AI.2025-09-25T11:45:29ZAccepted at the IEEE/ACM SC25 Conference WACCPD Workshop. The International Conference for High Performance Computing, Networking, Storage, and Analysis, St. Louis, MO, Nov 16-21, 2025. 15 pages, 7 figures. WFG and TM contributed equallyWilliam F. GodoyTatiana MelnichenkoPedro Valero-LaraWael ElwasifPhilip FacklerRafael Ferreira Da SilvaKeita TeranishiJeffrey S. Vetter10.1145/3731599.3767573http://arxiv.org/abs/2404.08163v2ViCAR: Visualizing Categories with Automated Rewriting in Coq2025-09-25T09:18:08ZWe present ViCAR, a library for working with monoidal categories in the Coq proof assistant. ViCAR provides definitions for categorical structures that users can instantiate with their own verification projects. Upon verifying relevant coherence conditions, ViCAR gives a set of lemmas and tactics for manipulating categorical structures. We also provide a visualizer that can display any composition and tensor product of morphisms as a string diagram, showing its categorical structure. This enables graphical reasoning and automated rewriting for Coq projects with monoidal structures.2024-04-11T23:59:14ZIn Proceedings ACT 2024, arXiv:2509.18357EPTCS 429, 2025, pp. 234-248Bhakti ShahWillam SpencerLaura ZielinskiBen CaldwellAdrian LehmannRobert Rand10.4204/EPTCS.429.13http://arxiv.org/abs/2509.17343v4Quantum Simulation Programming via Typing2025-09-25T03:20:07ZQuantum simulations are designed to model quantum systems, and many compilation frameworks have been developed for executing such simulations on quantum computers. Most compilers leverage the capabilities of digital and analog quantum computers by representing quantum particle systems with Pauli strings or digital quantum circuits, making it challenging for users in physics, chemistry, and biology to program simulations effectively. QBLUE is proposed as the first programming language for describing the behaviors of quantum systems in terms of second quantization Hamiltonians. Within QBLUE, a novel type system is proposed to clearly define states across different quantum systems and treat quantum computers as quantum particle systems of specific types. The type system is compatible with the compilation of quantum simulations expressed in QBLUE for digital and analog quantum computers. With QBLUE, users can specify the desired quantum particle system and model the system on quantum computers.2025-09-22T04:15:18Zaccepted talk paper in the 22nd International Conference on Quantum Physics and Logic (QPL 2025)Liyi LiFederico ZaharievChandeepa DissanayakeJae SwanepoelAmr SabryMark S. Gordonhttp://arxiv.org/abs/2509.20518v1Enhancing Python Programming Education with an AI-Powered Code Helper: Design, Implementation, and Impact2025-09-24T19:43:46ZThis is the study that presents an AI-Python-based chatbot that helps students to learn programming by demonstrating solutions to such problems as debugging errors, solving syntax problems or converting abstract theoretical concepts to practical implementations. Traditional coding tools like Integrated Development Environments (IDEs) and static analyzers do not give robotic help while AI-driven code assistants such as GitHub Copilot focus on getting things done. To close this gap, our chatbot combines static code analysis, dynamic execution tracing, and large language models (LLMs) to provide the students with relevant and practical advice, hence promoting the learning process. The chatbots hybrid architecture employs CodeLlama for code embedding, GPT-4 for natural language interactions, and Docker-based sandboxing for secure execution. Evaluated through a mixed-methods approach involving 1,500 student submissions, the system demonstrated an 85% error resolution success rate, outperforming standalone tools like pylint (62%) and GPT-4 (73%). Quantitative results revealed a 59.3% reduction in debugging time among users, with pre- and post-test assessments showing a 34% improvement in coding proficiency, particularly in recursion and exception handling. Qualitative feedback from 120 students highlighted the chatbots clarity, accessibility, and confidence-building impact, though critiques included occasional latency and restrictive code sanitization. By balancing technical innovation with pedagogical empathy, this research provides a blueprint for AI tools that prioritize educational equity and long-term skill retention over mere code completion. The chatbot exemplifies how AI can augment human instruction, fostering deeper conceptual understanding in programming education.2025-09-24T19:43:46Z20 pages, 16 figuresSayed Mahbub Hasan AmiriMd Mainul Islamhttp://arxiv.org/abs/2509.20426v1Dual-Language General-Purpose Self-Hosted Visual Language and new Textual Programming Language for Applications2025-09-24T17:43:55ZMost visual programming languages (VPLs) are domain-specific, with few general-purpose VPLs like Programming Without Coding Technology (PWCT). These general-purpose VPLs are developed using textual programming languages and improving them requires textual programming. In this thesis, we designed and developed PWCT2, a dual-language (Arabic/English), general-purpose, self-hosting visual programming language. Before doing so, we specifically designed a textual programming language called Ring for its development. Ring is a dynamically typed language with a lightweight implementation, offering syntax customization features. It permits the creation of domain-specific languages through new features that extend object-oriented programming, allowing for specialized languages resembling Cascading Style Sheets (CSS) or Supernova language. The Ring Compiler and Virtual Machine are designed using the PWCT visual programming language where the visual implementation is composed of 18,945 components that generate 24,743 lines of C code, which increases the abstraction level and hides unnecessary details. Using PWCT to develop Ring allowed us to realize several issues in PWCT, which led to the development of the PWCT2 visual programming language using the Ring textual programming language. PWCT2 provides approximately 36 times faster code generation and requires 20 times less storage for visual source files. It also allows for the conversion of Ring code into visual code, enabling the creation of a self-hosting VPL that can be developed using itself. PWCT2 consists of approximately 92,000 lines of Ring code and comes with 394 visual components. PWCT2 is distributed to many users through the Steam platform and has received positive feedback, On Steam, 1772 users have launched the software, and the total recorded usage time exceeds 17,000 hours, encouraging further research and development.2025-09-24T17:43:55ZPhD thesisMahmoud Samir Fayedhttp://arxiv.org/abs/2509.15754v2Hornet Node and the Hornet DSL: A Minimal, Executable Specification for Bitcoin Consensus2025-09-24T04:04:17ZBitcoin's consensus rules are encoded in the implementation of its reference client: "The code is the spec." Yet this code is unsuitable for formal verification due to side effects, mutable state, concurrency, and legacy design. A standalone formal specification would enable verification both across versions of the reference client and against new client implementations, strengthening decentralization by reducing the risk of consensus-splitting bugs. Yet such a specification has long been considered intractable given the complexity of Bitcoin's consensus logic. We demonstrate a compact, executable, declarative C++ specification of Bitcoin consensus rules that syncs mainnet to tip in a few hours on a single thread. We also introduce the Hornet Domain-Specific Language (DSL) specifically designed to encode these rules unambiguously for execution, enabling formal reasoning, consensus code generation, and AI-driven adversarial testing. Our spec-driven client Hornet Node offers a modern and modular complement to the reference client. Its clear, idiomatic style makes it suitable for education, while its performance makes it ideal for experimentation. We highlight architectural contributions such as its layered design, efficient data structures, and strong separation of concerns, supported by production-quality code examples. We argue that Hornet Node and Hornet DSL together provide the first credible path toward a pure, formal, executable specification of Bitcoin consensus.2025-09-19T08:30:27ZToby Sharphttp://arxiv.org/abs/2509.19613v1Compilation as Multi-Language Semantics2025-09-23T22:05:36ZModeling interoperability between programs in different languages is a key problem when modeling verified and secure compilation, which has been successfully addressed using multi-language semantics. Unfortunately, existing models of compilation using multi-language semantics define two variants of each compiler pass: a syntactic translation on open terms to model compilation, and a run-time translation of closed terms at multi-language boundaries to model interoperability.
In this talk, I discuss work-in-progress approach to uniformly model a compiler entirely as a reduction system on open term in a multi-language semantics, rather than as a syntactic translation. This simultaneously defines the compiler and the interoperability semantics, reducing duplication. It also provides interesting semantic insights. Normalization of the cross-language redexes performs ahead-of-time (AOT) compilation. Evaluation in the multi-language models just-in-time (JIT) compilation. Confluence of multi-language reduction implies compiler correctness, and part of the secure compilation proof (full abstraction), enabling focus on the difficult part of the proof. Subject reduction of the multi-language reduction implies type-preservation of the compiler.2025-09-23T22:05:36ZWilliam J. Bowmanhttp://arxiv.org/abs/2509.19607v1Macro-embedding Compiler Intermediate Languages in Racket2025-09-23T21:53:47ZWe present the design and implementation of a macro-embedding of a family of compiler intermediate languages, from a Scheme-like language to x86-64, into Racket. This embedding is used as part of a testing framework for a compilers course to derive interpreters for all the intermediate languages. The embedding implements features including safe, functional abstractions as well as unsafe assembly features, and the interactions between the two at various intermediate stages.
This paper aims to demonstrate language-oriented techniques and abstractions for implementing (1) a large family of languages and (2) interoperability between low- and high-level languages. The primary strength of this approach is the high degree of code reuse and interoperability compared to implementing each interpreter separately. The design emphasizes modularity and compositionality of an open set of language features by local macro expansion into a single host language, rather than implementing a language pre-defined by a closed set of features. This enables reuse from both the host language (Racket) and between intermediate languages, and enables interoperability between high- and low-level features, simplifying development of the intermediate language semantics. It also facilitates extending or redefining individual language features in intermediate languages, and exposing multiple interfaces to the embedded languages.2025-09-23T21:53:47ZWilliam J. Bowmanhttp://arxiv.org/abs/2509.19459v1Automated Insertion of Flushes and Fences for Persistency2025-09-23T18:14:21ZCXL shared memory and persistent memory allow the contents of memory to persist beyond crashes. Stores to persistent or CXL memory are typically not immediately made persistent; developers must manually flush the corresponding cache lines to force the data to be written to the underlying storage. Correctly using flush and fence operations is known to be challenging. While state-of-the-art tools can find missing flush instructions, they often require bug-revealing test cases. No existing tools can ensure the absence of missing flush bugs.
In this paper, we present PMRobust, a compiler that automatically inserts flush and fence operations to ensure that code using persistent memory is free from missing flush and fence bugs. PMRobust employs a novel static analysis with optimizations that target newly allocated objects. We have evaluated PMRobust on persistent memory libraries and several persistent memory data structures and measured a geometric mean overhead of 0.26% relative to the original benchmarks with hand-placed flush and fence operations.2025-09-23T18:14:21ZYutong GuoWeiyu LuoBrian Demskyhttp://arxiv.org/abs/2212.02754v3Petri Nets-based Methods on Automatically Detecting for Concurrency Bugs in Rust Programs2025-09-23T00:55:32ZRust's memory safety guarantees, notably ownership and lifetime systems, have driven its widespread adoption. Concurrency bugs still occur in Rust programs, and existing detection approaches exhibit significant limitations: static analyzers suffer from context insensitivity and high false positives, while dynamic methods incur prohibitive runtime costs due to exponential path exploration. This paper presents a Petri net-based method for efficient, precise detection of Rust concurrency bugs. The method rests on three pillars: (1) A syntax-preserving program-to-Petri-net transformation tailored for target bug classes; (2) Semantics-preserving state compression via context-aware slicing; (3) Bug detection through efficient Petri net reachability analysis.
The core innovation is its rigorous, control-flow-driven modeling of Rust's ownership semantics and synchronization primitives within the Petri net structure, with data operations represented as token movements. Integrated pointer analysis automates alias identification during transformation. Experiments on standard Rust concurrency benchmarks demonstrate that our method outperforms the state-of-the-art methods LockBud and Miri that are both tools of detecting concurrency bugs of Rust programs. Compared to LockBud, our approach reduces false positives by 35.7\% and false negatives by 28.3\% , which is obtained through our precise flow-sensitive pointer analysis. Compared with Miri that is a dynamic analysis tool, although Miri can obtain the same detection results, our method achieves 100% faster verification speed since our method takes a state reduce algorithm.2022-12-06T05:11:21ZKaiwen ZhangGuanjun Liuhttp://arxiv.org/abs/2412.12578v2Enabling the Verification and Formalization of Hybrid Quantum-Classical Computing with OpenQASM 3.0 compatible QASM-TS 2.02025-09-22T21:15:13ZThe unique features of the hybrid quantum-classical computing model implied by the specification of OpenQASM 3.0 motivate new approaches to quantum program verification. We implement and thoroughly test a QASM 3.0 parser in TypeScript to enable implementations of verification and validation software, compilers, and more. We aim to help the community to formalize the logic of hybrid quantum-classical computing by providing tools that may help with such efforts.2024-12-17T06:13:31ZAccepted versionJournal of Open Source Software (2025), 10(113), 8696Sean KimMarcus Edwards10.21105/joss.08696http://arxiv.org/abs/2504.10369v2SymRTLO: Enhancing RTL Code Optimization with LLMs and Neuron-Inspired Symbolic Reasoning2025-09-22T15:39:22ZOptimizing Register Transfer Level (RTL) code is crucial for improving the power, performance, and area (PPA) of digital circuits in the early stages of synthesis. Manual rewriting, guided by synthesis feedback, can yield high-quality results but is time-consuming and error-prone. Most existing compiler-based approaches have difficulty handling complex design constraints. Large Language Model (LLM)-based methods have emerged as a promising alternative to address these challenges. However, LLM-based approaches often face difficulties in ensuring alignment between the generated code and the provided prompts. This paper presents SymRTLO, a novel neuron-symbolic RTL optimization framework that seamlessly integrates LLM-based code rewriting with symbolic reasoning techniques. Our method incorporates a retrieval-augmented generation (RAG) system of optimization rules and Abstract Syntax Tree (AST)-based templates, enabling LLM-based rewriting that maintains syntactic correctness while minimizing undesired circuit behaviors. A symbolic module is proposed for analyzing and optimizing finite state machine (FSM) logic, allowing fine-grained state merging and partial specification handling beyond the scope of pattern-based compilers. Furthermore, a fast verification pipeline, combining formal equivalence checks with test-driven validation, further reduces the complexity of verification. Experiments on the RTL-Rewriter benchmark with Synopsys Design Compiler and Yosys show that SymRTLO improves power, performance, and area (PPA) by up to 43.9%, 62.5%, and 51.1%, respectively, compared to the state-of-the-art methods.2025-04-14T16:15:55ZNeurIPS 2025Yiting WangWanghao YePing GuoYexiao HeZiyao WangBowei TianShwai HeGuoheng SunZheyu ShenSihan ChenAnkur SrivastavaQingfu ZhangGang QuAng Lihttp://arxiv.org/abs/2509.17795v1Efficient Linearizability Monitoring2025-09-22T13:52:06ZThis paper revisits the fundamental problem of monitoring the linearizability of concurrent stacks, queues, sets, and multisets. Given a history of a library implementing one of these abstract data types, the monitoring problem is to answer whether the given history is linearizable. For stacks, queues, and (multi)sets, we present monitoring algorithms with complexities $\mathcal{O}(n^2)$, $\mathcal{O}(n\; log\, n)$, and $\mathcal{O}{(n)}$, respectively, where $n$ is the number of operations in the input history. For stacks and queues, our results hold under the standard assumption of {\it data-independence}, i.e., the behavior of the library is not sensitive to the actual values stored in the data structure. Past works to solve the same problems have cubic time complexity and (more seriously) have correctness issues: they either (i) lack correctness proofs or (ii) the suggested correctness proofs are erroneous (we present counter-examples), or (iii) have incorrect algorithms. Our improved complexity results rely on substantially different algorithms for which we provide detailed proofs of correctness. We have implemented our stack and queue algorithms in LiMo (Linearizability Monitor). We evaluate LiMo and compare it with the state-of-the-art tool Violin -- whose correctness proofs we have found errors in -- which checks for linearizability violations. Our experimental evaluation confirms that LiMo outperforms Violin regarding both efficiency and scalability.2025-09-22T13:52:06ZParosh Aziz AbdullaSamuel GrahnBengt JonssonShankaranarayanan KrishnaOm Swostik Mishra