https://arxiv.org/api/qhAzMDCfssnLNls2LX8UZOz7r/s2026-06-21T11:29:08Z991869015http://arxiv.org/abs/2510.23101v2Beyond Imprecise Distance Metrics: Trace-Guided Directed Greybox Fuzzing via LLM-Predicted Call Stacks2026-01-31T07:35:46ZDirected greybox fuzzing (DGF) aims to efficiently trigger bugs at specific target locations by prioritizing seeds whose execution paths are more likely to reach the targets. However, existing DGF approaches suffer from imprecise potential estimation due to their reliance on static-analysis-based distance metrics. The over-approximation inherent in static analysis causes many seeds with execution paths irrelevant to vulnerability triggering to be mistakenly prioritized, significantly reducing fuzzing efficiency. To address this issue, we propose trace-guided directed greybox fuzzing (TDGF). TDGF replaces static-analysis-based distance metrics with vulnerability-oriented execution information (referred to as guidance traces) to steer directed fuzzing: seeds whose execution paths overlap more with the guidance traces are scheduled earlier for mutation. We empirically study two representative types of guidance traces: the control-flow trace and the call-stack trace of vulnerability-triggering executions. We find that the fine-grained control-flow traces offer nearly the same guidance capability as the coarse-grained call-stack traces, while call-stack traces are also easier for large language models (LLMs) to predict. Based on this insight, we further propose a framework that leverages LLMs to predict the call stack at vulnerability-triggering time and uses it to guide DGF. We implement our approach and evaluate it against several state-of-the-art fuzzers with experiments totaling 58.4 CPU-years. On a suite of real-world programs, our approach triggers vulnerabilities 2.13$\times$ to 3.14$\times$ faster than the baselines. Moreover, through directed patch testing on the latest program versions used in our controlled experiments, our approach discovers 10 new vulnerabilities and 2 incomplete fixes, with 10 assigned CVE IDs.2025-10-27T08:17:03ZPreprint, under submissionYifan ZhangXin Zhanghttp://arxiv.org/abs/2603.19239v1Defusing Logic Bombs in Symbolic Execution with LLM-Generated Ghost Code2026-01-31T07:14:57ZSymbolic execution is a powerful program analysis technique, but its effectiveness is fundamentally limited by solver-hostile program fragments, complex numerical reasoning, and unbounded heap structures. Recent work proposed replacing constraint solvers with large language models (LLMs) to bypass these limitations, but such approaches struggle to analyze real-world codebases, where deep execution paths require globally consistent reasoning across many interacting constraints. We present Gordian, a hybrid symbolic execution framework that uses LLMs selectively to generate lightweight ghost code that aids an SMT solver in handling solver-hostile code fragments, while preserving its precise, global reasoning capability. In particular, we propose three types of ghost code: (1) inversion of difficult code fragments with iterative bidirectional constraint propagation, (2) modeling via solver-friendly surrogates while preserving relevant behavior, and (3) semantic partitioning of unbounded heap spaces. We implemented Gordian on top of the KLEE symbolic execution engine and evaluated it on synthetic "logic bombs" capturing distinct symbolic reasoning challenges, a popular mathematical library FDLibM, and three structured-input programs (libexpat, jq, and bc). Across all benchmarks, Gordian improves coverage on average by 52-84% over traditional symbolic execution baselines, and by 86-419% over LLM-based techniques, while reducing LLM token usage by an average of 90-96%. This highlights the practicality and effectiveness of this approach in real-world settings.2026-01-31T07:14:57ZDimitrios Stamatios BourasSergey Mechtaevhttp://arxiv.org/abs/2501.07047v4Leveraging ASIC AI Chips for Homomorphic Encryption2026-01-31T03:35:18ZHomomorphic Encryption (HE) provides strong data privacy for cloud services but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, there remains an order-of-magnitude energy-efficiency gap compared to specialized (but expensive) HE ASICs. This paper explores an alternate direction: leveraging existing AI accelerators, like Google's TPUs with coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with SoTA HE algorithms designed for GPUs. These algorithms rely heavily on: (1) high-precision (32-bit) integer arithmetic to now run on a TPU's low-throughput vector unit, leaving its high-throughput low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations that are inefficient on the TPU's coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle above challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU's architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking and improving the utilization of TPU's MXU for HE, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding reordering into compute kernels through offline parameter transformation. CROSS (TPU v6e) achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing AI ASIC as the SotA efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS2025-01-13T04:08:14ZIEEE International Symposium on High-Performance Computer Architecture (HPCA) 2026; 18 pages, 16 figures, 5 algorithms, 10 tables. Leveraging Google TPUs for Homomorphic Encryption2026 IEEE Symposium on High-Performance Computer ArchitectureJianming TongTianhao HuangJingtian DangLeo de CastroAnirudh ItagiAnupam GolderAsra AliJeremy KunJevin Jiang ArvindG. Edward SuhTushar Krishna10.1109/HPCA68181.2026.11408507http://arxiv.org/abs/2602.00303v1Towards Analyzing N-language Polyglot Programs2026-01-30T20:52:26ZPolyglot programming is gaining popularity as developers integrate multiple programming languages to harness their individual strengths. With the recent popularity of platforms like GraalVM and other multi-language runtimes, creating and managing these systems has become much more feasible. However, current research on analyzing multilingual programs mainly focuses on two languages, leaving out the increasing complexity of systems that use three or more. For example, modern web systems often link JavaScript, WebAssembly, and Rust within the same execution chain. This paper envisions the landscape of software systems with three-language polyglot communication. We identify fundamental challenges in analyzing them and propose a conceptual roadmap to advance static analysis techniques to address them. Our vision aims to stimulate discussion and inspire new research directions toward scalable, language-agnostic analysis frameworks for next-generation polyglot systems.2026-01-30T20:52:26ZJyoti PrakashAbhishek TiwariMikkel Baun Kjærgaardhttp://arxiv.org/abs/2505.11480v3SuperCoder: Assembly Program Superoptimization with Large Language Models2026-01-30T18:27:24ZSuperoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 assembly programs averaging 130 lines, in contrast to prior datasets restricted to 2-15 straight-line, loop-free programs. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup. Starting from Qwen2.5-Coder-7B-Instruct (61.4% correctness, 1.10x speedup), the fine-tuned model SuperCoder attains 95.0% correctness and 1.46x average speedup, with additional improvement enabled by Best-of-N sampling and iterative refinement. Our results demonstrate, for the first time, that LLMs can be applied as superoptimizers for assembly programs, establishing a foundation for future research in program performance optimization beyond compiler heuristics.2025-05-16T17:40:45ZAnjiang WeiTarun SureshHuanmi TanYinglun XuGagandeep SinghKe WangAlex Aikenhttp://arxiv.org/abs/2601.21842v2Optimal Software Pipelining using an SMT-Solver2026-01-30T14:08:20ZSoftware Pipelining is a classic and important loop-optimization for VLIW processors. It improves instruction-level parallelism by overlapping multiple iterations of a loop and executing them in parallel. Typically, it is implemented using heuristics. In this paper, we present an optimal software pipeliner based on a Satisfiability Modulo Theories (SMT) Solver. We show that our approach significantly outperforms heuristic algorithms and hand-optimization. Furthermore, we show how the solver can be used to give feedback to programmers and processor designers on why a software pipelined schedule of a certain initiation interval is not feasible.2026-01-29T15:19:12ZJan-Willem Roordahttp://arxiv.org/abs/2512.02918v2Belobog: Move Language Fuzzing Framework For Real-World Smart Contracts2026-01-30T12:05:45ZMove is a research-oriented programming language designed for secure and verifiable smart contract development and has been widely used in managing billions of digital assets in blockchains, such as Sui and Aptos. Move features a strong static type system and explicit resource semantics to enforce safety properties such as the prevention of data races, invalid asset transfers, and entry vulnerabilities. However, smart contracts written in Move may still contain certain vulnerabilities that are beyond the reach of its type system. It is thus essential to validate Move smart contracts. Unfortunately, due to its strong type system, existing smart contract fuzzers are ineffective in producing syntactically or semantically valid transactions to test Move smart contracts. This paper introduces the first fuzzing framework, Belobog, for Move smart contracts. Belobog is type-aware and ensures that all generated and mutated transactions are well-typed. More specifically, for a target Move smart contract, Belobog first constructs a type graph based on Move's type system, and then generates or mutates a transaction based on the graph trace derived from the type graph. In order to overcome the complex checks in Move smart contracts, we further design and implement a concolic executor in Belobog. We evaluated Belobog on 109 real-world Move smart contract projects. The experimental results show that Belobog is able to detect 100% critical and 79% major vulnerabilities manually audited by human experts. We further selected two recent notorious incidents in the Move ecosystem, i.e., Cetus and Nemo. Belobog successfully reproduced full exploits for both of them, without any prior knowledge. Moreover, we applied Belobog on three ongoing auditing projects and found 2 critical, 2 major, and 3 medium new vulnerabilities, all acknowledged by the project developers.2025-12-02T16:36:13ZRevised a lot with new detailsZiqiao KongWanxu XiaZhengwei LiYi LuPan LiLiqun YangYang LiuXiapu LuoShaohua Lihttp://arxiv.org/abs/2510.09726v2Herb.jl: A Unifying Program Synthesis Library2026-01-30T10:15:01ZProgram synthesis -- the automatic generation of code given a specification -- is one of the most fundamental tasks in artificial intelligence (AI) and the dream of many programmers. Numerous synthesizers have been developed for program synthesis, offering different approaches to the exponentially growing program space. Although such state-of-the-art tools exist, reusing and adapting them remains tedious and time-consuming. We propose Herb.jl, a unifying program synthesis library written in Julia, to address these issues. Since current methods share similar building blocks, we aim to break down the underlying algorithms into extendable, reusable subcomponents. To demonstrate the benefits of using Herb.jl, we show how to implement a simple problem and grammar, and how to solve it with just a few lines of code.2025-10-10T09:45:36ZTilman HinnerichsReuben Gardos ReidJaap de JongBart SwinkelsPamela WochnerNicolae FilatTudor MagurescuIssa HanouSebastijan Dumancichttp://arxiv.org/abs/2601.16670v2The Green Side of the Lua2026-01-30T07:52:17ZThe United Nations' 2030 Agenda for Sustainable Development highlights the importance of energy-efficient software to reduce the global carbon footprint. Programming languages and execution models strongly influence software energy consumption, with interpreted languages generally being less efficient than compiled ones. Lua illustrates this trade-off: despite its popularity, it is less energy-efficient than greener and faster languages such as C.
This paper presents an empirical study of Lua's runtime performance and energy efficiency across 25 official interpreter versions and just-in-time (JIT) compilers. Using a comprehensive benchmark suite, we measure execution time and energy consumption to analyze Lua's evolution, the impact of JIT compilation, and comparisons with other languages. Results show that all LuaJIT compilers significantly outperform standard Lua interpreters. The most efficient LuaJIT consumes about seven times less energy and runs seven times faster than the best Lua interpreter. Moreover, LuaJIT approaches C's efficiency, using roughly six times more energy and running about eight times slower, demonstrating the substantial benefits of JIT compilation for improving both performance and energy efficiency in interpreted languages.2026-01-23T11:40:30ZAndré BrandãoDiogo MatosMiguel GuimarãesSimão CunhaJoão Saraivahttp://arxiv.org/abs/2601.22557v1Recursive Mutexes in Separation Logic2026-01-30T04:53:38ZMutexes (i.e., locks) are well understood in separation logic, and can be specified in terms of either protecting an invariant or atomically changing the state of the lock. In this abstract, we develop the same styles of specifications for \emph{recursive} mutexes, a common variant of mutexes in object-oriented languages such as C++ and Java. A recursive mutex can be acquired any number of times by the same thread, and our specifications treat all acquires/releases uniformly, with clients only needing to determine whether they hold the mutex when accessing the lock invariant.2026-01-30T04:53:38ZRocqPL 2026-Rocq for Programming LanguagesKe DuWilliam ManskyPaolo G. GiarrussoGregory Malechahttp://arxiv.org/abs/2602.04892v1Doc2Spec: Synthesizing Formal Programming Specifications from Natural Language via Grammar Induction2026-01-30T02:58:27ZEnsuring that API implementations and usage comply with natural language programming rules is critical for software correctness, security, and reliability. Formal verification can provide strong guarantees but requires precise specifications, which are difficult and costly to write manually. To address this challenge, we present Doc2Spec, a multi-agent framework that uses LLMs to automatically induce a specification grammar from natural-language rules and then generates formal specifications guided by the induced grammar. The grammar captures essential domain knowledge, constrains the specification space, and enforces consistent representations, thereby improving the reliability and quality of generated specifications. Evaluated on seven benchmarks across three programming languages, Doc2Spec outperforms a baseline without grammar induction and achieves competitive results against a technique with a manually crafted grammar, demonstrating the effectiveness of automated grammar induction for formalizing natural-language rules.2026-01-30T02:58:27ZShihao XiaMengting HeHaomin JiaLinhai Songhttp://arxiv.org/abs/2601.19092v2Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers2026-01-28T22:37:40ZScaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to hand-tuned kernels on across latest GPU devices and multi-device environments and accelerator backends.2026-01-27T01:57:29ZBohan HouHongyi JinGuanjie WangJinqi ChenYaxing CaiLijie YangZihao YeYaoyao DingRuihang LaiTianqi Chenhttp://arxiv.org/abs/2601.21096v1Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve2026-01-28T22:34:56ZModern compilers rely on hand-crafted heuristics to guide optimization passes. These human-designed rules often struggle to adapt to the complexity of modern software and hardware and lead to high maintenance burden. To address this challenge, we present Magellan, an agentic framework that evolves the compiler pass itself by synthesizing executable C++ decision logic. Magellan couples an LLM coding agent with evolutionary search and autotuning in a closed loop of generation, evaluation on user-provided macro-benchmarks, and refinement, producing compact heuristics that integrate directly into existing compilers. Across several production optimization tasks, Magellan discovers policies that match or surpass expert baselines. In LLVM function inlining, Magellan synthesizes new heuristics that outperform decades of manual engineering for both binary-size reduction and end-to-end performance. In register allocation, it learns a concise priority rule for live-range processing that matches intricate human-designed policies on a large-scale workload. We also report preliminary results on XLA problems, demonstrating portability beyond LLVM with reduced engineering effort.2026-01-28T22:34:56ZAccepted to C4ML@CGO'26Hongzheng ChenAlexander NovikovNgân VũHanna AlamZhiru ZhangAiden GrossmanMircea TrofinAmir Yazdanbakhshhttp://arxiv.org/abs/2504.16214v3Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs2026-01-28T21:15:30ZEfficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of GPU programs depends on how data is parallelized across threads and arranged within memory subsystems. The mapping functions describing tensors on GPUs are known as \emph{tensor layouts}. Low-level programming frameworks, such as CUTLASS and Hidet, provide expressive layout abstractions but often require \emph{considerable programming effort} to manually specify optimal layouts. High-level GPU programming languages, such as Triton, rely on compiler heuristics to generate dataflow, layouts, and pipelining strategies in GPU programs. However, the heuristics for dataflow and pipelining strategies are not generalizable to complex operators. To balance expressiveness and programmability, we propose Hexcute, a compiler framework that automates layout synthesis while providing explicit control over dataflow and pipelining. Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm. This approach enables systematic exploration of optimal layouts and instructions.
Our evaluation shows that Hexcute matches the performance of libraries like cuBLAS and FlashAttention on GEMM, Attention, and their variants, while reducing the amount of code by 1.27$\times$-7.94$\times$ compared to CUTLASS. For mixed-type mixture-of-experts (MoE) operators, Hexcute achieves an average speedup of 6.46$\times$ over Triton. In the end-to-end evaluations of vLLM, Hexcute delivers up to 2.60$\times$ speedup on DeepSeek-R1-AWQ and 2.04$\times$ on a Mamba-based model.2025-04-22T19:01:28Z19 pages, 27 figuresXiao ZhangYaoyao DingBolin SunYang HuTatiana ShpeismanGennady Pekhimenkohttp://arxiv.org/abs/2601.18944v2Neural Theorem Proving for Verification Conditions: A Real-World Benchmark2026-01-28T18:25:21ZTheorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification--particularly VC proving--remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. From real-world projects such as Linux and Contiki-OS kernel, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across formal languages of Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.2026-01-26T20:37:11ZAccepted in ICLR'26Qiyuan XuXiaokun LuanRenxi WangJoshua Ong Jun LeangPeixin WangHaonan LiWenda LiConrad Watt