From Description to Score: Can LLMs Quantify Vulnerabilities?

2026-01-05T02:16:05Z

Manual vulnerability scoring, such as assigning Common Vulnerability Scoring System (CVSS) scores, is a resource-intensive process that is often influenced by subjective interpretation. This study investigates the potential of general-purpose large language models (LLMs), namely ChatGPT, Llama, Grok, DeepSeek, and Gemini, to automate this process by analyzing over 31{,}000 recent Common Vulnerabilities and Exposures (CVE) entries. The results show that LLMs substantially outperform the baseline on certain metrics (e.g., \textit{Availability Impact}), while offering more modest gains on others (e.g., \textit{Attack Complexity}). Moreover, model performance varies across both LLM families and individual CVSS metrics, with ChatGPT-5 attaining the highest precision. Our analysis reveals that LLMs tend to misclassify many of the same CVEs, and ensemble-based meta-classifiers only marginally improve performance. Further examination shows that CVE descriptions often lack critical context or contain ambiguous phrasing, which contributes to systematic misclassifications. These findings underscore the importance of enhancing vulnerability descriptions and incorporating richer contextual details to support more reliable automated reasoning and alleviate the growing backlog of CVEs awaiting triage.

Bithoven: Formal Safety for Expressive Bitcoin Smart Contracts

2026-01-04T08:48:34Z

The rigorous security model of Bitcoin's UTXO architecture often comes at the cost of developer usability, forcing a reliance on manual stack manipulation that leads to critical financial vulnerabilities like signature malleability, unspendable states and unconstrained execution paths. Industry standards such as Miniscript provide necessary abstractions for policy verification but do not model the full imperative logic required for complex contracts, leaving gaps in state management and resource liveness. This paper introduces Bithoven, a high-level language designed to bridge the gap between expressiveness and formal safety. By integrating a strict type checker and a resource liveness analyzer with a semantic control-flow analyzer, Bithoven eliminates major categories of consensus and logic defects defined in our fault model prior to deployment. Our results indicate that this safety comes at modest cost: Bithoven compiles to Bitcoin Script with efficiency comparable to hand-optimized code, demonstrating that type-safe, developer-friendly abstractions are viable even within the strict byte-size constraints of the Bitcoin blockchain.

Verified Purely Functional Catenable Real-Time Deques

2026-01-03T10:47:14Z

We present OCaml and Rocq implementations of Kaplan and Tarjan's purely functional, real-time catenable deques. The correctness of our Rocq implementation is machine-checked.

Piecewise Analysis of Probabilistic Programs via $k$-Induction

2026-01-02T19:47:36Z

In probabilistic program analysis, quantitative analysis aims at deriving tight numerical bounds for probabilistic properties such as expectation and assertion probability. Most previous works consider numerical bounds over the whole program state space monolithically and do not consider piecewise bounds. Not surprisingly, monolithic bounds are either conservative, or not expressive and succinct enough in general. To derive better bounds, we propose a novel approach for synthesizing piecewise bounds over probabilistic programs. First, we show how to extract useful piecewise information from latticed $k$-induction operators, and combine the piecewise information with Optional Stopping Theorem to obtain a general approach to derive piecewise bounds over probabilistic programs. Second, we develop algorithms to synthesize piecewise polynomial bounds, and show that the synthesis can be reduced to bilinear programming in the linear case, and soundly relaxed to semidefinite programming in the polynomial case. Experimental results show that our approach generates tight piecewise bounds for a wide range of benchmarks when compared with the state of the art.

Virtual Garbage Collector (VGC): A Zone-Based Garbage Collection Architecture for Python's Parallel Runtime

2026-01-01T15:12:59Z

The Virtual Garbage Collector (VGC) proposes a zone-based memory management architecture aimed at improving execution predictability and memory behavior in Python runtimes. The design explores a dual-layer model consisting of an Active VGC, responsible for managing runtime object lifecycles, and a Passive VGC, intended as a compile-time optimization layer for static allocation planning. Rather than relying on traditional heap traversal or generational heuristics, VGC introduces memory zoning and checkpoint-based state evaluation to reduce allocation churn and constrain garbage collection scope. Execution partitioning is experimentally evaluated to isolate workloads and localize memory pressure, enabling more deterministic behavior under loop-intensive, recursive, and compute-heavy workloads. This work presents the architectural principles, execution model, and experimental observations of VGC within a partition-aware runtime context. While the full realization of the dual-layer design is an ongoing effort, the results indicate that zone-based allocation and partitioned execution provide a viable foundation for improving scalability and memory predictability in Python-oriented systems.

BALI: Branch-Aware Loop Invariant Inference with Large Language Models

2025-12-31T07:13:42Z

Loop invariants are fundamental for reasoning about the correctness of iterative algorithms. However, deriving suitable invariants remains a challenging and often manual task, particularly for complex programs. In this paper, we introduce BALI, a branch-aware framework that integrates large language models (LLMs) to enhance the inference and verification of loop invariants. Our approach combines automated reasoning with branch-aware static program analysis to improve both precision and scalability. Specifically, unlike prior LLM-only guess-and-check methods, BALI first verifies branch-sequence-level (path-level) clauses with SMT and then composes them into program-level invariants. We outline its key components, present preliminary results, and discuss future directions toward fully automated invariant discovery.

Universal Conditional Logic: A Formal Language for Prompt Engineering

2025-12-31T05:27:00Z

We present Universal Conditional Logic (UCL), a mathematical framework for prompt optimization that transforms prompt engineering from heuristic practice into systematic optimization. Through systematic evaluation (N=305, 11 models, 4 iterations), we demonstrate significant token reduction (29.8%, t(10)=6.36, p < 0.001, Cohen's d = 2.01) with corresponding cost savings. UCL's structural overhead function O_s(A) explains version-specific performance differences through the Over-Specification Paradox: beyond threshold S* = 0.509, additional specification degrades performance quadratically. Core mechanisms -- indicator functions (I_i in {0,1}), structural overhead (O_s = gamma * sum(ln C_k)), early binding -- are validated. Notably, optimal UCL configuration varies by model architecture -- certain models (e.g., Llama 4 Scout) require version-specific adaptations (V4.1). This work establishes UCL as a calibratable framework for efficient LLM interaction, with model-family-specific optimization as a key research direction.

Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding

2025-12-29T20:51:38Z

Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.

Automating the Analysis of Parsing Algorithms (and other Dynamic Programs)

2025-12-29T18:19:43Z

Much algorithmic research in NLP aims to efficiently manipulate rich formal structures. An algorithm designer typically seeks to provide guarantees about their proposed algorithm -- for example, that its running time or space complexity is upper-bounded as a certain function of its input size. They may also wish to determine the necessary properties of the quantities derived by the algorithm to synthesize efficient data structures and verify type errors. In this paper, we develop a system for helping programmers to perform these types of analyses. We apply our system to a number of NLP algorithms and find that it successfully infers types, dead and redundant code, and parametric runtime and space complexity bounds.

Adaptable TeaStore: A Choreographic Approach

2025-12-29T14:35:18Z

The Adaptable TeaStore has recently been proposed as a reference model for adaptable microservice architectures. It includes different configurations, as well as scenarios requiring to transition between them. We describe an implementation of the Adaptable TeaStore based on AIOCJ, a choreographic language that allows one to program multiparty systems that can adapt at runtime to different conditions. Following the choreographic tradition, AIOCJ ensures by-construction correctness of communications (e.g., no deadlocks) before, during, and after adaptation. Adaptation is dynamic, and the adaptation scenarios need to be fully specified only at runtime. Using AIOCJ to model the Adaptable TeaStore, we showcase the strengths of the approach and its current limitations, providing suggestions for future directions for refining the paradigm (and the AIOCJ language, in particular), to better align it with real-world Cloud architectures.

RapunSL: Untangling Quantum Computing with Separation, Linear Combination and Mixing

2025-12-29T12:12:29Z

Quantum Separation Logic (QSL) has been proposed as an effective tool to improve the scalability of deductive reasoning for quantum programs. In QSL, separation is interpreted as disentanglement, and the frame rule brings a notion of entanglement-local specification (one that only talks about the qubits entangled with those acted upon by the program). In this paper, we identify two notions of locality unique to the quantum domain, and we construct a novel quantum separation logic, RapunSL, which is able to soundly reduce reasoning about superposition states to reasoning about pure states (basis-locality), and reasoning about mixed states arising from measurement to reasoning about pure states (outcome-locality). To do so, we introduce two connectives, linear combination and mixing, which together with separation provide a dramatic improvement in the scalability of reasoning, as we demonstrate on a series of challenging case studies.

Verifying Asynchronous Hyperproperties in Reactive Systems

2025-12-29T10:06:46Z

Hyperproperties are system properties that relate multiple execution traces and commonly occur when specifying information-flow and security policies. Logics like HyperLTL utilize explicit quantification over execution traces to express temporal hyperproperties in reactive systems, i.e., hyperproperties that reason about the temporal behavior along infinite executions. An often unwanted side-effect of such logics is that they compare the quantified traces synchronously. This prohibits the logics from expressing properties that compare multiple traces asynchronously, such as Zdancewic and Myers's observational determinism, McLean's non-inference, or stuttering refinement. We study the model-checking problem for a variant of asynchronous HyperLTL (A-HLTL), a temporal logic that can express hyperproperties where multiple traces are compared across timesteps. In addition to quantifying over system traces, A-HLTL features secondary quantification over stutterings of these traces. Consequently, A-HLTL allows for a succinct specification of many widely used asynchronous hyperproperties. Model-checking A-HLTL requires finding suitable stutterings, which, thus far, has been only possible for very restricted fragments or terminating systems. In this paper, we propose a novel game-based approach for the verification of arbitrary $\forall^*\exists^*$ A-HLTL formulas in reactive systems. In our method, we consider the verification as a game played between a verifier and a refuter, who challenge each other by controlling parts of the underlying traces and stutterings. A winning strategy for the verifier then corresponds to concrete witnesses for existentially quantified traces and asynchronous alignments for existentially quantified stutterings. We identify fragments for which our game-based interpretation is complete and thus constitutes a finite-state decision procedure.

Anka: A Domain-Specific Language for Reliable LLM Code Generation

2025-12-29T05:28:17Z

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.

Compiling Gradual Types with Evidence

2025-12-27T19:25:15Z

Efficiently supporting sound gradual typing in a language with structural types is challenging. To date, the Grift compiler is the only close-to-the-metal implementation of gradual typing in this setting, exploiting coercions for runtime checks, and further extended with monotonic references for efficient access to statically-typed data structures. On the language design and semantics side, the Abstracting Gradual Typing (AGT) methodology has proven fruitful to elucidate existing designs and to innovate by deriving gradualizations of a wide variety of typing disciplines and language features. Grounded in abstract interpretation, the Curry-Howard inspired runtime semantics of AGT is based on the notion of evidence for consistent judgments that evolve during reduction, monitoring the plausibility of well-typedness. While expressive and versatile, it is unclear whether such evidence-based semantics are a viable route to realize an efficient implementation of gradual typing. In this work, we explore this question by designing, implementing, and evaluating an evidence-based compiler, called GrEv. We explain how to bridge the gap between the formal semantics and the GrEv compiler implementation, and identify novel monotonic semantics. We empirically evaluate the performance of GrEv on the Grift benchmark suite. The results show that an evidence-based compiler can be competitive with, and even faster than, a coercion-based compiler, exhibiting more stability across configurations on the static-to-dynamic spectrum. In addition to enriching the space of gradual typing compilers, this work opens a direct door to exploring efficient implementations of the many advanced gradual typing disciplines formally derived with AGT in the literature.

LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization

2025-12-27T16:08:33Z

The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism)to a ground truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.