https://arxiv.org/api/S6+5CNKs5gLk2Q7zL3poqr3qETA 2026-03-26T08:25:31Z 5878 75 15 http://arxiv.org/abs/2510.02524v2 Unraveling Syntax: How Language Models Learn Context-Free Grammars 2026-02-27T00:38:37Z

While large models achieve impressive results, their learning dynamics are far from understood. Many domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely "subgrammars". We first define subgrammars, and prove a set of fundamental theorems regarding language modeling and subgrammars. We show that language modeling loss (or equivalently the Kullback-Leibler divergence) recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. We also prove that the constant in this linear recurrence is a function of the expected recursion, a notion we introduce. We show that under additional assumptions, parametrized models learn subgrammars in parallel. Empirically, we confirm that small transformers learn subgrammars in parallel, unlike children, who first master simple substructures. We also briefly explore several other questions regarding subgrammars. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently lead to internal representations that better reflect the grammar's substructure in all cases; we also observe persistent difficulty with deeper recursion, a limitation that appears even of large language models.

2025-10-02T19:52:19Z Equal contribution by LYS and DM Laura Ying Schulz Daniel Mitropolsky Tomaso Poggio http://arxiv.org/abs/2410.18799v4 Arbitrary-arity Tree Automata and QCTL 2026-02-26T16:54:09Z

We introduce a new class of automata (which we coin EU-automata) running on infininte trees of arbitrary (finite) arity. We develop and study several algorithms to perform classical operations (union, intersection, complement, projection, alternation removal) for those automata, and precisely characterise their complexities. We also develop algorithms for solving membership and emptiness for the languages of trees accepted by EU-automata. We then use EU-automata to obtain several algorithmic and expressiveness results for the temporal logic QCTL (which extends CTL with quantification over atomic propositions) and for MSO. On the one hand, we obtain decision procedures with optimal complexity for QCTL satisfiability and model checking; on the other hand, we obtain an algorithm for translating any QCTL formula with k quantifier alternations to formulas with at most one quantifier alternation, at the expense of a $(k + 1)$-exponential blow-up in the size of the formulas. Using the same techniques, we prove that any MSO formula can be translated into a formula with at most four quantifier alternations (and only one second-order-quantifier alternation), this time with a $(k + 2)$-exponential blow-up in the size of the formula.

2024-10-24T14:51:00Z François Laroussinie Nicolas Markey http://arxiv.org/abs/2510.14841v3 On the order of lazy cellular automata 2026-02-26T15:47:31Z

We study the most elementary family of cellular automata defined over an arbitrary group universe $G$ and an alphabet $A$: the lazy cellular automata, which act as the identity on configurations in $A^G$, except when they read a unique active transition $p \in A^S$, in which case they write a fixed symbol $a \in A$. As expected, the dynamical behavior of lazy cellular automata is relatively simple, yet subtle questions arise since they completely depend on the choice of $p$ and $a$. In this paper, we investigate the order of a lazy cellular automaton $τ: A^G \to A^G$, defined as the cardinality of the set $\{ τ^k : k \in \mathbb{N} \}$. In particular, we establish a general upper bound for the order of $τ$ in terms of the fibers of $p$, and we prove that this bound is attained when $p$ is a quasi-constant pattern.

2025-10-16T16:15:16Z 14 pages Edgar Alcalá-Arroyo Alonso Castillo-Ramirez http://arxiv.org/abs/2509.00168v3 Generalised Möbius Categories and Convolution Kleene Algebras 2026-02-26T14:36:53Z

Convolution algebras on maps from structures such as monoids, groups or categories into semirings, rings or fields abound in mathematics and the sciences. Of special interest in computing are convolution algebras based on variants of Kleene algebras, which are additively idempotent semirings equipped with a Kleene star. Yet an obstacle to the construction of convolution Kleene algebras on a wide class of structures has so far been the definition of a suitable star. We show that a generalisation of Möbius categories combined with a generalisation of a classical definition of a star for formal power series allow such a construction. We discuss several instances of this construction on generalised Möbius categories: convolution Kleene algebras with tests, modal convolution Kleene algebras, concurrent convolution Kleene algebras and higher convolution Kleene algebras (e.g. on strict higher categories and higher relational monoids). These are relevant to the verification of weighted and probabilistic sequential and concurrent programs, using quantitative Hoare logics or predicate transformer algebras, as well as for algebraic reasoning in higher-dimensional rewriting. We also adapt the convolution Kleene algebra construction to Conway semirings, which is widely studied in the context of weighted automata. Finally, we compare the convolution Kleene algebra construction with a previous construction of convolution quantales and present concrete example structures in preparation for future applications.

2025-08-29T18:08:33Z 40 pages James Cranch Georg Struth Jana Wagemaker http://arxiv.org/abs/2602.23030v1 Efficient Constructions of Finite-State Independent Normal Pairs 2026-02-26T14:11:13Z

Finite-state independence is a robust notion of algorithmic independence for infinite words. It was introduced for general infinite words by Becher, Carton, and Heiber via deterministic asynchronous two-tape finite automata. Álvarez, Becher, and Carton then studied the normal case and characterized finite-state independence in terms of deterministic finite-state shufflers. A shuffler is a finite automaton that reads from two input tapes $x,y\inΣ^\infty$ and, at each step, chooses one tape to read next, outputs the symbol read, and updates its state based only on that output symbol. In terms of this characterization, two normal sources are finite-state independent if every deterministic finite-state way of shuffling (interleaving) them still produces a normal sequence. Álvarez, Becher, and Carton posed the following questions: (1) can one compute finite-state independent normal pairs efficiently, improving their doubly-exponential procedure; and (2) given a normal word $x$, can one effectively construct a normal word $y$ that is finite-state independent from $x$? We answer both questions by explicit deterministic constructions. First, we give a deterministic polynomial-time algorithm that, on input $N$, outputs the first $N$ symbols of two normal words $x$ and $y$ such that for every shuffler $S$, the shuffled output $S(x,y)$ is normal; hence $(x,y)$ is finite-state independent. Second, we solve the one-sided companion problem effectively. Given any computable normal word $x\inΣ^\infty$, we give an explicit deterministic construction of a computable normal word $y\inΣ^\infty$ such that for every shuffler $S$, the shuffled output $S(x,y)$ is normal. In particular, $x$ and $y$ are finite-state independent by the shuffler characterization theorem.

2026-02-26T14:11:13Z Subin Pulari http://arxiv.org/abs/2602.23401v1 Grammar-Constrained (CFL) Reachability: Subcubic Preprocessing, Indexing Trade-offs, and Structured Decoding Semantics 2026-02-26T07:57:56Z

We study the problem of grammar-constrained context-free language reachability in graphs, focusing on complexity and empirical performance. We present an algorithmic framework for evaluating reachability queries constrained by context-free grammars, and analyze its theoretical runtime bounds. To complement our theoretical results, we conduct an extensive empirical evaluation on a comprehensive benchmark of real-world schemas, comparing different algorithmic variants and reporting performance trade-offs. Our results highlight the impact of grammar structure and graph characteristics on reachability computation, and provide guidance for selecting efficient approaches in practice.

2026-02-26T07:57:56Z 19 pages, 4 figures, 4 tables. Code and empirical data available at: https://github.com/farukalpay/grammar-constrained-cfl-reachability-empirical Faruk Alpay Levent Sarioglu http://arxiv.org/abs/2602.22713v1 Opacity in Discrete Event Systems: A Perspective and Overview 2026-02-26T07:40:51Z

Opacity has emerged as a central confidentiality notion for information-flow security in discrete event systems (DES), capturing the requirement that an external observer (intruder) should never be able to determine with certainty whether the system is, was, or will be in a secret state. This article provides a concise, newcomer-friendly overview of opacity in DES, emphasizing core definitions and the unifying estimation viewpoint behind major opacity notions,. We summarize representative verification techniques and highlight how different observation models reshape both the problem formulation and algorithmic structure. We then review principal enforcement paradigms, ranging from opacity-enforcing supervisory control to sensor activation/information release optimization and obfuscation/editing mechanisms. Beyond finite automata, we outline how opacity has been studied in richer models such as stochastic systems, timed systems, Petri nets, and continuous/hybrid dynamics, and we briefly survey applications in robotics, location privacy, and information services. Finally, we discuss selected open challenges, including solvability under incomparable information, scalable methods beyond worst-case complexity, and opacity under intelligent or data-driven adversaries.

2026-02-26T07:40:51Z Xiang Yin http://arxiv.org/abs/2512.24796v2 LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories) 2026-02-25T20:35:18Z

While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test for structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure in compositional generalization. To overcome this, we evaluate LeanBridge, a retrieval-augmented agent that employs a retrieve-generate-verify loop. LeanBridge achieves a peak success rate of 24.0% -- doubling the performance of the best static baseline. These results empirically demonstrate that iterative refinement and dynamic library retrieval are not merely optimizations but strict necessities for neuro-symbolic reasoning in abstract domains. LeanCat offers a compact, reusable testbed for tracking progress toward reliable, research-level formalization.

2025-12-31T11:33:29Z 22 pages, 9 figures, 5 tables Rongge Xu Hui Dai Yiming Fu Jiedong Jiang Tianjiao Nie Junkai Wang Holiverse Yang Zhi-Hao Zhang http://arxiv.org/abs/2602.21895v1 Symbols frequencies in the Thue--Morse word in base $3/2$ and related conjectures 2026-02-25T13:24:47Z

We study a binary Thue--Morse-type sequence arising from the base-$3/2$ expansion of integers, an archetypal automatic sequence in a rational base numeration system. Because the sequence is generated by a periodic iteration of morphisms rather than a single primitive substitution, classical Perron--Frobenius methods do not directly apply to determine symbol frequencies. We prove that both symbols ${\tt 0},{\tt 1}$ occur with frequency $1/2$ and we show uniform recurrence and symmetry properties of its set of factors. The proof reveals a structural bridge between combinatorics on words and harmonic analysis: the first difference sequence is shown to be Toeplitz, providing dynamical rigidity, while filtered frequencies naturally encode a dyadic structure that lifts to the compact group of $2$-adic integers. In this $2$-adic setting, desubstitution becomes a linear operator on Fourier coefficients, and a spectral contraction argument enforces uniqueness of limiting densities. Our results answer several conjectures of Dekking (on a sibling sequence) and illustrate how harmonic analysis on compact groups can be fruitfully combined with substitution dynamics.

2026-02-25T13:24:47Z 39 pages, 7 figures Julien Cassaigne Bastiàn Espinoza Michel Rigo Manon Stipulanti http://arxiv.org/abs/2602.21459v1 Regular Expression Denial of Service Induced by Backreferences 2026-02-25T00:23:50Z

This paper presents the first systematic study of denial-of-service vulnerabilities in Regular Expressions with Backreferences (REwB). We introduce the Two-Phase Memory Automaton (2PMFA), an automaton model that precisely captures REwB semantics. Using this model, we derive necessary conditions under which backreferences induce super-linear backtracking runtime, even when sink ambiguity is linear -- a regime where existing detectors report no vulnerability. Based on these conditions, we identify three vulnerability patterns, develop detection and attack-construction algorithms, and validate them in practice. Using the Snort intrusion detection ruleset, our evaluation identifies 45 previously unknown REwB vulnerabilities with quadratic or worse runtime. We further demonstrate practical exploits against Snort, including slowing rule evaluation by 0.6-1.2 seconds and bypassing alerts by triggering PCRE's matching limit.

2026-02-25T00:23:50Z 24 pages, 8 figures. Submitted to USENIX Security 2026. For the code repository of detector, see https://anonymous.4open.science/r/slmad-EABE. For the code repository of measurements, see https://anonymous.4open.science/r/atkre-7D50 Yichen Liu Berk Çakar Aman Agrawal Minseok Seo James C. Davis Dongyoon Lee http://arxiv.org/abs/2504.21841v3 Neuro-Symbolic Generation of Explanations for Robot Policies with Weighted Signal Temporal Logic 2026-02-24T22:29:57Z

Neural network-based policies have demonstrated success in many robotic applications, but often lack human-explanability, which poses challenges in safety-critical deployments. To address this, we propose a neuro-symbolic explanation framework that generates a weighted signal temporal logic (wSTL) specification to describe a robot policy in a interpretable form. Existing methods typically produce explanations that are verbose and inconsistent, which hinders explainability, and loose, which do not give meaningful insights into the underlying policy. We address these issues by introducing a simplification process consisting of predicate filtering, regularization, and iterative pruning. We also introduce three novel explainability evaluation metrics -- conciseness, consistency, and strictness -- to assess explanation quality beyond conventional classification metrics. Our method is validated in three simulated robotic environments, where it outperforms baselines in generating concise, consistent, and strict wSTL explanations without sacrificing classification accuracy. This work bridges policy learning with formal methods, contributing to safer and more transparent decision-making in robotics.

2025-04-30T17:51:20Z IEEE Robotics and Automation Letters, vol. 11, pp. 3963-3970, 2026 Mikihisa Yuasa Ramavarapu S. Sreenivas Huy T. Tran 10.1109/LRA.2026.3662977 http://arxiv.org/abs/2602.21073v1 Automata Learning with an Incomplete but Inductive Teacher 2026-02-24T16:35:16Z

Active automata learning (AAL) under a Minimally Adequate Teacher (MAT) has been successfully used to infer a regular language through membership and equivalence queries. This language might not be fully characterized: we are then interested in finding any solution in a target class of possibly many regular languages. Some problems such as regular language separation or inductive invariant synthesis in the context of regular model checking (RMC) may indeed admit more than one answer. We therefore introduce IdMAT: a new teacher formalism answering queries with respect to any language in the target class, all at once. Such a teacher tailored towards invariant synthesis might provide incomplete "don't know" answers, but also inductive facts of the form "if w1 is accepted, so is w2". We pair IdMAT with a novel AAL algorithm LIndA that 1. encodes all uncertainties as a unique SAT instance and does not fork, 2. leverages incremental SAT solving and UNSAT core analysis, and 3. handles counterexamples of the simple or inductive type in a frugal manner inspired by the Rivest-Schapire refinement technique. We finally evaluate a prototype implementation in the context of regular language separation and RMC.

2026-02-24T16:35:16Z Daniel Stan Adrien Pommellet Juliette Jacquot http://arxiv.org/abs/2602.21019v1 Expregular functions 2026-02-24T15:37:03Z

Polyregular functions form a robust class of string-to-string functions with polynomial growth, as evidenced by Bojanczyk (2018). This class admits numerous descriptions and enjoys several closure properties. Most notably, polyregular functions are regularity reflecting (\ie the inverse image of a regular language is regular). In this work, we propose a robust class of string-to-string functions with exponential growth which we call expregular functions. We consider the following three models for describing them: - MSO set interpretations, which extend MSO interpretations (one of the models capturing polyregular functions), by operating on monadic variables instead of tuples of first-order variables; - yield-Hennie machines, which are branching one-tape Turing machines with bounded visit; and - Ariadne transducers, a new model of 2-way pushdown machines with a bounded visit restriction. Our main contribution is a translation from MSO set interpretations to yield-Hennie machines, which are known to be regularity reflecting (Dartois, Nguy\~{ê}n, Peyrat 2026). In particular this establishes that MSO set interpretations are regularity reflecting, which in turn settles a major conjecture about automatic structures: every automatic $ω$-word has a decidable MSO theory. Yield-Hennie machine directly translate to Ariadne transducers, and our second contribution is to prove that Ariadne transducers also translate to MSO set interpretations, thus establishing the equivalence of the three models. This is obtained by showing that Ariadne automata -- the automaton model corresponding to Ariadne transducers -- recognise regular languages.

2026-02-24T15:37:03Z Thomas Colcombet Nathan Lhote Pierre Ohlmann http://arxiv.org/abs/2406.06513v3 Repetition Threshold for Binary Automatic Sequences 2026-02-24T13:22:09Z

The critical exponent of an infinite word $\bf x$ is the supremum, over all finite nonempty factors $f$, of the exponent of $f$. In this note we show that for all integers $k\geq 2,$ there is a binary infinite $k$-automatic sequence with critical exponent $\leq 7/3$. The same conclusion holds for Fibonacci-automatic and Tribonacci-automatic sequences.

2024-06-10T17:53:28Z J. -P. Allouche N. Rampersad J. Shallit http://arxiv.org/abs/2602.19743v1 NILE: Formalizing Natural-Language Descriptions of Formal Languages 2026-02-23T11:42:56Z

This paper explores how natural-language descriptions of formal languages can be compared to their formal representations and how semantic differences can be explained. This is motivated from educational scenarios where learners describe a formal language (presented, e.g., by a finite state automaton, regular expression, pushdown automaton, context-free grammar or in set notation) in natural language, and an educational support system has to (1) judge whether the natural-language description accurately describes the formal language, and to (2) provide explanations why descriptions are not accurate. To address this question, we introduce a representation language for formal languages, Nile, which is designed so that Nile expressions can mirror the syntactic structure of natural-language descriptions of formal languages. Nile is sufficiently expressive to cover a broad variety of formal languages, including all regular languages and fragments of context-free languages typically used in educational contexts. Generating Nile expressions that are syntactically close to natural-language descriptions then allows to provide explanations for inaccuracies in the descriptions algorithmically. In experiments on an educational data set, we show that LLMs can translate natural-language descriptions into equivalent, syntactically close Nile expressions with high accuracy - allowing to algorithmically provide explanations for incorrect natural-language descriptions. Our experiments also show that while natural-language descriptions can also be translated into regular expressions (but not context-free grammars), the expressions are often not syntactically close and thus not suitable for providing explanations.

2026-02-23T11:42:56Z Tristan Kneisel Marko Schmellenkamp Fabian Vehlken Thomas Zeume