Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures

2025-05-01T19:19:29Z

This paper presents a mathematical framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture, in which a finite sequence of observations (tokens) is mapped to the conditional probability of the next token. Our objective is not to construct a mathematical model of a transformer. Rather, our interest lies in deriving, from first principles, transformer-like architectures that solve the prediction problem for which the transformer is designed. The proposed framework is based on an original optimal control approach, where the prediction objective (MMSE) is reformulated as an optimal control problem. An analysis of the optimal control problem is presented, leading to a fixed-point equation on the space of probability measures. To solve the fixed-point equation, we introduce the dual filter, an iterative algorithm that closely parallels the architecture of decoder-only transformers. These parallels are discussed in detail along with the relationship to prior work on mathematical modeling of transformers as transport on the space of probability measures. Numerical experiments are provided to illustrate the performance of the algorithm using parameter values used in research-scale transformer models.
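
For orientation, the prediction problem described above (mapping a finite observation sequence to the conditional distribution of the next observation) can be solved on a finite HMM with the classical forward filter. The sketch below is that textbook recursion, not the dual filter of the paper. The function name `predict_next`, the argument names, and the convention that each observation is emitted from the current hidden state are all assumptions made for illustration.

```python
def predict_next(obs, init, trans, emit):
    """Return P(next observation | obs) for a finite HMM (classical forward filter).

    Assumed conventions (illustrative, not taken from the paper):
      init[i]     = P(X_0 = i)
      trans[i][j] = P(X_{t+1} = j | X_t = i)
      emit[i][k]  = P(Y_t = k | X_t = i)
    """
    n_states = len(init)
    n_symbols = len(emit[0])
    belief = list(init)  # predicted law of the current hidden state
    for y in obs:
        # Correction step: reweight by the likelihood of the observed symbol.
        weighted = [belief[i] * emit[i][y] for i in range(n_states)]
        total = sum(weighted)
        posterior = [w / total for w in weighted]
        # Prediction step: push the posterior through the transition matrix.
        belief = [
            sum(posterior[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    # Law of the next observation under the predicted hidden-state law.
    return [
        sum(belief[i] * emit[i][k] for i in range(n_states))
        for k in range(n_symbols)
    ]


# Hypothetical two-state, two-symbol model used only as an example.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.8, 0.2], [0.3, 0.7]]
```

Calling `predict_next([0, 1, 1], INIT, TRANS, EMIT)` returns a probability vector over the two symbols; with an empty observation list it returns the marginal law of the first observation.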

A martingale-type of characterisation of the Gaussian free field and fractional Gaussian free fields

2025-05-01T18:10:53Z

We establish martingale-type characterisations for the continuum Gaussian free field (GFF) and for fractional Gaussian free fields (FGFs), using their connection to the stochastic heat equation and to fractional stochastic heat equations. The main theorem on the GFF generalizes previous results of similar flavour and the characterisation theorems on the FGFs are new. The proof strategy is to link the resampling dynamics coming from a martingale-type decomposition property to the stationary dynamics of the desired field, i.e. to the (fractional) stochastic heat equation.

The local coupling of noise technique and its application to lower error bounds for strong approximation of SDEs with irregular coefficients

2025-05-01T16:59:01Z

In recent years, interest in approximation methods for stochastic differential equations (SDEs) with non-Lipschitz continuous coefficients has increased. We show lower bounds for the $L^p$-error of such methods in the case of approximation at a single point in time or globally in time. On the one hand, we show that for a large class of piecewise Lipschitz continuous drifts and non-additive diffusions the best possible $L^p$-error rate for final time approximation that can be achieved by any method based on finitely many evaluations of the driving Brownian motion is at most $3/4$, which was previously known only for additive diffusions. Moreover, we show that the best $L^p$-error rate for global approximation that can be achieved by any method based on finitely many evaluations of the driving Brownian motion is at most $1/2$ when the drift is locally bounded and the diffusion is locally Lipschitz continuous. For the derivation of the lower bounds we introduce a new method of proof: the local coupling of noise technique. Using this technique when approximating a solution $X$ of the SDE at the final time, a lower bound for the $L^p$-error of any approximation method based on evaluations of the driving Brownian motion at the points $t_1 < \dots < t_n$ can be determined by the $L^p$-distances of solutions of the same SDE on $[t_{i-1}, t_i]$ with initial values $X_{t_{i-1}}$ and driving Brownian motions that are coupled at $t_{i-1}, t_i$ and independent, conditioned on the values of the Brownian motion at $t_{i-1}, t_i$.

Expected First Return Times for Random Walks on Bounded Grids

2025-05-02T21:56:21Z

We derive a general formula for computing the expected first return time of a random walk on a finite graph. Using this framework, we calculate the expected first return time in various settings over bounded rectangular grids with different boundary conditions.
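
As a point of reference, for a simple random walk on a connected finite graph the classical result (Kac's formula) gives the expected first return time to a vertex v as 2|E| / deg(v). The sketch below checks this on a small rectangular grid. It is not the formula derived in the paper, and the boundary rule used here (the walker picks uniformly among the neighbours that exist, with no wrap-around) is only one possible boundary condition, chosen as an assumption. All function names are illustrative.

```python
def grid_neighbours(rows, cols):
    """4-neighbour adjacency of a rows x cols grid (needs rows * cols > 1)."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            nbrs[(r, c)] = [
                (a, b) for a, b in cand if 0 <= a < rows and 0 <= b < cols
            ]
    return nbrs


def return_time_formula(nbrs, v):
    """Kac's formula for simple random walk: 2|E| / deg(v)."""
    total_degree = sum(len(ns) for ns in nbrs.values())  # equals 2|E|
    return total_degree / len(nbrs[v])


def return_time_numeric(nbrs, v, sweeps=5000):
    """Solve the hitting-time equations h(u) = 1 + mean of h over neighbours,
    with h(v) = 0, by Gauss-Seidel sweeps; then take one step away from v."""
    h = {u: 0.0 for u in nbrs}
    for _ in range(sweeps):
        for u in nbrs:
            if u != v:
                h[u] = 1.0 + sum(h[w] for w in nbrs[u]) / len(nbrs[u])
    return 1.0 + sum(h[w] for w in nbrs[v]) / len(nbrs[v])


GRID = grid_neighbours(3, 3)
```

On the 3 x 3 grid there are 12 edges, so the formula gives 12 for a corner, 8 for an edge midpoint and 6 for the centre; `return_time_numeric` reproduces these values up to the tolerance of the iteration.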

The heat flow conjecture for polynomials and random matrices

2025-05-01T16:28:01Z

We study the evolution of the roots of a polynomial of degree $N$, when the polynomial itself is evolving according to the heat flow. We propose a general conjecture for the large-$N$ limit of this evolution. Specifically, we propose (1) that the log potential of the limiting root distribution should evolve according to a certain first-order, nonlinear PDE, and (2) that the limiting root distribution at a general time should be the push-forward of the initial distribution under a certain explicit transport map. These results should hold for sufficiently small times, that is, until singularities begin to form. We offer three lines of reasoning in support of our conjecture. First, from a random matrix perspective, the conjecture is supported by a deformation theorem for the second moment of the characteristic polynomial of certain random matrix models. Second, from a dynamical systems perspective, the conjecture is supported by the computation of the second derivative of the roots with respect to time, which is formally small before singularities form. Third, from a PDE perspective, the conjecture is supported by the exact PDE satisfied by the log potential of the empirical root distribution of the polynomial, which formally converges to the desired PDE as $N\rightarrow \infty$. We also present a "multiplicative" version of the conjecture, supported by similar arguments. Finally, we verify rigorously that the conjectures hold at the level of the holomorphic moments.
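
To make the notion of a polynomial evolving under the heat flow concrete, the sketch below applies the operator $\exp(s\,d^2/dz^2)$ to a polynomial given by its coefficients. On a polynomial the exponential series terminates, so the result is exact. The time parameter `s` is left generic: how it is scaled with the degree $N$ is not stated in the abstract, so no particular normalisation is assumed here, and the function names are illustrative.

```python
def second_derivative(coeffs):
    """Second derivative of a polynomial; coeffs[k] multiplies z**k."""
    return [k * (k - 1) * c for k, c in enumerate(coeffs)][2:]


def heat_flow(coeffs, s):
    """Apply exp(s * d^2/dz^2) to a polynomial via the terminating series
    sum_k (s**k / k!) * p^(2k)."""
    result = [float(c) for c in coeffs]
    term = [float(c) for c in coeffs]  # holds (s**k / k!) * p^(2k)
    k = 0
    while True:
        term = second_derivative(term)
        if not term:
            break  # all higher even derivatives vanish
        k += 1
        term = [s * c / k for c in term]
        for i, c in enumerate(term):
            result[i] += c
    return result
```

For example, `heat_flow([-1, 0, 1], s)` sends $z^2 - 1$ to $z^2 - 1 + 2s$, whose roots $\pm\sqrt{1 - 2s}$ move toward each other and collide at $s = 1/2$; this gives a small-scale picture of roots moving until a singular time.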

Scaling limit of a weakly asymmetric simple exclusion process in the framework of regularity structures

2025-05-01T15:56:35Z

We prove that a parabolically rescaled and suitably renormalised height function of a weakly asymmetric simple exclusion process on a circle converges to the Cole-Hopf solution of the KPZ equation. This is an analogue of the celebrated result by Bertini and Giacomin from 1997 for the exclusion process on a circle with any particle density. The main goal of this article is to analyse the interacting particle system using the framework of regularity structures without applying the Gaertner transform, a discrete version of the Cole-Hopf transform which linearises the KPZ equation. Our analysis relies on the discretisation framework for regularity structures developed by Erhard and Hairer, as well as on estimates for iterated integrals with respect to cadlag martingales derived by Grazieschi, Matetski and Weber. The main technical challenge addressed in this work is the renormalisation procedure, which requires a subtle analysis of regularity-preserving discrete convolution operators.

A stochastic epidemic model with memory of the last infection and waning immunity

2025-05-01T15:32:48Z

We adapt the article of Forien, Pang, Pardoux and Zotsa (arXiv preprint arXiv:2210.04667, 2022) on epidemic models with varying infectivity and waning immunity to incorporate the memory of the last infection. To this end, we introduce a parametric approach and consider a piecewise deterministic Markov process modeling both the evolution of the parameter, also called the trait, and the age of infection of individuals over time. At each new infection, a new trait is randomly chosen for the infected individual according to a Markov kernel, and their age is reset to zero. In the large population limit, we derive a partial differential equation (PDE) that describes the density of traits and ages. The main goal is to study the conditions under which endemic equilibria exist for the deterministic PDE model and to establish an endemicity threshold that depends on the model parameters. The local stability of these equilibria is also analyzed. The endemicity threshold is computed for several examples, including models that incorporate a vaccination policy, and a local stability result is obtained for a memory-free SIS-type model.

Information geometry of tempered stable processes

2025-05-01T14:19:51Z

We find the information geometry of tempered stable processes. Beginning with the derivation of $\alpha$-divergence between two tempered stable processes, we obtain the corresponding Fisher information matrices and the $\alpha$-connections on their statistical manifolds. Furthermore, we explore statistical applications of this geometric framework. Various tempered stable processes such as generalized tempered stable processes, classical tempered stable processes, and rapidly-decreasing tempered stable processes are presented as illustrative examples.

Anticipated backward stochastic Volterra integral equations and their applications to nonzero-sum stochastic differential games

2025-05-01T11:51:08Z

In [J. Wen, Y. Shi, Stat. Probab. Lett. 156 (2020) 108599] the authors first introduced a kind of anticipated backward stochastic Volterra integral equations (anticipated BSVIEs, for short). By virtue of the duality principle, it is found in this paper that anticipated BSVIEs can be applied to the study of stochastic differential games. To this end, we investigate in depth a more general class of anticipated BSVIEs whose generator includes both pointwise time-advanced functions and average time-advanced functions. On the theoretical side, the well-posedness and the comparison theorem of anticipated BSVIEs are established, and some regularity results of adapted M-solutions are proved by applying Malliavin calculus, which cover the previous results for BSVIEs. Further, using linear anticipated BSVIEs as the adjoint equation, we present, for the first time, the maximum principle for the nonzero-sum differential game system of stochastic delay Volterra integral equations (SDVIEs, for short). As one application of the theorem, a Nash equilibrium point of the linear-quadratic differential game problem of SDVIEs is obtained.

Post-Lie deformations of pre-Lie algebras and their applications in Regularity Structures

2025-05-01T11:12:41Z

In this paper, we study post-Lie deformations of a pre-Lie algebra, namely deforming a pre-Lie algebra into a post-Lie algebra. We construct the differential graded Lie algebra that governs post-Lie deformations of a pre-Lie algebra. We also develop the post-Lie cohomology theory for a pre-Lie algebra, by which we classify infinitesimal post-Lie deformations of a pre-Lie algebra using the second cohomology group. The rigidity of such deformations is also characterized using the second cohomology group. Finally, we apply this deformation theory to Regularity Structures. We prove that the post-Lie algebraic structure on the decorated trees which appears spontaneously in Regularity Structures is a post-Lie deformation of a pre-Lie algebra.

Lévy processes under level-dependent Poissonian switching

2025-05-01T11:02:11Z

In this paper, we derive identities for the upward and downward exit problems and resolvents for a process whose motion changes between two Lévy processes if it is above (or below) a barrier $b$ and coincides with a Poissonian arrival time. This can be expressed in the form of a (hybrid) stochastic differential equation, for which the existence of its solution is also discussed. All identities are given in terms of new generalisations of scale functions (counterparts of the scale functions from the theory of Lévy processes). To illustrate the applicability of our results, the probability of ruin is obtained for a risk process with delays in the dividend payments.

The iterated Dirichlet process and applications to Bayesian inference

2025-05-01T10:53:49Z

Consider an i.i.d. sequence of random variables, taking values in some space $S$, whose underlying distribution is unknown. In problems of Bayesian inference, one models this unknown distribution as a random measure, and the law of this random measure is the prior. When $S = \{0, 1\}$, a commonly used prior is the uniform distribution on $[0, 1]$, or more generally, the beta distribution. When $S$ is finite, the analogous choice is the Dirichlet distribution. For a general space $S$, we are led naturally to the Dirichlet process (see [Ferguson, 1973]). Here, we consider an array of random variables, and in so doing are led to what we call the iterated Dirichlet process (IDP). We define the IDP and then show how to compute the posterior distribution, given a finite set of observations, using the method of sequential imputation. Ordinarily, this method requires the existence of certain joint density functions, which the IDP lacks. We therefore present a new, more general proof of the validity of sequential imputation, and show that the hypotheses of our proof are satisfied by the IDP.
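
The finite-$S$ case mentioned above has a closed-form posterior, which the sketch below spells out: a Dirichlet prior with parameters `alpha` combined with i.i.d. observations yields a Dirichlet posterior with the observed counts added to `alpha`. This is the standard conjugate update only. It does not implement the iterated Dirichlet process or sequential imputation, and the function names are illustrative assumptions.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Posterior Dirichlet parameters: prior parameters plus observed counts.

    alpha[k] is the prior parameter for outcome k; observations is a list of
    outcomes, each an integer in range(len(alpha)).
    """
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(alpha)]


def posterior_predictive(alpha, observations):
    """Probability of each outcome for the next draw, given the observations."""
    post = dirichlet_posterior(alpha, observations)
    total = sum(post)
    return [a / total for a in post]
```

With $S = \{0, 1\}$ and the uniform prior (`alpha = [1, 1]`), observing `[1, 1, 0]` gives posterior parameters `[2, 3]`, so the predictive probability of a 1 is 3/5, in agreement with Laplace's rule of succession.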

Improving the convergence of Markov chains via permutations and projections

2025-05-01T09:38:16Z

This paper aims at improving the convergence to equilibrium of finite ergodic Markov chains via permutations and projections. First, we prove that a specific mixture of permuted Markov chains arises naturally as a projection under the KL divergence or the squared-Frobenius norm. We then compare various mixing properties of the mixture with other competing Markov chain samplers and demonstrate that it enjoys improved convergence. This geometric perspective motivates us to propose samplers based on alternating projections to combine different permutations and to analyze their rate of convergence. We give necessary, and under some additional assumptions also sufficient, conditions for the projection to achieve stationarity in the limit in terms of the trace of the transition matrix. We proceed to discuss tuning strategies of the projection samplers when these permutations are viewed as parameters. Along the way, we reveal connections between the mixture and a Markov chain Sylvester's equation as well as assignment problems, and highlight how these can be used to understand and improve Markov chain mixing. We provide two examples as illustrations. In the first example, the projection sampler (with a suitable choice of the permutation) improves upon Metropolis-Hastings in a discrete bimodal distribution with a reduced relaxation time from exponential to polynomial in the system size, while in the second example, the mixture of permuted Markov chains yields a mixing time that is logarithmic in system size (with high probability under random permutation), compared to a linear mixing time in the Diaconis-Holmes-Neal sampler. Finally, we provide numerical experiments on statistical physics models to illustrate the improved mixing performance of the proposed projection samplers over standard Metropolis-Hastings.

Solutions to the stochastic thin-film equation for the range of mobility exponents $n\in (2,3)$

2025-05-01T09:25:32Z

Recently, many existence results for the stochastic thin-film equation were established in the case of a quadratic mobility exponent $n=2$, in which the noise term $\partial_x(u^\frac{n}{2}\mathcal{W})$ becomes linear. In the case of a non-quadratic mobility exponent, results are only available when $n\ge \frac{8}{3}$, leaving the interval of mobility exponents $n\in (2,\frac{8}{3})$ untreated. In this article we close this gap in the literature by presenting a proof which works under the assumption $n\in (2,3)$, i.e., the regime of weak slippage. The key idea is to use that the $\log$-entropy dissipation coincides with the energy production due to the noise. To realize this idea, we approximate the stochastic thin-film equation by stochastic thin-film equations with inhomogeneous mobility functions, which behave like a higher power near $0$. As a consequence, the approximate solutions are non-negative, which is vital for applying the $\log$-entropy estimate.

Approximation to Deep Q-Network by Stochastic Delay Differential Equations

2025-05-01T08:19:24Z

Despite the significant breakthroughs that the Deep Q-Network (DQN) has brought to reinforcement learning, its theoretical analysis remains limited. In this paper, we construct a stochastic delay differential equation (SDDE) based on the DQN algorithm and estimate the Wasserstein-1 distance between them. We provide an upper bound for the distance and prove that the distance between the two converges to zero as the step size approaches zero. This result allows us to understand DQN's two key techniques, the experience replay and the target network, from the perspective of continuous systems. Specifically, the delay term in the equation, corresponding to the target network, contributes to the stability of the system. Our approach leverages a refined Lindeberg principle and an operator comparison to establish these results.