http://arxiv.org/abs/2602.10949v2 Optimal Initialization in Depth: Lyapunov Initialization and Limit Theorems for Deep Leaky ReLU Networks 2026-06-02T11:05:28Z

Effective initialization in deep networks requires an understanding of random neural networks. In this work, a rigorous probabilistic analysis of deep bias-free random Leaky ReLU networks is provided. We prove a Law of Large Numbers and a Central Limit Theorem for the logarithm of the norm of network activations, establishing that, as the number of layers increases, their growth is governed by a parameter called the Lyapunov exponent. This parameter characterizes a sharp phase transition between vanishing and exploding activations, and we calculate the Lyapunov exponent explicitly for Gaussian or orthogonal weight matrices. Our results reveal that standard methods, such as He initialization or orthogonal initialization, do not guarantee activation stability for deep networks of low width. Based on these theoretical insights, we propose a novel initialization method, referred to as Lyapunov initialization, which sets the Lyapunov exponent to zero and thereby ensures that the neural network is as stable as possible, leading empirically to improved learning.
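
A minimal simulation sketch of the quantity this abstract describes: the depth-averaged logarithm of the activation norm of a bias-free random Leaky ReLU network with Gaussian weights. The function name, width, slope, and seed are illustrative assumptions, not the authors' code. The per-layer renormalization relies on the positive homogeneity of bias-free Leaky ReLU layers, so it only prevents overflow and does not change the estimate. For large depth the returned value is a Monte Carlo estimate of the growth rate that the abstract calls the Lyapunov exponent.

```python
import math
import random


def leaky_relu(value, slope):
    """Leaky ReLU with the given negative-side slope."""
    return value if value >= 0.0 else slope * value


def estimate_log_growth_rate(width, depth, slope, weight_std, seed=0):
    """Return (1/depth) * log(||x_depth|| / ||x_0||) for one random network.

    Assumes slope > 0 so that the activation norm is almost surely nonzero.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]  # start from a unit vector

    log_growth = 0.0
    for _ in range(depth):
        y = []
        for _ in range(width):
            # One row of a fresh i.i.d. Gaussian weight matrix (no bias).
            row = [rng.gauss(0.0, weight_std) for _ in range(width)]
            pre_activation = sum(w * v for w, v in zip(row, x))
            y.append(leaky_relu(pre_activation, slope))
        norm = math.sqrt(sum(v * v for v in y))
        log_growth += math.log(norm)   # accumulate the log of the norm ratio
        x = [v / norm for v in y]      # renormalize to avoid over/underflow
    return log_growth / depth


if __name__ == "__main__":
    width, slope = 4, 0.1
    # He-style standard deviation for Leaky ReLU: sqrt(2 / ((1 + slope^2) * fan_in)).
    he_std = math.sqrt(2.0 / ((1.0 + slope ** 2) * width))
    print(estimate_log_growth_rate(width, depth=2000, slope=slope, weight_std=he_std))
```

A negative estimate suggests vanishing activations and a positive one exploding activations. Because scaling every weight by a constant c shifts the estimate by log(c), one could in principle rescale the weights to move the estimate toward zero, which is the idea the abstract attributes to Lyapunov initialization.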

2026-02-11T15:36:13Z Preprint, 44 pages Constantin Kogler Tassilo Schwarz Samuel Kittle

http://arxiv.org/abs/2511.05050v3 Estimating Bidirectional Causal Effects with Large Scale Online Kernel Learning 2026-06-02T10:50:28Z

In this study, a scalable online kernel learning framework is proposed for estimating bidirectional causal effects in systems characterized by mutual dependence and heteroskedasticity. Traditional causal inference often focuses on unidirectional effects, overlooking the common bidirectional relationships in real-world phenomena. Building on heteroskedasticity-based identification, the proposed method integrates a quasi-maximum likelihood estimator for simultaneous equation models with large scale online kernel learning. It employs random Fourier feature approximations to flexibly model nonlinear conditional means and variances, while an adaptive online gradient descent algorithm ensures computational efficiency for streaming and high-dimensional data. Results from extensive simulations demonstrate that the proposed method achieves higher accuracy and stability than single equation and polynomial approximation baselines, exhibiting lower bias and root mean squared error across various data-generating processes. These results confirm that the proposed approach effectively captures complex bidirectional causal effects with near-linear computational scaling. By combining econometric identification with modern machine learning techniques, the proposed framework offers a practical, scalable, and theoretically grounded solution for large scale causal inference in natural/social science, policy making, business, and industrial applications.
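
A minimal sketch of the random Fourier feature approximation that this abstract mentions, shown for a Gaussian kernel. It illustrates only the generic feature map, not the paper's estimator; the function names, the bandwidth parameter, and the seed are assumptions made for illustration. The inner product of two feature vectors approximates the kernel value, with error shrinking as the number of features grows.

```python
import math
import random


def make_rff_map(input_dim, num_features, bandwidth, seed=0):
    """Build a random feature map z with z(x).z(y) ~ exp(-||x-y||^2 / (2*bandwidth^2))."""
    rng = random.Random(seed)
    # Frequencies drawn from the Gaussian kernel's spectral density.
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(input_dim)]
             for _ in range(num_features)]
    # Random phases, uniform on [0, 2*pi].
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map


def gaussian_kernel(x, y, bandwidth):
    """Exact Gaussian kernel, used here only as a reference value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


if __name__ == "__main__":
    phi = make_rff_map(input_dim=2, num_features=5000, bandwidth=1.0)
    x, y = [0.3, -0.2], [-0.1, 0.4]
    approx = sum(a * b for a, b in zip(phi(x), phi(y)))
    print(approx, gaussian_kernel(x, y, 1.0))
```

Because the features are explicit finite-dimensional vectors, a linear model on top of them can be updated one observation at a time, which is what makes this kind of approximation a natural fit for online gradient descent.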

2025-11-07T07:44:06Z Proceedings of the 2025 International Conference on Data Science and Intelligent Systems (DSIS 2025), Article 65, pp. 449-455 Masahiro Tanaka 10.1109/DSIS67228.2025.11390623

http://arxiv.org/abs/2502.08006v3 Greed is Good: A Unifying Perspective on Guided Generation 2026-06-02T10:22:32Z

Training-free guided generation is a widely used and powerful technique that allows the end user to exert further control over the generative process of flow/diffusion models. Generally speaking, two families of techniques have emerged for solving this problem for gradient-based guidance: namely, posterior guidance (i.e., guidance via projecting the current sample to the target distribution via the target prediction model) and end-to-end guidance (i.e., guidance by performing backpropagation throughout the entire ODE solve). In this work, we show that these two seemingly separate families can actually be unified by looking at posterior guidance as a greedy strategy of end-to-end guidance. We explore the theoretical connections between these two families and provide an in-depth theoretical analysis of these two techniques relative to the continuous ideal gradients. Motivated by this analysis, we then show a method for interpolating between these two families, enabling a trade-off between compute and accuracy of the guidance gradients. We then validate this work on several inverse image problems and property-guided molecular generation.

2025-02-11T23:05:16Z Accepted at NeurIPS 2025 Zander W. Blasingame Chen Liu

http://arxiv.org/abs/2502.08834v4 Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers 2026-06-02T10:04:14Z

Deep generative models based on neural differential equations have become state-of-the-art for many generation tasks. These models rely on ODE/SDE solvers that integrate from a prior distribution to the data distribution; in many applications it is also highly desirable to integrate in the inverse direction. Standard solvers, however, accumulate discretization errors that prohibit exact inversion, an inaccuracy that is unacceptable in precision-critical applications. Existing inversion methods suffer from poor stability and low order of convergence, and are strictly limited to the ODE setting. In this work, we propose Rex, a family of reversible exponential (stochastic) Runge-Kutta solvers obtained by applying Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into an algebraically reversible one for both diffusion ODEs and SDEs. Beyond a rigorous theoretical analysis -- establishing arbitrary-order convergence and a non-zero region of linear stability -- we empirically demonstrate that Rex achieves near-machine-precision reconstruction and improves Boltzmann sampling with flow models as well as image generation and editing with diffusion models.

2025-02-12T22:51:54Z Accepted as an Oral presentation at ICML 2026 Zander W. Blasingame Chen Liu

http://arxiv.org/abs/2605.30253v2 Wasserstein Contraction of Coordinate Ascent Variational Inference 2026-06-02T09:51:10Z

We study the contraction in Wasserstein distance of the coordinate ascent variational inference algorithm. This is shown to hold under a transport-information inequality at the fixed points and a functional smoothness condition. The results are general and sharp, allow for local convergence guarantees, hold for general smooth manifolds, and extend to some non-smooth spaces. We consider applications to Bayesian Gaussian Mixture Models, high-dimensional Bayesian Probit Regression, and Logistic Regression with Pólya-Gamma random variables (i.e. Jaakkola-Jordan's algorithm).

2026-05-28T17:16:22Z 17 pages + 3 pages appendix, 3 figures. V2 fixes some citations not displaying properly in the appendix. No content change compared to prior version Rocco Caprio Adrien Corenflos Sam Power

http://arxiv.org/abs/2510.20372v4 Testing Most Influential Sets 2026-06-02T09:35:36Z

Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets. Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence: the heavy-tailed Fréchet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate the framework through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc heuristics with rigorous inference.

2025-10-23T09:12:29Z Published as a conference paper at ICLR 2026 Lucas D. Konrad Nikolas Kuschnig

http://arxiv.org/abs/2605.11607v2 Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty 2026-06-02T09:14:53Z

Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable latent factors and calibrated uncertainty. Building on the identifiable parameterization of Bouhaddani et al. (2018), existing fitting pipelines still face two practical bottlenecks: noise--signal coupling under joint EM/ECM updates and nontrivial handling of orthogonality constraints. Following the fixed-noise scalar-likelihood protocol, we develop an end-to-end framework that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration in one pipeline. We estimate the observation noise from the low-eigenvalue noise subspace and enforce orthogonality through exact Stiefel-manifold optimization. The noise-subspace estimator attains a signal-strength-independent leading finite-sample rate and matches a minimax lower bound, whereas a full-spectrum noise estimator carries a deterministic bias under the same model. We further extend the framework to sub-Gaussian settings via optional Gaussianization and provide closed-form standard errors through a block-structured Fisher analysis. Across synthetic high-noise settings and two multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage without post-hoc recalibration, reaches Ridge-level point accuracy on TCGA-BRCA at rank $r=3$, matches or exceeds PO2PLS on cross-view prediction while providing native calibrated uncertainty, and improves stability of parameter recovery.

2026-05-12T06:38:12Z Haoran Hu Xingce Wang

http://arxiv.org/abs/2606.03347v1 AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking 2026-06-02T08:57:38Z

Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data by separating conditioning from supervision. AugMask 1) constructs numeric inputs via conditional stochastic augmentation using lightweight auxiliary models, and 2) applies denoising supervision only to observed coordinates. In effect, augmented missing entries serve as uncertain conditioning context rather than training targets. We connect this training rule to a Rao--Blackwellized objective and show that marginalizing missing entries yields a variance-weighted sensitivity penalty, discouraging over-reliance on uncertain completions. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to outperform specialized missing-aware baselines.

2026-06-02T08:57:38Z Jungkyu Kim Taeyoung Park Kibok Lee

http://arxiv.org/abs/2506.13107v4 Honesty in Causal Forests: When It Helps and When It Hurts 2026-06-02T08:57:14Z

Causal forests estimate how treatment effects vary across individuals, guiding personalized interventions in areas like marketing, operations, and public policy. A standard practice is honest estimation: dividing the data into two samples, one to define subgroups and another to estimate treatment effects within them. This is intended to reduce overfitting and is the default in many software packages. But is it the right choice? We show that honest estimation can reduce the accuracy of estimates of individual treatment effects, especially when effect heterogeneity is substantial and datasets are large enough to detect it. The reason is a bias-variance trade-off: honesty lowers the risk of overfitting but increases the risk of underfitting by limiting the data available to detect and model heterogeneity. Across more than 7,000 benchmark datasets, we find that the cost of using honesty by default can be as high as requiring 27% more data to match the performance of models trained without it. Honesty is best understood as a form of regularization. Whether to adopt it should depend on the goals of the application and its empirical performance, not on reflexive default use.
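
The sample-splitting idea behind honest estimation can be sketched with a single threshold split on one covariate. This is a deliberately simplified illustration, not a causal forest and not the implementation in any particular package; all function names, the row layout `(x, treated, outcome)`, and the gap-based splitting criterion are assumptions chosen for brevity.

```python
import random


def random_halves(rows, seed=0):
    """Randomly divide rows into a splitting sample and an estimation sample."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def effect(rows):
    """Difference in mean outcome between treated and control rows.

    Assumes the rows contain at least one treated and one control unit.
    """
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def subgroup_effects(rows, threshold):
    """Estimated effects for the subgroups x < threshold and x >= threshold."""
    left = [r for r in rows if r[0] < threshold]
    right = [r for r in rows if r[0] >= threshold]
    return effect(left), effect(right)


def choose_threshold(rows, candidates):
    """Pick the candidate threshold with the largest gap between subgroup effects."""
    def gap(threshold):
        left, right = subgroup_effects(rows, threshold)
        return abs(right - left)
    return max(candidates, key=gap)


def honest_estimate(split_rows, estimate_rows, candidates):
    """Honest: define subgroups on one sample, estimate effects on the other."""
    threshold = choose_threshold(split_rows, candidates)
    return threshold, subgroup_effects(estimate_rows, threshold)


def adaptive_estimate(rows, candidates):
    """Non-honest: use all rows both to define subgroups and to estimate effects."""
    threshold = choose_threshold(rows, candidates)
    return threshold, subgroup_effects(rows, threshold)
```

A typical call would be `honest_estimate(*random_halves(rows), candidates)`. The trade-off the abstract describes is visible in the signatures: the honest variant never reuses a row for both steps, but each step sees only half the data, whereas the adaptive variant uses every row twice.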

2025-06-16T05:32:58Z Yanfang Hou Carlos Fernández-Loría

http://arxiv.org/abs/2412.05109v2 Generating Rectifiable Measures through Neural Networks 2026-06-02T08:37:17Z

We derive universal approximation results for the class of (countably) $m$-rectifiable measures. Specifically, we prove that $m$-rectifiable measures can be approximated as push-forwards of the one-dimensional Lebesgue measure on $[0,1]$ using ReLU neural networks with arbitrarily small approximation error in terms of Wasserstein distance. What is more, the weights in the networks under consideration are quantized and bounded and the number of ReLU neural networks required to achieve an approximation error of $\varepsilon$ is no larger than $2^{b(\varepsilon)}$ with $b(\varepsilon)=\mathcal{O}(\varepsilon^{-m}\log^2(\varepsilon))$. This result improves Lemma IX.4 in Perekrestenko et al. as it shows that the rate at which $b(\varepsilon)$ tends to infinity as $\varepsilon$ tends to zero equals the rectifiability parameter $m$, which can be much smaller than the ambient dimension. We extend this result to countably $m$-rectifiable measures and show that this rate still equals the rectifiability parameter $m$ provided that, among other technical assumptions, the measure decays exponentially on the individual components of the countably $m$-rectifiable support set.

2024-12-06T15:10:04Z Erwin Riegler Alex Bühler Yang Pan Helmut Bölcskei

http://arxiv.org/abs/2510.16462v3 Buzz, Choose, Forget: A Meta-Bandit Framework for Bee-Like Decision Making 2026-06-02T08:32:59Z

This work introduces MAYA, a sequential imitation learning model based on multi-armed bandits, designed to reproduce and predict individual bees' decisions in contextualized foraging tasks. The model accounts for bees' limited memory through a temporal window $τ$, whose optimal value is around 7 trials, with a slight dependence on weather conditions. Experimental results on real, simulated, and complementary (mice) datasets show that MAYA (particularly with the Wasserstein distance) outperforms imitation baselines and classical statistical models, while providing interpretability of individual learning strategies and enabling the inference of realistic trajectories for prospective ecological applications.

2025-10-18T12:03:15Z Emmanuelle Claeys Elena Kerjean Jean-Michel Loubes

http://arxiv.org/abs/2507.10419v3 Multiple Choice Learning of Low-Rank Adapters for Language Modeling 2026-06-02T08:21:14Z

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the winner-takes-all loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on audio and visual captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs. We release the code for applying LoRA-MCL to a wide range of language models.
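
The winner-takes-all loss named in this abstract can be illustrated generically: each example is scored by several hypotheses, and only the lowest loss per example contributes to the objective. The sketch below works on plain lists of precomputed per-hypothesis losses; the function name and input layout are assumptions, and in the abstract's setting each hypothesis would presumably correspond to one low-rank adapter.

```python
def winner_takes_all_loss(per_hypothesis_losses):
    """Average, over examples, of the smallest loss among the hypotheses.

    per_hypothesis_losses[i][k] is the loss of hypothesis k on example i.
    Returns the averaged loss and the index of the winning hypothesis per example.
    """
    winners = []
    total = 0.0
    for losses in per_hypothesis_losses:
        # The winner is the hypothesis with the lowest loss (first one on ties).
        best = min(range(len(losses)), key=losses.__getitem__)
        winners.append(best)
        total += losses[best]
    return total / len(per_hypothesis_losses), winners


if __name__ == "__main__":
    loss, winners = winner_takes_all_loss([[1.0, 0.5], [0.2, 0.9]])
    print(loss, winners)
```

In a gradient-based setup only the winning hypothesis would receive a gradient for a given example, which lets different hypotheses specialize on different plausible continuations.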

2025-07-14T16:00:51Z ICML 2026 Victor Letzelter Hugo Malard Mathieu Fontaine Gaël Richard Slim Essid Andrei Bursuc Patrick Pérez

http://arxiv.org/abs/2510.12636v5 Adapting Noise to Data: Generative Flows from 1D Processes 2026-06-02T08:11:00Z

The default Gaussian latent in flow-based generative models poses challenges when learning certain distributions such as heavy-tailed ones. We introduce a general framework for learning data-adaptive parametric prior distributions (latent noise) using one-dimensional quantile functions, optimized via the Wasserstein distance between noise and data. The quantile-based prior parameterization naturally adapts to both heavy-tailed and compactly supported distributions and shortens transport paths. Numerical results on heavy-tailed weather and image datasets confirm the method's flexibility and effectiveness achieved with negligible computational overhead.
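
Sampling latent noise through a one-dimensional quantile function amounts to inverse transform sampling: draw a uniform variate and push it through the quantile function. The sketch below uses a hypothetical symmetric quantile function with generalized-Pareto-type tails, chosen here only as an example of a heavy-tailed family; the paper's actual parameterization and its Wasserstein-based fitting are not reproduced, and all names and default values are assumptions.

```python
import math
import random


def heavy_tail_quantile(u, scale=1.0, tail=0.5):
    """Hypothetical symmetric quantile function for u in (0, 1), with tail > 0.

    Larger `tail` values give heavier tails; the median is 0.
    """
    centered = 2.0 * u - 1.0          # maps (0, 1) to (-1, 1)
    p = abs(centered)
    magnitude = scale * ((1.0 - p) ** (-tail) - 1.0) / tail
    return math.copysign(magnitude, centered)


def sample_noise(num_samples, scale=1.0, tail=0.5, seed=0):
    """Inverse transform sampling: uniform draws pushed through the quantile function."""
    rng = random.Random(seed)
    eps = 1e-12  # keep u strictly inside (0, 1) so the quantile stays finite
    samples = []
    for _ in range(num_samples):
        u = min(max(rng.random(), eps), 1.0 - eps)
        samples.append(heavy_tail_quantile(u, scale, tail))
    return samples


if __name__ == "__main__":
    print(sample_noise(5))
```

Since the quantile function is differentiable in `scale` and `tail`, a parameterization of this kind could be fitted by gradient-based optimization, which is one reason quantile functions are a convenient way to describe a learnable one-dimensional prior.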

2025-10-14T15:30:28Z ICML 2026 Jannis Chemseddine Gregor Kornhardt Richard Duong Gabriele Steidl

http://arxiv.org/abs/2605.05629v3 Spherical Flows for Sampling Categorical Data 2026-06-02T08:07:13Z

We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.

2026-05-07T03:34:00Z Jannis Chemseddine Gregor Kornhardt Gabriele Steidl

http://arxiv.org/abs/2606.03292v1 Combining Statistical Features and Deep Encodings for Rehearsal-Based Class-Incremental Time Series Classification 2026-06-02T08:02:42Z

Many systems used in real-world environments require adding new categories and incorporating new information without forgetting what was previously learnt by the classification model. This is known as class-incremental continual learning and, in the case of multivariate time series, it is further complicated by the temporal structure of the data. In this paper, we present a novel approach to class-incremental continual learning for the classification of multivariate time series data, based on a dual-stream feature extraction pipeline that combines deep temporal embedding features generated by a pre-trained frozen foundation model with statistical features. Evaluated on five benchmark datasets, the proposed system achieves competitive average accuracy across all datasets while maintaining low forgetting rates across all experimental configurations.

2026-06-02T08:02:42Z Pablo García-Santaclara Bruno Fernández-Castro Rebeca Pilar Díaz-Redondo