https://arxiv.org/api/yRH0TaNdyYSoZt2fbBY973VxTKM
2026-03-20T10:53:49Z

http://arxiv.org/abs/2602.06175v2
Optimal rates for density and mode estimation with expand-and-sparsify representations
2026-03-18T21:42:13Z
Expand-and-sparsify representations are a class of theoretical models that capture sparse representation phenomena observed in the sensory systems of many animals. At a high level, these representations map an input $x \in \mathbb{R}^d$ to a much higher dimension $m \gg d$ via random linear projections before zeroing out all but the $k \ll m$ largest entries. The result is a $k$-sparse vector in $\{0,1\}^m$. We study the suitability of this representation for two fundamental statistical problems: density estimation and mode estimation. For density estimation, we show that a simple linear function of the expand-and-sparsify representation produces an estimator with minimax-optimal $\ell_{\infty}$ convergence rates. In mode estimation, we provide simple algorithms on top of our density estimator that recover single or multiple modes at optimal rates up to logarithmic factors under mild conditions.
2026-02-05T20:27:51Z
Accepted at AISTATS 2026
Kaushik Sinha, Christopher Tosh

http://arxiv.org/abs/2307.12544v2
Adaptive debiased machine learning using data-driven model selection techniques
2026-03-18T20:56:47Z
Debiased machine learning estimators for smooth functionals in nonparametric models can exhibit substantial variability and instability, often leading practitioners to instead rely on parametric or semiparametric working models. Such models, however, may be misspecified and can therefore introduce bias. We study how data-driven model selection can be combined with debiased machine learning to construct estimators that adapt to structure in the data-generating distribution. To this end, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework for constructing superefficient estimators of pathwise differentiable parameters.
The framework unifies a broad class of previously proposed adaptive estimators, including methods based on variable selection, learned feature representations, and collaborative targeted learning. It requires only high-level conditions and approximate validity of the selection procedure, which are implied by lower-level conditions already assumed in important settings, including sieve-based selection, sparsity-based methods such as the Lasso, and data-adaptive feature representations. We show that ADML estimators yield regular and efficient root-\(n\) inference for an oracle projection parameter induced by a data-adaptive oracle submodel. This oracle parameter coincides with the target parameter at the true distribution but typically has a smaller efficiency bound, thereby yielding superefficiency for the target parameter. As a practical illustration, we introduce a broad class of automatic ADML estimators for continuous linear functionals of the outcome regression, in which model selection is performed directly on the regression itself. Motivated by overlap challenges in causal inference, we develop new superefficient plug-in estimators for the average treatment effect based on calibration in semiparametric regression models.
2023-07-24T06:16:17Z
Lars van der Laan, Marco Carone, Alex Luedtke, Mark van der Laan

http://arxiv.org/abs/2510.20742v2
Bayesian Prediction under Moment Conditioning
2026-03-18T19:58:00Z
Prediction is a central task of statistics and machine learning, yet many inferential settings provide only partial information, typically in the form of moment constraints or estimating equations. We develop a finite, fully Bayesian framework for propagating such partial information through predictive distributions. Building on de Finetti's representation theorem, we construct a curvature-adaptive version of exchangeable updating that operates directly under finite constraints, yielding an explicit discrete-Gaussian mixture that quantifies predictive uncertainty.
The resulting finite-sample bounds depend on the smallest eigenvalue of the information-geometric Hessian, which measures the curvature and identification strength of the constraint manifold. This approach unifies empirical likelihood, Bayesian empirical likelihood, and generalized method-of-moments estimation within a common predictive geometry. On the operational side, it provides computable curvature-sensitive uncertainty bounds for constrained prediction; on the theoretical side, it recovers de Finetti's coherence, Doob's martingale convergence and local asymptotic normality as limiting cases of the same finite mechanism. Our framework thus offers a constructive bridge between partial information and full Bayesian prediction.
2025-10-23T17:03:17Z
Fixed typos, updated references, minor notational clarifications added
Nicholas G. Polson, Daniel Zantedeschi

http://arxiv.org/abs/2603.18204v1
Highly Adaptive Empirical Risk Minimization with Principal Components
2026-03-18T18:56:17Z
The Highly Adaptive Lasso (HAL) delivers unprecedented guarantees in nonparametric minimum loss estimation under minimal smoothness assumptions, such as dimension-free minimax optimal rates. However, the practical use of HAL has been severely limited by its exponentially growing, computationally prohibitive indicator basis expansion in moderate to high dimensions. Existing screening strategies drastically reduce this dimension but lack any theoretical justification. We introduce the Principal Component Highly Adaptive (PC-HA) family of estimators, which for the first time provide a principled and theoretically valid dimension reduction. We establish formal results on the score equations solved by these PC-HA estimators, allowing plug-in efficiency and pointwise asymptotic normality results to be transferred from HAL to these PC-HA estimators, under comparable complexity control.
2026-03-18T18:56:17Z
Carlos García Meixide, Mingxun Wang, Alejandro Schuler, Mark J.
van der Laan

http://arxiv.org/abs/2603.17984v1
On min-Storey estimators for multiple testing and conformal novelty detection
2026-03-18T17:46:33Z
In a multiple testing task, finding an appropriate estimator of the proportion $π_0$ of non-signal in the data to boost power of false discovery rate (FDR) controlling procedures is a long-standing research theme, sometimes referred to as 'adaptive FDR control'. The interest in this theme has been reinforced in recent years with conformal novelty detection, for which it turns out that similar tools can be used in combination with any 'blackbox' machine learning algorithm. Nevertheless, perhaps surprisingly, finding a solution for 'adaptive FDR control' that is optimal in a broad sense is still an open problem. This paper fills this gap by introducing new $π_0$-estimators, referred to as min-Storey (MS) and interval-min-Storey (IMS), which are built upon the so-called 'Storey estimator'. Plugging these estimators into the adaptive Benjamini-Hochberg (BH) procedure is shown to deliver FDR control both in the independent and conformal settings. In addition, these methods satisfy an optimal power property over any (regular) alternative distribution. The excellent behaviors of the new adaptive procedures are illustrated with numerical experiments both in the independent and conformal models for various distribution structures.
2026-03-18T17:46:33Z
52 pages, 9 figures, 2 tables
Gao Zijun, Roquain Etienne

http://arxiv.org/abs/2603.17925v1
Multi-Armed Sequential Hypothesis Testing by Betting
2026-03-18T17:01:34Z
We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g.
all dosages of a treatment are ineffective) and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of the ones that have oracle knowledge about which arm generates the most evidence against $\mathscr{P}$. Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently "estimable" rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.
2026-03-18T17:01:34Z
Ricardo J. Sandoval, Ian Waudby-Smith, Michael I. Jordan

http://arxiv.org/abs/2504.03466v3
Identifiability of VAR(1) model in a stationary setting
2026-03-18T16:44:12Z
We consider a classical First-order Vector AutoRegressive (VAR(1)) model, where we interpret the autoregressive interaction matrix as influence relationships among the components of the VAR(1) process that can be encoded by a weighted directed graph. A majority of previous work studies the structural identifiability of the graph based on time series observations and therefore relies on dynamical information. In this work we assume that an equilibrium exists, and study instead the identifiability of the graph from the stationary distribution, meaning that we seek a way to reconstruct the influence graph underlying the dynamic network using only static information.
We use an approach from algebraic statistics that characterizes models using the Jacobian matroids associated with the parametrization of the models, and we introduce sufficient graphical conditions under which different graphs yield distinct steady-state distributions. Additionally, we illustrate how our results could be applied to characterize networks inspired by ecological research.
2025-04-04T14:17:45Z
Bixuan Liu

http://arxiv.org/abs/2302.02415v3
On Separability of Covariance in Multiway Data Analysis
2026-03-18T14:18:26Z
Multiway data analysis aims to uncover patterns in data structured as multi-indexed arrays, with multiway covariance playing a crucial role in many applications. However, the high dimensionality of multiway covariance presents significant computational challenges. To overcome these challenges, factorized covariance models have been proposed that rely on a separability assumption: the multiway covariance can be accurately expressed as a sum of Kronecker products of mode-wise covariances. This paper addresses the representability, certification, and approximation of such separable models, leaving statistical estimation or finite-sample properties aside. We reduce the question of whether a given covariance can be decomposed into a separable multiway form to an equivalent question about the separability of quantum states. Leveraging results from quantum information theory, we show that generic multiway covariances are typically \emph{not} separable and that determining the best separable approximation is NP-hard. These findings suggest that factorized covariance models can be overly restrictive and difficult to fit without additional structural assumptions. Nevertheless, our numerical experiments indicate that standard iterative algorithms, namely Frank-Wolfe and gradient descent, often converge close to the best separable approximation.
As NP-hardness concerns worst-case computational complexity, Kronecker-separable approximations to multiway covariance could still be tractable to apply for analyzing many real-world datasets.
2023-02-05T15:54:13Z
45 pages, 8 figures, 3 tables
Dogyoon Song, Alfred O. Hero

http://arxiv.org/abs/2603.17599v1
Prediction with Missing Data: Target Probabilities and Missingness Mechanisms
2026-03-18T11:13:41Z
Conditions ensuring optimal parameter estimation in the presence of missing data are well established in inference, typically relying on the Missing-at-Random (MAR) assumption. In prediction, similar principles are often assumed to apply. However, methods considered biased in inference, such as pattern sub-modelling or unconditional imputation, have been shown to achieve optimal predictive performance under any missingness mechanism, including non-MAR (MNAR). To explain this apparent contradiction, we introduce a new formal framework for describing missingness in prediction. Central to this framework is a distinction between two prediction targets, defined according to whether or not the indicator of observation of the predictors is exploited to predict the outcome. This distinction leads to a classification of the missingness mechanisms describing the conditions under which these targets are equal, and when consistent prediction of each is achievable. A key result is that both targets may be consistently predicted under conditions weaker than MAR. We discuss the implications of this paradigm for handling missing data in prediction, distinguishing between missingness at development, validation and deployment of a forecaster.
The findings are illustrated using simulated data and a real-world application with the prediction of significant injury after trauma upon arrival at the emergency department.
2026-03-18T11:13:41Z
55 pages (including 40 pages for the main article and 15 pages for the supplementary material)
Pierre Catoire, Robin Genuer, Cecile Proust-Lima

http://arxiv.org/abs/2211.15168v2
Most probable paths for developed processes
2026-03-18T10:23:48Z
Optimal paths for the classical Onsager-Machlup function determining most probable paths between points on a manifold are only explicitly identified for specific processes, for example the Riemannian Brownian motion. This leaves out large classes of manifold-valued processes such as processes with parallel transported non-trivial diffusion matrix, processes with rank-deficient generator and sub-Riemannian processes, and push-forwards to quotient spaces. In this paper, we construct a general approach to the definition and identification of most probable paths by measuring the Onsager-Machlup function on the anti-development of such processes. The construction encompasses large classes of manifold-valued processes and results in explicit equation systems for the paths that we denote \emph{development most probable paths}. We define and derive these results and apply them to several cases of stochastic processes on Lie groups, homogeneous spaces, and landmark spaces appearing in shape analysis.
2022-11-28T09:24:05Z
Erlend Grong, Stefan Sommer

http://arxiv.org/abs/2411.12127v5
Fine-Grained Uncertainty Quantification via Collisions
2026-03-18T04:37:50Z
We propose a new and intuitive metric for aleatoric uncertainty quantification (UQ), the prevalence of class collisions defined as the same input being observed in different classes. We use the rate of class collisions to define the collision matrix, a novel and uniquely fine-grained measure of uncertainty.
For a classification problem involving $K$ classes, the $K\times K$ collision matrix $S$ measures the inherent difficulty in distinguishing between each pair of classes. We discuss several applications of the collision matrix, establish its fundamental mathematical properties, and show its relationship with existing UQ methods, including the Bayes error rate (BER). We also address the new problem of estimating the collision matrix using one-hot labeled data by proposing a series of innovative techniques to estimate $S$. First, we learn a pair-wise contrastive model which accepts two inputs and determines if they belong to the same class. We then show that this contrastive model (which is PAC learnable) can be used to estimate the row Gramian matrix of $S$, defined as $G=SS^T$. Finally, we show that under reasonable assumptions, $G$ can be used to uniquely recover $S$, a new result on non-negative matrices which could be of independent interest. With a method to estimate $S$ established, we demonstrate how this estimate of $S$, in conjunction with the contrastive model, can be used to estimate the posterior class probability distribution of any point. Experimental results are also presented to validate our methods of estimating the collision matrix and class posterior distributions on several datasets.
2024-11-18T23:41:27Z
Jesse Friedbaum, Sudarshan Adiga, Ravi Tandon

http://arxiv.org/abs/2603.17327v1
Empirical Likelihood Inference for Sen and Sen--Shorrocks--Thon Indices
2026-03-18T03:44:37Z
The Sen index and Sen-Shorrocks-Thon (SST) index are widely used poverty indices. Developing reliable inference for these measures enables us to compare these measures in different populations of interest in an effective way. It is important to construct confidence intervals for the Sen index and SST index, which provide better coverage probability and shorter interval length.
Motivated by this, we discuss empirical likelihood (EL) and jackknife empirical likelihood (JEL) based inference for the Sen index. To derive a JEL-based confidence interval for the Sen and SST indices, we propose a new estimator for the Sen index using the theory of U-statistics and examine its properties. The large sample properties of the EL and JEL ratio statistics are studied. We also discuss EL and JEL-based inference for the Sen-Shorrocks-Thon (SST) index. The finite sample performance of the EL and JEL-based confidence intervals of both Sen and SST indices is evaluated through a Monte Carlo simulation study. Finally, we illustrate our methods using individual-level data from the Panel Study of Income Dynamics (PSID) survey from the US as well as Indian household level income data for different states sourced from the Consumer Pyramids Household Survey (CPHS).
2026-03-18T03:44:37Z
Sreelakshmi N, Saparya Suresh, Sudheesh K. Kattumannil

http://arxiv.org/abs/2603.17291v1
On the structure of marginals in high dimensions
2026-03-18T02:40:42Z
Let $G, G_1,\dots,G_N$ be independent copies of a standard gaussian random vector in $\mathbb{R}^d$ and denote by $Γ = \sum_{i=1}^N \langle G_i,\cdot\rangle e_i$ the standard gaussian ensemble. We show that, for any set $A\subset S^{d-1}$, with exponentially high probability, \[ \sup_{x\in A} \frac{1}{N}\sum_{i=1}^N \big| (Γx)^\sharp_i - q_i\big| \le c \frac{ \mathbb{E} \sup_{x\in A} \langle G,x\rangle + \log^2N }{\sqrt N }. \] Here each $q_i$ is the $\frac{i}{N+1}$-quantile of the standard normal distribution and $(Γx)^\sharp$ denotes the monotone increasing rearrangement of the vector $Γx$. The estimate is sharp up to a possible logarithmic factor and significantly extends previously known bounds.
Moreover, we show that similar estimates hold in much greater generality: after replacing the gaussian quantiles by the appropriate ones, the same phenomenon persists for a broad class of random vectors.
2026-03-18T02:40:42Z
Daniel Bartl, Shahar Mendelson

http://arxiv.org/abs/2603.13681v2
Generalized projection tests for function-valued parameters with applications to testing structural causal assumptions
2026-03-17T21:47:01Z
Structural assumptions are central to the causal inference literature. In practice, it is often crucial to assess their validity or to test implications that follow from them. In many settings, such tests can be framed as evaluating whether a function-valued parameter equals zero. In this paper, we propose a class of generalized projection tests based on series estimators for function-valued parameters. We establish conditions under which the proposed tests are valid and illustrate their applicability through examples from the data fusion and instrumental variables literature. Our approach accommodates flexible machine learning methods for estimating nuisance parameters. In contrast to many existing approaches, the limiting distribution of the proposed test statistics is straightforward to compute under the null hypothesis. We apply our method to test the equality of conditional COVID-19 risk across vaccine arms in the COVID-19 Variant Immunologic Landscape (COVAIL) trial.
2026-03-14T01:35:52Z
Rui Wang, Albert Osom, Bo Zhang

http://arxiv.org/abs/2603.17160v1
Self-Regularized Learning Methods
2026-03-17T21:45:50Z
We introduce a general framework for analyzing learning algorithms based on the notion of self-regularization, which captures implicit complexity control without requiring explicit regularization. This is motivated by previous observations that many algorithms, such as gradient-descent based learning, exhibit implicit regularization.
In a nutshell, for a self-regularized algorithm the complexity of the predictor is inherently controlled by that of the simplest comparator achieving the same empirical risk. This framework is sufficiently rich to cover both classical regularized empirical risk minimization and gradient descent. Building on self-regularization, we provide a thorough statistical analysis of such algorithms including minmax-optimal rates, where it suffices to show that the algorithm is self-regularized -- all further requirements stem from the learning problem itself. Finally, we discuss the problem of data-dependent hyperparameter selection, providing a general result which yields minmax-optimal rates up to a double logarithmic factor and covers data-driven early stopping for RKHS-based gradient descent.
2026-03-17T21:45:50Z
Max Schölpple, Liu Fanghui, Ingo Steinwart