https://arxiv.org/api/7JrSCsf+k1f/VN5xi3T1IArzWrY
2026-06-18T12:15:37Z
3629681015

http://arxiv.org/abs/2605.25734v1
Stein-Encoder: A White-Box Supervised Encoder via Stein Identities in Multi-Modal Studies
2026-05-25T11:43:09Z
In multi-modal biomedical research, integrating high-dimensional genomic data with clinical baselines is essential for precision medicine. However, standard deep neural network approaches often entangle these modalities, obscuring the specific predictive impact of genetic features and leading to possibly suboptimal predictive performance. Motivated by the landmark METABRIC cohort primary breast tumors study, we propose the Stein-Encoder, a white-box supervised framework designed to isolate the genetic signal driving clinical outcomes conditional on nuisance covariates. By leveraging Stein's method and residualization techniques, our approach constructs an interpretable single index that summarizes relevant biological heterogeneity while flexibly incorporating clinical factors and can be used to improve downstream prediction. We establish theoretical guarantees for identification, consistency and efficiency improvement. Applied to the METABRIC cohort, the Stein-Encoder outperforms unsupervised benchmarks in predictive accuracy. Crucially, it achieves structural disentanglement by revealing response-specific biological mechanisms: we find that tumor size is driven primarily by mitotic networks, whereas prognostic indices rely on a distinct proliferation-versus-immune axis.
This work contributes a unified, computationally efficient framework that bridges statistical rigor with the representational power of neural networks, enabling interpretable, task-specific and efficient compression of multi-modal health data for a wide range of precision medicine applications, beyond biomarker discovery.
2026-05-25T11:43:09Z
Jiarui Zhang, Shuoxun Xu, Jiasheng Shi, Xinzhou Guo

http://arxiv.org/abs/2404.14328v2
Preserving linear invariants in ensemble filtering methods
2026-05-25T10:23:15Z
Data assimilation combines dynamical models with observations to improve state estimates. Ensemble filters sequentially assimilate observations by updating a set of samples over time, alternating between a forecast and an analysis step. Accurate and robust predictions often require preserving critical invariants such as mass, stoichiometric balance of chemical species, and electrical charge. While modern numerical solvers maintain these invariants, existing invariant-preserving analysis steps are limited to Gaussian settings. Furthermore, they can be incompatible with regularization techniques such as inflation and covariance tapering. In this work, we focus on preserving linear invariants in non-Gaussian filtering problems. Leveraging tools from measure transport theory, we introduce a novel class of nonlinear ensemble filters that preserve any desired linear invariants. Notably, we recover a constrained formulation of the Kalman filter for the special case of the Gaussian setting. We also demonstrate how to combine preserving invariants with regularization techniques in the ensemble Kalman filter.
Numerical experiments illustrate the benefits of preserving linear invariants in both ensemble Kalman filters and transport-based nonlinear ensemble filters.
2024-04-22T16:39:32Z
25 pages
Journal of Computational Physics (2026)
Mathieu Le Provost, Jan Glaubitz, Youssef Marzouk
10.1016/j.jcp.2026.115048

http://arxiv.org/abs/2605.25496v1
Estimation of Directed Acyclic Graphs by Frequentist Model Averaging
2026-05-25T06:57:18Z
Directed acyclic graphs provide a fundamental tool for representing directed dependence structures in multivariate network data, and are widely used to model financial and economic networks. However, accurate and interpretable estimation remains challenging under graph structural uncertainty. We propose an optimal model averaging method for directed acyclic Gaussian graphs. With a set of candidate models varying by graph structures, we average estimates from candidate models using weights that minimize a penalized negative log-likelihood criterion. In contrast to existing approaches, we not only establish the asymptotic optimality, weight consistency, and parameter consistency of the proposed method, but also explicitly characterize how different candidate models affect the convergence rate. Moreover, we prove parameter consistency even when all candidate graph models are misspecified. Results from simulation studies and a real-data analysis on the banks' international liability data show the promise of the proposed method.
2026-05-25T06:57:18Z
33 pages, 5 figures
Huihang Liu, Wenhui Li, Xinyu Zhang

http://arxiv.org/abs/2605.25478v1
Transcripts and Algebraic Distances in Time Series: Stochastic Properties and Nonparametric Dependence Tests
2026-05-25T06:31:04Z
The use of ordinal patterns (OPs) for analyzing the dependence structure of univariate and continuously distributed processes has gained popularity in recent years. This research goes one step further and considers the transcripts being computed from successive OPs in the time series.
Transcripts constitute a kind of ``difference'' between successive OPs and thus naturally relate to two algebraic distances between OPs, the Cayley and Kendall edit distances. The original time series is transformed into a sequence of transcripts or distances, respectively, and important stochastic properties thereof are derived. It is shown that these properties differ substantially among different types of original processes. This motivates the development of various statistics based on transcripts and edit distances in order to investigate the dependence structure of the original process. In particular, the asymptotic distribution of these statistics under the null hypothesis of serial independence is derived, which is then used to implement nonparametric tests for serial dependence. A simulation study shows that these novel dependence tests have appealing power properties, often outperforming former OP-based dependence tests. A concluding real-world data example illustrates the application and interpretation of the proposed approaches in practice.
2026-05-25T06:31:04Z
Christian H. Weiß, José M. Amigó

http://arxiv.org/abs/2606.07556v1
Selecting New Measurement Locations to Diversify Traffic-Pattern Coverage: A Real-World Evaluation for Total Traffic Volume Estimation
2026-05-25T06:14:39Z
Accurate measurement of traffic volumes and flows is vital for modern intelligent transportation. However, despite recent technological advances in sensor devices, it is still expensive to install and maintain fixed traffic counters. Therefore, it is restricted to a small portion of location points where the counters can be installed, which severely limits the possibility of grasping and predicting the total traffic volume at a city-wide level. By contrast, devices with location history such as smartphones and connected vehicles are now widely used and provide much wider spatial coverage.
However, the data from these devices are usually partial and noisy, so they are not enough to directly estimate total traffic volumes and flows. In this paper, we use the information from these widely available devices to help decide where to place additional traffic counters, and we study how selecting new measurement locations can improve city-wide traffic estimation performance. To achieve this, we propose an algorithm that chooses additional counter locations to increase the diversity of observed traffic signal patterns, rather than simply spreading counters evenly over space. The goal is to capture traffic-pattern types that are rare in the current counter set and to make the collected observations more representative for later estimation and forecasting. We also present a real-world evaluation; in a target city, we select new locations expected to improve traffic prediction, and we then commissioned new field measurements at those locations at our expense. The resulting data led to an improvement in traffic volume estimation accuracy across different fidelities.
2026-05-25T06:14:39Z
12 pages, 7 figures
Masaaki Inoue, Akifumi Okuno, Shintaro Fukushima

http://arxiv.org/abs/2605.25452v1
Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks
2026-05-25T06:02:31Z
Graph Neural Networks (GNN) are currently the most popular approach for learning and prediction on graph-structured data and are deployed in various fields, from social network analysis to drug discovery. However, there is limited mathematical understanding of the performance of GNNs. We discuss the various perspectives used to study statistical generalisation in GNNs. We identify three broad frameworks. The first approach, rooted in learning theory, relies on uniform convergence bounds and the complexity of the hypothesis class of specific GNN architectures. This approach also builds on the expressivity of GNNs, typically studied through the lens of graph isomorphism tests.
The second principle is to simplify the neural architecture by analysing GNNs under the asymptotics of infinitely many parameters or infinite graph size. This approach approximates GNNs using Gaussian processes, neural tangent kernels or graphon neural network operators, which allow studying the generalisation or stability of trained GNNs. The third framework studies GNNs under random graph models, often the contextual stochastic block model, and derives non-asymptotic error rates using tools from high-dimensional statistics. We highlight some key theoretical results and discuss a few limitations and open research questions for each perspective.
2026-05-25T06:02:31Z
15 pages, 4 figures, submission for Special Issue in AStA Advances in Statistical Analysis
Nil Ayday, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar

http://arxiv.org/abs/2605.25380v1
Rank-Based Tests for Mutual Independence of High-Dimensional Random Vectors via $L_q$ Norm
2026-05-25T03:09:44Z
We consider the problem of testing mutual independence among the components of a high-dimensional random vector. Building on the rank-based max-sum framework, we introduce fixed finite-$L_q$ power-sum statistics under three general classes of rank-based correlations: simple linear rank statistics, non-degenerate rank-based U-statistics and degenerate rank-based U-statistics. The proposed statistics interpolate between the dense-alternative sensitivity of the $L_2$ statistic and the sparse-alternative sensitivity of the $L_\infty$ statistic. We establish the asymptotic independence between any fixed finite-$L_q$ block and the corresponding $L_\infty$ statistic, and combine $L_2,L_4,L_6$ and $L_\infty$ p-values through a Cauchy rule.
Numerical studies show that the resulting $L_{2,4,6,\infty}$ procedure is highly robust to the sparsity of the alternative and has strong empirical power across the considered designs.
2026-05-25T03:09:44Z
Ping Zhao, Hongfei Wang, Long Feng

http://arxiv.org/abs/2605.05076v2
High-Dimensional Statistics: Reflections on Progress and Open Problems
2026-05-24T22:21:17Z
Over the past two decades, the field of high-dimensional statistics has experienced substantial progress, driven largely by technological advances that have dramatically reduced the cost and effort for data collection and storage across a broad range of domains, including biology, medicine, astronomy, and the social and environmental sciences. Modern datasets are increasingly complex, often exhibiting rich dependency, heterogeneity, and other features that challenge traditional statistical methods. In response, high-dimensional statistics has evolved to address more sophisticated estimation and inference problems. This evolution has, in turn, fostered deep connections with and contributions to a wide range of research areas, including optimization, concentration of measure, random matrix theory, information theory, and theoretical computer science. Given the rapid pace of recent developments in high-dimensional statistics, our goal is to synthesize representative advances, highlight common themes and open problems, and point to important works that offer entry points into the field.
2026-05-06T16:11:09Z
Arian Maleki, Subhabrata Sen, Sivaraman Balakrishnan, Verena Zuber, Chao Gao, Rishabh Dudeja, Christos Thrampoulidis, Anru Zhang, Weijie Su, Jason M. Klusowski, Po-Ling Loh, Ali Shojaie

http://arxiv.org/abs/2605.04536v2
Transversality and Geometric Regularisation in Distributional Statistical Models
2026-05-24T19:16:59Z
The distributional statistical framework replaces classical probability densities by distribution-kernel pairs $(T, \varphi)$, where $T$ is a tempered distribution and $\varphi$ is a rapidly decaying kernel.
We develop the thesis that the kernel acts as a geometric regulariser, placing parametric statistical models in generic (transversal) position relative to degeneracy loci encoding non-identifiability, singular information, moment indeterminacy, and representation failure.
Using the transversality theorems of Whitney, Thom, and Mather, we prove a finite-dimensional weak transversality theorem: for a generic kernel in any sufficiently rich family, the kernel-induced feature map avoids degeneracy strata of sufficiently high codimension. We establish verifiable conditions -- formulated as rank conditions on the Jacobian of the joint feature map -- under which the transversality hypothesis can be checked, and verify them for location families, the log-normal, Stein discrepancies, and graphical models.
The present results apply to parametric models; extensions to semiparametric and nonparametric settings are discussed. The degeneracy classification includes representation degeneracy (Type 0) for models without closed-form densities and higher-order instabilities (Type IV) in non-chordal graphical models. Identifiability, robustness, moment determinacy, Fisher information regularity, Stein discrepancy, inferential separation, and the Behrens-Fisher problem all admit a unified geometric interpretation as transversality conditions on the feature map. This paper serves as a geometric companion to a series of papers developing the distributional framework.
2026-05-06T06:24:41Z
22 pages, no figures no tables. In the second version some sketches were replaced by proofs, an example of M-determinancy was added
R. Labouriau

http://arxiv.org/abs/2509.25507v2
One-shot Conditional Sampling: MMD meets Nearest Neighbors
2026-05-24T17:48:33Z
How can we generate samples from a conditional distribution that we never fully observe? This question arises across a broad range of applications in both modern machine learning and classical statistics, including image post-processing in computer vision, approximate posterior sampling in simulation-based inference, and conditional distribution modeling in complex data settings. In such settings, compared with unconditional sampling, additional feature information can be leveraged to enable more adaptive and efficient sampling. Building on this, we introduce Conditional Generator using MMD (CGMMD), a novel framework for conditional sampling. Unlike many contemporary approaches, our method frames the training objective as a simple, adversary-free direct minimization problem. A key feature of CGMMD is its ability to produce conditional samples in a single forward pass of the generator, enabling practical one-shot sampling with low test-time complexity.
We establish rigorous theoretical bounds on the loss incurred when sampling from the CGMMD sampler, and prove convergence of the estimated distribution to the true conditional distribution. In the process, we also develop a uniform concentration result for nearest-neighbor based functionals, which may be of independent interest. Finally, we show that CGMMD performs competitively on synthetic tasks involving complex conditional densities, as well as on practical applications such as image denoising and image super-resolution.
2025-09-29T21:04:50Z
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Anirban Chatterjee, Sayantan Choudhury, Rohan Hore

http://arxiv.org/abs/2605.25169v1
Learning Treatment Effects during Resource Allocation via Priority-Queue Randomization
2026-05-24T17:01:17Z
Public service programs often allocate limited resources under uncertainty about their benefits, creating a need for randomization to support credible evaluation. In practice, however, applicants commonly enter waitlists where resources are prioritized toward individuals judged to have higher need through tiered priority queues, making direct randomization difficult. Motivated by this, we develop an experimental design framework for learning treatment effects while treating those most in need where incoming applicants are randomized into priority queues based on their assessed risk scores. Treatments are then provided across queues in priority order and first-in-first-out within queue as budget becomes available. Our contributions are two-fold. First, we characterize what causal effects are identified under this priority-queue allocation. When arrivals are exogenous, treatments are conditionally randomized, and hence standard estimands are identified; when arrivals are endogenous, queue randomization instead provides an instrument for treatment, identifying local treatment effects induced by the queuing process.
Second, we develop optimized queue-assignment designs that trade off statistical efficiency against prioritizing higher-need applicants. We show in the process that, despite dependence in treatment assignments induced by the design, usual iid efficiency bounds remain well-justified design objectives. We illustrate the proposed designs using data from a housing allocation program in a large U.S. county.
2026-05-24T17:01:17Z
JungHo Lee, Johnna Sundberg, Pim Welle, Bryan Wilder

http://arxiv.org/abs/2603.10941v2
Covariate-adjusted statistical dependence representation through partial copulas: bounds and new insights
2026-05-24T17:00:38Z
In this paper, we revisit the notion of partial copula, originally introduced to test conditional independence, highlighting its capability to represent the dependence between two random variables after removing their dependence with a covariate. Building upon results previously presented in the literature, we show that partial copulas can be seen as a nonlinear analogue of partial correlation. Then, we prove several results showing how dependence properties of the conditional copulas constrain the form of the partial copula. Finally, a simulation study is conducted to illustrate the results and to show the potential of the partial copula as a way to describe covariate-adjusted statistical dependence. This highlights the potential of the method to be used in causal inference problems and to recover the true sign of a causal effect.
2026-03-11T16:28:00Z
Vinícius Litvinoff Justus, Felipe Fontana Vieira

http://arxiv.org/abs/2505.01357v3
Weight-calibrated estimation for factor models of high-dimensional time series
2026-05-24T14:43:30Z
The factor modeling for high-dimensional time series is powerful in discovering latent common components for dimension reduction and information extraction.
Most available estimation methods can be divided into two categories: the covariance-based under asymptotically-identifiable assumption and the autocovariance-based with white idiosyncratic noise. This paper follows the autocovariance-based framework and develops a novel weight-calibrated method to improve the estimation performance. It adopts a linear projection to tackle high-dimensionality, and employs a reduced-rank autoregression formulation. The asymptotic theory of the proposed method is established, relaxing the assumption on white noise. Additionally, we make the first attempt in the literature by providing a systematic theoretical comparison among the covariance-based, the standard autocovariance-based, and our proposed weight-calibrated autocovariance-based methods in the presence of factors with different strengths. Extensive simulations are conducted to showcase the superior finite-sample performance of our proposed method, as well as to validate the newly established theory. The superiority of our proposal is further illustrated through the analysis of one financial and one macroeconomic data sets.
2025-05-02T15:52:42Z
This version is the accepted version by Journal of the American Statistical Association
Xinghao Qiao, Zihan Wang, Qiwei Yao, Bo Zhang

http://arxiv.org/abs/2605.19938v2
Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
2026-05-24T14:07:56Z
Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture.
The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug-in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug-in estimator is already the MLE, so PMM should not outperform it. A local Known-DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22--36% on asymmetric gamma and boundary-spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling-proxy experiment improves seven-lobes coverage but degrades the sine and swiss-roll proxies. The current evidence therefore supports an applicability-boundary result rather than a general MASEM improvement claim.
2026-05-19T14:57:39Z
16 pages, 5 figures, 3 tables. Code supplement: https://github.com/SZabolotnii/Ku-PMM-MASEM-code-supplement
Serhii Zabolotnii

http://arxiv.org/abs/2605.25079v1
Trans-dimensional Bayesian model averaging for $^{13}$C-based metabolic flux analysis: Evidence-based flux inference under structural model uncertainty
2026-05-24T13:42:07Z
Accurate quantification of intracellular metabolic fluxes is central to systems biology and biotechnology. Flux estimation relies on biochemical network models, with $^{13}$C metabolic flux analysis (MFA) being the state-of-the-art approach. However, isotope labeling data are often insufficient to uniquely support a single network formulation. In such cases, flux estimates become model-dependent, highlighting the need for methods that explicitly account for structural uncertainty.
Bayesian model averaging (BMA) provides a principled framework for this purpose, but its application to $^{13}$C-MFA has so far been restricted to uncertainty in reaction bidirectionality within fixed network topologies. We introduce a scalable Bayesian inference framework for $^{13}$C-MFA, Bayesian model set averaging, that applies BMA to encompass uncertainty in reactions and pathways. Our approach combines reversible jump Markov chain Monte Carlo for trans-dimensional exploration of model spaces with diffusive nested sampling for robust estimation of model evidences, enabling averaging over large families of metabolic network models. Using illustrative and application-scale synthetic case studies, we demonstrate that the method yields robust flux estimates, reveals when multiple network configurations are statistically indistinguishable, and recovers data-supported model structures. Importantly, rather than committing to a single model, the framework manages structural uncertainty: under limited data, competing models are retained, whereas increasing data informativeness improved model and flux recovery. The approach scales to billions of model variants, providing a practical foundation for uncertainty- and misspecification-aware quantitative flux inference in $^{13}$C-MFA.
2026-05-24T13:42:07Z
Johann F. Jadebeck, Anton Stratmann, Martin Beyß, Katharina Nöh