https://arxiv.org/api/RHXfXBZGxXIXqQTFhgMkbnvBduM 2026-06-09T23:29:56Z 36101 45 15 http://arxiv.org/abs/2606.08468v1 Nonparametric undirected graphical model selection using diffusion models 2026-06-07T06:22:13Z

Undirected graphical models provide a fundamental framework for representing conditional independence structures among high-dimensional random variables. While undirected graphical model selection has become a central problem in high-dimensional statistics, most existing methods are restricted to parametric settings. In this paper, we develop a nonparametric approach to undirected graphical model selection based on diffusion models. Recent work has shown that diffusion models can adapt to the unknown graph structure of the underlying distribution, yet utilizing these models for explicit graph estimation remains unexplored. To bridge this gap, we introduce a novel diffusion-based method for nonparametric undirected graphical model selection. We establish the model selection consistency of the proposed method and demonstrate its empirical performance through extensive simulations and two real data analyses.

2026-06-07T06:22:13Z Hyeok Kyu Kwon Myeonggu Kang Minwoo Chae Wanjie Wang http://arxiv.org/abs/2103.11066v6 Treatment Allocation under Uncertain Costs 2026-06-07T03:51:53Z

We consider the problem of learning how to optimally allocate treatments whose cost is uncertain and can vary with pre-treatment covariates. This setting may arise in medicine if we need to prioritize access to a scarce resource that different patients would use for different amounts of time, or in marketing if we want to target discounts whose cost to the company depends on how much the discounts are used. Here, we show that the optimal treatment allocation rule under budget constraints is a thresholding rule based on priority scores (those with a higher score are treated first), and we propose a number of practical methods for learning these priority scores using data from a randomized trial. Our formal results leverage a statistical connection between our problem and that of learning heterogeneous treatment effects under endogeneity using an instrumental variable. We find our method to perform well in a number of empirical evaluations.
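A thresholding rule under a budget can be sketched as follows: rank units by a priority score and treat them in decreasing order of score until the next unit would exceed the budget. The score used here (expected benefit divided by expected cost), the function name `allocate`, and all numbers are assumptions made for illustration. The abstract does not say how the priority scores are defined or learned.

```python
def allocate(units, budget):
    """Treat units in decreasing order of priority score until the budget runs out.

    units: list of (name, expected_benefit, expected_cost) tuples with positive
    costs (hypothetical inputs, not data from the paper).
    Returns the treated names and the total expected cost spent.
    """
    # Assumed priority score: expected benefit per unit of expected cost.
    ranked = sorted(units, key=lambda u: u[1] / u[2], reverse=True)
    treated, spent = [], 0.0
    for name, benefit, cost in ranked:
        if spent + cost > budget:
            break  # threshold reached: all lower-scored units stay untreated
        treated.append(name)
        spent += cost
    return treated, spent


units = [("a", 4.0, 2.0), ("b", 3.0, 1.0), ("c", 1.0, 2.0), ("d", 5.0, 5.0)]
treated, spent = allocate(units, budget=4.0)
print(treated, spent)
```

The loop stops at the first unit that does not fit rather than skipping ahead to cheaper units, so the treated set is always "everyone above a score cutoff", which is what makes it a thresholding rule.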

2021-03-20T00:36:28Z Georgy Kalashnov Evan Munro Hao Sun Shuyang Du Stefan Wager http://arxiv.org/abs/2606.08418v1 TS-Neyman: Posterior Sampling for Adaptive Stratified Estimation 2026-06-07T02:36:16Z

Many model evaluation tasks reduce to estimating an average loss, error rate, or subgroup metric on a stratified pool when each label, human rating, or simulator call is costly. The precision-optimal Neyman allocation depends on within-stratum variances, which must be learned from the same observations used for estimation. We formulate this as a sequential allocation problem and use the exact one-step marginal variance reduction as the priority index. Replacing the unknown variances by independent inverse-chi-squared posterior draws yields TS-Neyman, a Thompson-sampling rule that preserves the oracle marginal-gain structure while randomizing over variance uncertainty. For any fixed finite number of strata, we prove almost-sure convergence of the TS-Neyman allocation proportions to the Neyman target, asymptotic optimality of the variance proxy, and a central limit theorem for the resulting adaptive stratified estimator. In two five-stratum budget-scaling benchmarks, one bounded-loss benchmark and one binary model-error benchmark in the spirit of Dai et al. 2023, TS-Neyman's relative efficiency stays within 5 percent of the oracle on the bounded-loss population and within about 15 percent on the binary benchmark. In an additional CivilComments real-data replay with confidence-based strata, it stays within about 8 percent of the oracle and improves on equal allocation by roughly 7 to 14 percent in MSE across budgets, while plug-in greedy and two-stage plug-in can degrade by over an order of magnitude under sparse pilots. Common-pilot warm-start and prior-sensitivity studies show that this behavior is stable under working-model and working-prior misspecification.
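For reference, the oracle Neyman allocation that the abstract takes as its target assigns each stratum a share of the budget proportional to its size times its within-stratum standard deviation. The sketch below assumes the standard deviations are known, which is exactly what the sequential setting does not allow, so it illustrates only the target and not the TS-Neyman rule. Function names and numbers are hypothetical.

```python
def neyman_allocation(sizes, sds, budget):
    """Oracle Neyman allocation: n_h proportional to N_h * sigma_h."""
    weights = [n * s for n, s in zip(sizes, sds)]
    total = sum(weights)
    return [budget * w / total for w in weights]


def stratified_variance(sizes, sds, alloc):
    """Variance of the stratified mean, ignoring finite-population corrections."""
    pop = sum(sizes)
    return sum((n / pop) ** 2 * s ** 2 / a for n, s, a in zip(sizes, sds, alloc))


sizes = [100, 100, 200]   # stratum sizes N_h (made-up values)
sds = [1.0, 3.0, 2.0]     # within-stratum standard deviations (assumed known)
budget = 80               # total number of labels

neyman = neyman_allocation(sizes, sds, budget)
equal = [budget / len(sizes)] * len(sizes)
var_neyman = stratified_variance(sizes, sds, neyman)
var_equal = stratified_variance(sizes, sds, equal)
print(neyman, var_neyman, var_equal)
```

With these numbers the Neyman allocation gives a smaller variance than equal allocation because it shifts labels toward the larger and noisier strata.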

2026-06-07T02:36:16Z Kosuke Morikawa Mst Moushumi Pervin Jae Kwang Kim http://arxiv.org/abs/2606.08409v1 Matrix representations and distance metrics for unlabeled ranked phylogenetic networks 2026-06-07T02:17:03Z

Phylogenetic networks are graphs inferred from molecular sequence data that represent ancestral histories shaped by reticulate processes such as recombination, hybridization, and horizontal gene transfer. We introduce a family of distance metrics for rooted, ranked, unlabeled phylogenetic networks, extending a previously developed distance for ranked trees. Our approach relies on a bijective triangular matrix representation of phylogenetic networks that captures the temporal order of internal events, speciations, and hybridizations. Our metrics, defined as standard matrix norms, allow efficient quantitative comparisons of network topologies, timed networks and networks with differing numbers of hybridizations. Our distance can be used for both isochronous networks where all tips are sampled at one time point, and heterochronous networks where tips are allowed to be sampled at different time points. We show that our metrics capture biologically meaningful differences among evolutionary histories in both simulations and empirical posterior distributions of viral phylogenetic networks. These tools fill a methodological gap, enabling principled comparisons of ranked, unlabeled phylogenetic networks, including ancestral recombination graphs.

2026-06-07T02:17:03Z 25 pages, 11 figures. Submitted to the Proceedings of the National Academy of Sciences (PNAS) Jiayang Wang Julia A. Palacios Claudia Solís-Lemus http://arxiv.org/abs/2606.08407v1 Topological Effective Connectivity Modeling in Brain Networks 2026-06-07T02:11:09Z

Characterizing directed information flow in brain networks is difficult because neural circuits are full of recurrent feedback loops. Many existing tools for directed dependence assume a directed acyclic graph (DAG) structure to resolve directional ambiguity, and therefore cannot represent these loops. We present a nonparametric, information-theoretic framework that addresses this by coupling the discrete Hodge decomposition with lead-lag mutual information, splitting the resulting edge flow into three orthogonal components: a gradient term capturing hierarchical, feed-forward relationships; a curl term isolating triangle-level feedback loops; and a harmonic term capturing cyclic flow around topological holes. This separation makes it possible to disentangle feed-forward drive from recurrent circulation, which conventional measures conflate. We further develop a permutation-based hypothesis-testing layer that identifies nodes and triangular motifs whose information-flow signatures change significantly between conditions. We validate the framework on simulations with known ground-truth structure and apply it to local field potential recordings from a rodent model of focal ischemic stroke. In three of four animals, we find a post-stroke shift toward hierarchical, source-driven propagation at the expense of recurrent feedback, while the fourth shows no significant change.
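The gradient/curl split can be illustrated on the smallest possible case, a single filled triangle with nodes 0, 1, 2 and edges oriented 0->1, 1->2, 0->2. On this complex there is no topological hole, so the harmonic term is zero and any edge flow is the sum of a gradient part and a curl part. The edge-flow values below are made up; in the abstract they would come from lead-lag mutual information, which this sketch does not compute.

```python
def hodge_triangle(f01, f12, f02):
    """Split an edge flow on one filled triangle into gradient and curl parts.

    Edges are oriented 0->1, 1->2 and 0->2. Returns (gradient, curl), each a
    tuple of three edge values in that order.
    """
    # Net circulation around the loop 0 -> 1 -> 2 -> 0.
    circulation = f01 + f12 - f02
    # Orthogonal projection onto the triangle's boundary vector (1, 1, -1).
    c = circulation / 3.0
    curl = (c, c, -c)
    gradient = (f01 - c, f12 - c, f02 + c)
    return gradient, curl


# A purely cyclic flow: everything ends up in the curl part.
grad_cyc, curl_cyc = hodge_triangle(1.0, 1.0, -1.0)

# A feed-forward flow induced by node potentials (0, 1, 3): no curl.
grad_ff, curl_ff = hodge_triangle(1.0, 2.0, 3.0)

print(grad_cyc, curl_cyc)
print(grad_ff, curl_ff)
```

The two parts are orthogonal by construction, so the squared norm of the flow splits into a feed-forward share and a recurrent share.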

2026-06-07T02:11:09Z 45 pages, 15 figures Anass El-Yaagoubi Moo K. Chung Hernando Ombao http://arxiv.org/abs/2603.21161v2 An information criterion for detecting periodicities in functional time series 2026-06-07T00:39:43Z

We propose an information criterion for determining an unknown number of periodic components in functional time series. Identifying the number of frequencies in large-scale time series has been a central focus. To achieve this goal, we suggest an iterative procedure, utilizing the residual process obtained through least squares fitting. This iterative approach demonstrates broad applicability. We establish the consistency of the estimated number of periodic components by minimizing the information criterion. The efficacy of the procedure is illustrated through numerical simulations. In real data analysis, we apply this information criterion to temperature data and sunspot data.

2026-03-22T10:28:54Z Computational Statistics & Data Analysis (2026) 108430 Rinka Sagawa Yan Liu Valentin Patilea 10.1016/j.csda.2026.108430 http://arxiv.org/abs/2306.06756v3 Semi-Parametric Inference for Doubly Stochastic Spatial Point Processes: An Approximate Penalized Poisson Likelihood Approach 2026-06-06T23:12:19Z

Doubly-stochastic point processes model the occurrence of events over a spatial domain as an inhomogeneous Poisson process conditioned on the realization of a random intensity function. They are flexible tools for capturing spatial heterogeneity and correlation. However, existing implementations of doubly-stochastic spatial models are computationally demanding, often have limited theoretical guarantees, and/or rely on restrictive assumptions. We propose a penalized regression method for estimating covariate effects in doubly-stochastic point processes that is computationally efficient and does not require a parametric form or stationarity of the underlying intensity. Our approach is based on an approximate (discrete and deterministic) formulation of the true (continuous and stochastic) intensity function. We show that consistency and asymptotic normality of the covariate effect estimates can be achieved despite the model misspecification, and develop a covariance estimator that leads to a valid, albeit conservative, statistical inference procedure. A simulation study shows the validity of our approach under less restrictive assumptions on the data generating mechanism, and an application to Seattle crime data demonstrates better prediction accuracy compared with existing alternatives.

2023-06-11T19:48:39Z Si Cheng Jon Wakefield Ali Shojaie http://arxiv.org/abs/2606.08322v1 Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA 2026-06-06T20:20:55Z

To characterize the US airline profit cycles from 1995 to 2020, Renold et al. (2023) combine k-means clustering, principal component analysis, and system dynamics modelling. Using the dataset they helpfully included in their paper, we replicate their clustering experiment in three spaces -- the original 7-dimensional raw-variable space, a 3-dimensional PC score space, and a 4-dimensional PC score space. We show that the six-cluster taxonomy is geometrically robust: k-means in 3-PC space produces cluster assignments that are bit-for-bit identical to those in 7D raw space. As a nonlinearity check we apply kernel PCA under six kernels spanning three families plus a linear baseline. All six kernels preserve the six-cluster assignment in 2D. A 1D diagnostic tightens this: the linear kernel conflates the COVID year C_3 with the peak-profit cluster C_0, whereas all five non-baseline kernels shift C_3 to overlap only the post-financial-crisis cluster C_5. Agreement across the kernel families confirms an intrinsically linear manifold with no hidden curvature. The silhouette criterion reveals that the dataset structurally supports only three clusters, not six. Collinearity in the raw 7D space suppresses the silhouette signal that would otherwise identify k=3 as the structurally motivated choice.

2026-06-06T20:20:55Z Andreas Schlapbach http://arxiv.org/abs/2606.05450v2 Eigenvector Spatial Filters Nuclear Norm Matrix Completion with Application to Air Quality Data 2026-06-06T19:32:05Z

Reliable reconstruction of missing observations in environmental panel datasets is essential for accurate exposure assessment and policy analysis. Traditional nuclear norm matrix completion methods effectively impute missing entries in low-rank matrices, yet often overlook the spatial dependence inherent to air quality processes. This paper introduces the Eigenvector Spatial Filters Nuclear Norm Matrix Completion (ESFNNMC) method, an extension of nuclear norm fixed-effects matrix completion that replaces unit-specific intercepts with a set of Moran-type eigenvectors capturing spatial autocorrelation in the data. To estimate the model, we propose a Block-Coordinate Descent (BCD) approach for multiconvex optimization problems, with soft-thresholded singular value decomposition and cross-validated regularization. Through comprehensive simulations varying missingness patterns, the level of spatial and temporal autocorrelation, and dimension, shape, and rank structure of the matrices, ESFNNMC demonstrates substantial improvements in imputation accuracy over the standard fixed-effects approach, while keeping the computational cost approximately unchanged. The method is applied to impute missing entries in daily PM10 measurements in 64 monitoring stations in Lombardy, Italy, during the year 2021.

2026-06-03T21:11:18Z 29 pages, 5 figures, 14 tables, draft version (do not cite yet) Rodolfo Metulini http://arxiv.org/abs/2605.14610v3 Parametrically Adaptive Transition Polynomial: a Signed-Parity Continuous-alpha Extension of Kunchenko Stochastic Polynomials 2026-06-06T18:36:54Z

Kunchenko's method of polynomial maximization provides a semiparametric apparatus for parameter estimation under non-Gaussian errors, but its classical power basis relies on finite higher-order integer moments. This paper introduces the Parametrically Adaptive Transition Polynomial (PATP), a signed-parity fractional-power family controlled by a continuous parameter alpha in [0,1]. The quadratic exponent map p_i(alpha) connects the fractal regime p_i(0)=1/i, the degenerate linear point p_i(1/2)=1, and the signed-parity integer-power regime p_i(1)=i. For the degree-S=2 case we derive a closed-form variance-reduction coefficient g_2(alpha) in terms of signed and absolute fractional moments, identify the singular behavior at alpha=1/2, and state the moment and regularity conditions under which the formula is meaningful. The construction should be read as a Form-B PATP analogue within Kunchenko's generalized apparatus, not as an exact recovery of the canonical even-power PMM basis at alpha=1. Numerical illustrations on canonical distributions are used to examine the finite-sample behavior of the signed-parity estimator and to mark the boundary of applicability for extremely heavy-tailed cases such as Cauchy.
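The three anchor values quoted in the abstract, p_i(0)=1/i, p_i(1/2)=1 and p_i(1)=i, are enough to pin down the exponent map if "quadratic" is read as "a degree-2 polynomial in alpha", since a quadratic through three points is unique. The sketch below builds that polynomial by Lagrange interpolation. The reading of "quadratic" and the function name `exponent` are assumptions; the paper's own parametrization is not given in the abstract and may be written differently.

```python
from fractions import Fraction


def exponent(i, alpha):
    """Quadratic in alpha through (0, 1/i), (1/2, 1) and (1, i).

    Built by Lagrange interpolation. Exact rational arithmetic is used so the
    anchor values are reproduced without rounding error.
    """
    a = Fraction(alpha)
    half = Fraction(1, 2)
    basis_0 = 2 * (a - half) * (a - 1)   # equals 1 at alpha=0, 0 at 1/2 and 1
    basis_half = -4 * a * (a - 1)        # equals 1 at alpha=1/2, 0 at 0 and 1
    basis_1 = 2 * a * (a - half)         # equals 1 at alpha=1, 0 at 0 and 1/2
    return Fraction(1, i) * basis_0 + basis_half + i * basis_1


for alpha in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)):
    print(alpha, exponent(3, alpha))
```

For i=1 the three anchors coincide at 1, so the map is constant, which is consistent with the first-order term keeping exponent 1 for every alpha under this reading.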

2026-05-14T09:26:53Z 35 pages, 8 figures. Code supplement: https://github.com/SZabolotnii/Ku-PATP-code-supplement Serhii Zabolotnii http://arxiv.org/abs/2606.08289v1 Direct domain estimation via regression-tree-assisted estimators in the production of official statistics 2026-06-06T18:20:26Z

National statistical offices (NSOs) produce their estimates under a single weighting system (uni-weight approach): one set of weights, independent of the variable of interest, is used to estimate multiple parameters and multiple subpopulations (domains). In this paper we study, within the family of model-assisted estimators and from a design-based perspective of direct estimation, the use of regression trees as the assisting model for estimating totals in unplanned domains. We distinguish two strategies: (i) fitting a single tree at the population level and deriving from it a single set of weights applicable to any domain, and (ii) fitting a domain-specific tree. We show that both estimators can be written as weighted sums with weights that do not depend on $y$, preserving the uni-weight property and additivity benchmarking with respect to the population total. Extending the classical result to trees, we argue that the estimator built from a population-level model tends to behave like the Horvitz-Thompson estimator within domains, whereas the domain-specific model can achieve substantial variance reductions. A simulation study based on microdata from the Uruguayan Continuous Household Survey (ECH) illustrates the behavior of the estimators at the population level and by department.
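As a point of reference for the baseline named in the abstract, the Horvitz-Thompson estimator of a domain total weights each sampled unit in the domain by the inverse of its inclusion probability. The sketch below shows only this baseline, not the regression-tree-assisted estimators, and the sample values, domain labels and function name are made up for illustration.

```python
def ht_domain_total(sample, domain=None):
    """Horvitz-Thompson estimate of a total, optionally restricted to a domain.

    sample: list of (y, inclusion_probability, domain_label) tuples
    (hypothetical data). The weight 1/pi does not depend on y.
    """
    return sum(y / pi for y, pi, d in sample if domain is None or d == domain)


sample = [
    (10.0, 0.5, "north"),
    (20.0, 0.25, "north"),
    (30.0, 0.5, "south"),
]

north = ht_domain_total(sample, "north")
south = ht_domain_total(sample, "south")
overall = ht_domain_total(sample)
print(north, south, overall)
```

Because the same weights are used for every domain, the domain estimates add up to the population-level estimate, which is the additivity property the abstract refers to.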

2026-06-06T18:20:26Z Juan Pablo Ferreira http://arxiv.org/abs/2606.08261v1 Sparse Longitudinal Functional Principal Component Analysis for Episodic Ambulatory Behavioral Assessments 2026-06-06T17:16:37Z

Accurately monitoring mental fatigue is critical for improving workplace safety and productivity. A recent study examined unobtrusively collected smartphone typing speed as a potential ambulatory proxy assessment of mental fatigue using data from the Intern Health Study (IHS). While population-level average typing speed patterns were found to be consistent with validated measures of mental fatigue, how these trajectories vary across participants and days may inform opportune moments for just-in-time interventions and remains an open question. Treating typing speed trajectories as sparsely observed functional data, we propose a novel sparse longitudinal functional principal component analysis (sparse LFPCA) method for decomposing variability and predicting individual curves. Specifically, sparse data are accommodated by casting covariance estimation as a structured penalized spline regression problem, enabling simultaneous estimation and smoothing of multiple covariance components while borrowing information across locations in the functional domain. Simulations show that sparse LFPCA (1) accurately estimates eigenfunctions and generates reasonable predictions for underlying curves, and (2) achieves similar or superior performance compared to existing alternatives. Our analysis of typing speed data collected from IHS reveals new and interpretable participant- and day-level patterns not captured by previous analyses and can be used to tailor behavioral interventions.

2026-06-06T17:16:37Z Nidhi Pai Yu Fang Srijan Sen Zhenke Wu Erjia Cui http://arxiv.org/abs/2501.04615v4 Doubly Robust and Efficient Calibration of Prediction Sets for Right-Censored Time-to-Event Outcomes 2026-06-06T17:11:41Z

Our objective is to construct well-calibrated prediction sets for a time-to-event outcome subject to right-censoring with guaranteed coverage. Inspired by modern conformal inference, our approach avoids the need for a well-specified parametric or semiparametric survival model. Unlike existing conformal methods for survival data, which assume Type-I censoring with fully observed censoring times, we consider the more common right-censoring setting in which only the censoring time or only the event time is observed, whichever comes first. Under a standard conditional independence censoring condition, we propose and analyze several lower prediction bounds for the survival time of a future observation, including inverse-probability-of-censoring weighting, and its augmented version based on the semiparametric efficient influence function for the relevant marginal quantile of the outcome accounting for dependent censoring. We formally establish asymptotic coverage guarantees of the proposed methods, and demonstrate both theoretically and through empirical experiments, that the augmented approach substantially improves efficiency over all other proposed methods. Specifically, its coverage error bound is doubly robust, and therefore of second order, thus ensuring that it is asymptotically negligible relative to the coverage error of the other methods.

2025-01-08T16:57:18Z 48 pages, 11 figures Rebecca Farina Eric J. Tchetgen Tchetgen Arun Kumar Kuchibhotla http://arxiv.org/abs/2410.20885v3 A Distributed Lag Approach to the Generalised Dynamic Factor Model 2026-06-06T16:43:28Z

We propose a simple estimator for the dynamic decomposition of the Generalized Dynamic Factor Model that avoids frequency-domain methods. First, we show that it is a reasonable approximation to assume that the dynamic common component of the Generalized Dynamic Factor Model admits a representation in terms of current and lagged statically pervasive factors. Then, assuming finite lag order, this simplification reduces estimation to a regression of the observed variables on estimated factors and their lags, where the factors are extracted via static principal components. The proposed approach naturally accommodates weak, non-pervasive factors within the dynamic common space. We establish consistency and asymptotic normality for both the dynamic and weak common components under a new asymptotic framework that allows for such weak factors. In an application to three high-dimensional time series panels of European macroeconomic data we detect a sizeable weak common component share in several key macroeconomic indicators.

2024-10-28T10:07:06Z Philipp Gersing http://arxiv.org/abs/2512.03983v2 Statistical hypothesis testing for differences between layers in dynamic multiplex networks 2026-06-06T15:56:43Z

With the emergence of dynamic multiplex networks, corresponding to graphs where multiple types of edges evolve over time, a key inferential task is to determine whether the layers associated with different edge types differ in their connectivity. In this work, we introduce a hypothesis testing framework, under a latent space network model, for assessing whether the layers share a common latent representation. The method we propose extends previous literature related to the problem of pairwise testing for random graphs and enables global testing of differences between layers in multiplex graphs. While we introduce the method as a test for differences between layers, it can easily be adapted to test for differences between time points. We construct a test statistic based on a spectral embedding of an unfolded representation of the graph adjacency matrices and demonstrate its ability to detect differences across layers in the asymptotic regime where the number of nodes in each graph tends to infinity. The finite-sample properties of the test are empirically demonstrated by assessing its performance on both simulated data and a biological dataset describing the neural activity of larval Drosophila.

2025-12-03T17:14:33Z 12 pages, 3 figures Maximilian Baum Francesco Sanna Passino Axel Gandy