https://arxiv.org/api/RHXfXBZGxXIXqQTFhgMkbnvBduM
2026-06-09T23:29:56Z
361014515

http://arxiv.org/abs/2606.08468v1
Nonparametric undirected graphical model selection using diffusion models
2026-06-07T06:22:13Z
Undirected graphical models provide a fundamental framework for representing conditional independence structures among high-dimensional random variables. While undirected graphical model selection has become a central problem in high-dimensional statistics, most existing methods are restricted to parametric settings. In this paper, we develop a nonparametric approach to undirected graphical model selection based on diffusion models. Recent work has shown that diffusion models can adapt to the unknown graph structure of the underlying distribution, yet utilizing these models for explicit graph estimation remains unexplored. To bridge this gap, we introduce a novel diffusion-based method for nonparametric undirected graphical model selection. We establish the model selection consistency of the proposed method and demonstrate its empirical performance through extensive simulations and two real data analyses.
2026-06-07T06:22:13Z
Hyeok Kyu Kwon, Myeonggu Kang, Minwoo Chae, Wanjie Wang

http://arxiv.org/abs/2103.11066v6
Treatment Allocation under Uncertain Costs
2026-06-07T03:51:53Z
We consider the problem of learning how to optimally allocate treatments whose cost is uncertain and can vary with pre-treatment covariates. This setting may arise in medicine if we need to prioritize access to a scarce resource that different patients would use for different amounts of time, or in marketing if we want to target discounts whose cost to the company depends on how much the discounts are used. Here, we show that the optimal treatment allocation rule under budget constraints is a thresholding rule based on priority scores (those with a higher score are treated first), and we propose a number of practical methods for learning these priority scores using data from a randomized trial.
Our formal results leverage a statistical connection between our problem and that of learning heterogeneous treatment effects under endogeneity using an instrumental variable. We find our method to perform well in a number of empirical evaluations.
2021-03-20T00:36:28Z
Georgy Kalashnov, Evan Munro, Hao Sun, Shuyang Du, Stefan Wager

http://arxiv.org/abs/2606.08418v1
TS-Neyman: Posterior Sampling for Adaptive Stratified Estimation
2026-06-07T02:36:16Z
Many model evaluation tasks reduce to estimating an average loss, error rate, or subgroup metric on a stratified pool when each label, human rating, or simulator call is costly. The precision-optimal Neyman allocation depends on within-stratum variances, which must be learned from the same observations used for estimation. We formulate this as a sequential allocation problem and use the exact one-step marginal variance reduction as the priority index. Replacing the unknown variances by independent inverse-chi-squared posterior draws yields TS-Neyman, a Thompson-sampling rule that preserves the oracle marginal-gain structure while randomizing over variance uncertainty. For any fixed finite number of strata, we prove almost-sure convergence of the TS-Neyman allocation proportions to the Neyman target, asymptotic optimality of the variance proxy, and a central limit theorem for the resulting adaptive stratified estimator. In two five-stratum budget-scaling benchmarks, one bounded-loss benchmark and one binary model-error benchmark in the spirit of Dai et al. 2023, TS-Neyman's relative efficiency stays within 5 percent of the oracle on the bounded-loss population and within about 15 percent on the binary benchmark. In an additional CivilComments real-data replay with confidence-based strata, it stays within about 8 percent of the oracle and improves on equal allocation by roughly 7 to 14 percent in MSE across budgets, while plug-in greedy and two-stage plug-in can degrade by over an order of magnitude under sparse pilots.
Common-pilot warm-start and prior-sensitivity studies show that this behavior is stable under working-model and working-prior misspecification.
2026-06-07T02:36:16Z
Kosuke Morikawa, Mst Moushumi Pervin, Jae Kwang Kim

http://arxiv.org/abs/2606.08409v1
Matrix representations and distance metrics for unlabeled ranked phylogenetic networks
2026-06-07T02:17:03Z
Phylogenetic networks are graphs inferred from molecular sequence data that represent ancestral histories shaped by reticulate processes such as recombination, hybridization, and horizontal gene transfer. We introduce a family of distance metrics for rooted, ranked, unlabeled phylogenetic networks, extending a previously developed distance for ranked trees. Our approach relies on a bijective triangular matrix representation of phylogenetic networks that captures the temporal order of internal events, speciations, and hybridizations. Our metrics, defined as standard matrix norms, allow efficient quantitative comparisons of network topologies, timed networks and networks with differing numbers of hybridizations. Our distance can be used for both isochronous networks where all tips are sampled at one time point, and heterochronous networks where tips are allowed to be sampled at different time points. We show that our metrics capture biologically meaningful differences among evolutionary histories in both simulations and empirical posterior distributions of viral phylogenetic networks. These tools fill a methodological gap, enabling principled comparisons of ranked, unlabeled phylogenetic networks, including ancestral recombination graphs.
2026-06-07T02:17:03Z
25 pages, 11 figures. Submitted to the Proceedings of the National Academy of Sciences (PNAS)
Jiayang Wang, Julia A. Palacios, Claudia Solís-Lemus

http://arxiv.org/abs/2606.08407v1
Topological Effective Connectivity Modeling in Brain Networks
2026-06-07T02:11:09Z
Characterizing directed information flow in brain networks is difficult because neural circuits are full of recurrent feedback loops. Many existing tools for directed dependence assume a directed acyclic graph (DAG) structure to resolve directional ambiguity, and therefore cannot represent these loops. We present a nonparametric, information-theoretic framework that addresses this by coupling the discrete Hodge decomposition with lead-lag mutual information, splitting the resulting edge flow into three orthogonal components: a gradient term capturing hierarchical, feed-forward relationships; a curl term isolating triangle-level feedback loops; and a harmonic term capturing cyclic flow around topological holes. This separation makes it possible to disentangle feed-forward drive from recurrent circulation, which conventional measures conflate. We further develop a permutation-based hypothesis-testing layer that identifies nodes and triangular motifs whose information-flow signatures change significantly between conditions. We validate the framework on simulations with known ground-truth structure and apply it to local field potential recordings from a rodent model of focal ischemic stroke. In three of four animals, we find a post-stroke shift toward hierarchical, source-driven propagation at the expense of recurrent feedback, while the fourth shows no significant change.
2026-06-07T02:11:09Z
45 pages, 15 figures
Anass El-Yaagoubi, Moo K. Chung, Hernando Ombao

http://arxiv.org/abs/2603.21161v2
An information criterion for detecting periodicities in functional time series
2026-06-07T00:39:43Z
We propose an information criterion for determining an unknown number of periodic components in functional time series. Identifying the number of frequencies in large-scale time series has been a central focus.
To achieve this goal, we suggest an iterative procedure, utilizing the residual process obtained through least squares fitting. This iterative approach demonstrates broad applicability. We establish the consistency of the estimated number of periodic components by minimizing the information criterion. The efficacy of the procedure is illustrated through numerical simulations. In real data analysis, we apply this information criterion to temperature data and sunspot data.
2026-03-22T10:28:54Z
Computational Statistics & Data Analysis (2026) 108430
Rinka Sagawa, Yan Liu, Valentin Patilea
10.1016/j.csda.2026.108430

http://arxiv.org/abs/2306.06756v3
Semi-Parametric Inference for Doubly Stochastic Spatial Point Processes: An Approximate Penalized Poisson Likelihood Approach
2026-06-06T23:12:19Z
Doubly-stochastic point processes model the occurrence of events over a spatial domain as an inhomogeneous Poisson process conditioned on the realization of a random intensity function. They are flexible tools for capturing spatial heterogeneity and correlation. However, existing implementations of doubly-stochastic spatial models are computationally demanding, often have limited theoretical guarantee, and/or rely on restrictive assumptions. We propose a penalized regression method for estimating covariate effects in doubly-stochastic point processes that is computationally efficient and does not require a parametric form or stationarity of the underlying intensity. Our approach is based on an approximate (discrete and deterministic) formulation of the true (continuous and stochastic) intensity function. We show that consistency and asymptotic normality of the covariate effect estimates can be achieved despite the model misspecification, and develop a covariance estimator that leads to a valid, albeit conservative, statistical inference procedure.
A simulation study shows the validity of our approach under less restrictive assumptions on the data generating mechanism, and an application to Seattle crime data demonstrates better prediction accuracy compared with existing alternatives.
2023-06-11T19:48:39Z
Si Cheng, Jon Wakefield, Ali Shojaie

http://arxiv.org/abs/2606.08322v1
Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA
2026-06-06T20:20:55Z
To characterize the US airline profit cycles from 1995 to 2020, the authors of Renold et al. (2023) combine k-means clustering, principal component analysis, and system dynamic modelling. We replicate their clustering experiment in three spaces -- the original 7-dimensional raw-variable space, a 3-dimensional PC score space, and a 4-dimensional PC score space using their dataset gratefully included in the paper. We show that the six-cluster taxonomy is geometrically robust: k-means in 3-PC space produces bit-for-bit identical cluster assignments relative to 7D raw space. As a nonlinearity check we apply kernel PCA under six kernels spanning three families plus a linear baseline. All six kernels preserve the six-cluster assignment in 2D. A 1D diagnostic tightens this: the linear kernel conflates the COVID year C_3 with the peak-profit cluster C_0, whereas all five non-baseline kernels shift C_3 to overlap only the post-financial-crisis cluster C_5. Agreement across the kernel families confirms an intrinsically linear manifold with no hidden curvature. The silhouette criterion reveals that the dataset structurally supports only three clusters, not six.
Collinearity in the raw 7D space suppresses the silhouette signal that would otherwise identify k=3 as the structurally motivated choice.
2026-06-06T20:20:55Z
Andreas Schlapbach

http://arxiv.org/abs/2606.05450v2
Eigenvector Spatial Filters Nuclear Norm Matrix Completion with Application to Air Quality Data
2026-06-06T19:32:05Z
Reliable reconstruction of missing observations in environmental panel datasets is essential for accurate exposure assessment and policy analysis. Traditional nuclear norm matrix completion methods effectively impute missing entries in low-rank matrices, yet often overlook the spatial dependence inherent to air quality processes. This paper introduces the Eigenvector Spatial Filters Nuclear Norm Matrix Completion (ESFNNMC) method, an extension of nuclear norm fixed-effects matrix completion that replaces unit-specific intercepts with a set of Moran-type eigenvectors capturing spatial autocorrelation in the data. To estimate the model, we propose a Block-Coordinate Descent (BCD) approach for multiconvex optimization problems, with soft-thresholded singular value decomposition and cross-validated regularization. Through comprehensive simulations varying missingness patterns, the level of spatial and temporal autocorrelation, and dimension, shape, and rank structure of the matrices, ESFNNMC demonstrates substantial improvements in imputation accuracy over the standard fixed-effects approach, while keeping the computational cost approximately unchanged.
The method is applied to impute missing entries in daily PM10 measurements in 64 monitoring stations in Lombardy, Italy, during the year 2021.
2026-06-03T21:11:18Z
29 pages, 5 figures, 14 tables, draft version (do not cite yet)
Rodolfo Metulini

http://arxiv.org/abs/2605.14610v3
Parametrically Adaptive Transition Polynomial: a Signed-Parity Continuous-alpha Extension of Kunchenko Stochastic Polynomials
2026-06-06T18:36:54Z
Kunchenko's method of polynomial maximization provides a semiparametric apparatus for parameter estimation under non-Gaussian errors, but its classical power basis relies on finite higher-order integer moments. This paper introduces the Parametrically Adaptive Transition Polynomial (PATP), a signed-parity fractional-power family controlled by a continuous parameter alpha in [0,1]. The quadratic exponent map p_i(alpha) connects the fractal regime p_i(0)=1/i, the degenerate linear point p_i(1/2)=1, and the signed-parity integer-power regime p_i(1)=i. For the degree-S=2 case we derive a closed-form variance-reduction coefficient g_2(alpha) in terms of signed and absolute fractional moments, identify the singular behavior at alpha=1/2, and state the moment and regularity conditions under which the formula is meaningful. The construction should be read as a Form-B PATP analogue within Kunchenko's generalized apparatus, not as an exact recovery of the canonical even-power PMM basis at alpha=1. Numerical illustrations on canonical distributions are used to examine the finite-sample behavior of the signed-parity estimator and to mark the boundary of applicability for extremely heavy-tailed cases such as Cauchy.
2026-05-14T09:26:53Z
35 pages, 8 figures. Code supplement: https://github.com/SZabolotnii/Ku-PATP-code-supplement
Serhii Zabolotnii

http://arxiv.org/abs/2606.08289v1
Direct domain estimation via regression-tree-assisted estimators in the production of official statistics
2026-06-06T18:20:26Z
National statistical offices (NSOs) produce their estimates under a single weighting system (uni-weight approach): one set of weights, independent of the variable of interest, is used to estimate multiple parameters and multiple subpopulations (domains). In this paper we study, within the family of model-assisted estimators and from a design-based perspective of direct estimation, the use of regression trees as the assisting model for estimating totals in unplanned domains. We distinguish two strategies: (i) fitting a single tree at the population level and deriving from it uni-weight weights applicable to any domain, and (ii) fitting a domain-specific tree. We show that both estimators can be written as weighted sums with weights that do not depend on $y$, preserving the uni-weight property and additivity benchmarking with respect to the population total. Extending to trees the classical result, we argue why the estimator built from a population-level model tends to behave like the Horvitz-Thompson estimator within domains, whereas the domain-specific model can achieve substantial variance reductions. A simulation study based on microdata from the Uruguayan Continuous Household Survey (ECH) illustrates the behavior of the estimators at the population level and by department.
2026-06-06T18:20:26Z
Juan Pablo Ferreira

http://arxiv.org/abs/2606.08261v1
Sparse Longitudinal Functional Principal Component Analysis for Episodic Ambulatory Behavioral Assessments
2026-06-06T17:16:37Z
Accurately monitoring mental fatigue is critical for improving workplace safety and productivity.
A recent study examined unobtrusively collected smartphone typing speed as a potential ambulatory proxy assessment of mental fatigue using data from the Intern Health Study (IHS). While population-level average typing speed patterns were found to be consistent with validated measures of mental fatigue, how these trajectories vary across participants and days may inform opportune moments for just-in-time interventions and remains an open question. Treating typing speed trajectories as sparsely observed functional data, we propose a novel sparse longitudinal functional principal component analysis (sparse LFPCA) method for decomposing variability and predicting individual curves. Specifically, sparse data are accommodated by casting covariance estimation as a structured penalized spline regression problem, enabling simultaneous estimation and smoothing of multiple covariance components while borrowing information across locations in the functional domain. Simulations show that sparse LFPCA (1) accurately estimates eigenfunctions and generates reasonable predictions for underlying curves, and (2) achieves similar or superior performance compared to existing alternatives. Our analysis of typing speed data collected from IHS reveals new and interpretable participant- and day-level patterns not captured by previous analyses and can be used to tailor behavioral interventions.
2026-06-06T17:16:37Z
Nidhi Pai, Yu Fang, Srijan Sen, Zhenke Wu, Erjia Cui

http://arxiv.org/abs/2501.04615v4
Doubly Robust and Efficient Calibration of Prediction Sets for Right-Censored Time-to-Event Outcomes
2026-06-06T17:11:41Z
Our objective is to construct well-calibrated prediction sets for a time-to-event outcome subject to right-censoring with guaranteed coverage. Inspired by modern conformal inference, our approach avoids the need for a well-specified parametric or semiparametric survival model.
Unlike existing conformal methods for survival data, which assume Type-I censoring with fully observed censoring times, we consider the more common right-censoring setting in which only the censoring time or only the event time is observed, whichever comes first. Under a standard conditional independence censoring condition, we propose and analyze several lower prediction bounds for the survival time of a future observation, including inverse-probability-of-censoring weighting, and its augmented version based on the semiparametric efficient influence function for the relevant marginal quantile of the outcome accounting for dependent censoring. We formally establish asymptotic coverage guarantees of the proposed methods, and demonstrate both theoretically and through empirical experiments, that the augmented approach substantially improves efficiency over all other proposed methods. Specifically, its coverage error bound is doubly robust, and therefore of second order, thus ensuring that it is asymptotically negligible relative to the coverage error of the other methods.
2025-01-08T16:57:18Z
48 pages, 11 figures
Rebecca Farina, Eric J. Tchetgen Tchetgen, Arun Kumar Kuchibhotla

http://arxiv.org/abs/2410.20885v3
A Distributed Lag Approach to the Generalised Dynamic Factor Model
2026-06-06T16:43:28Z
We propose a simple estimator for the dynamic decomposition of the Generalized Dynamic Factor Model that avoids frequency-domain methods. First, we show that it is a reasonable approximation to assume that the dynamic common component of the Generalized Dynamic Factor Model admits a representation in terms of current and lagged statically pervasive factors. Then, assuming finite lag order, this simplification reduces estimation to a regression of the observed variables on estimated factors and their lags, where the factors are extracted via static principal components. The proposed approach naturally accommodates weak, non-pervasive factors within the dynamic common space.
We establish consistency and asymptotic normality for both the dynamic and weak common components under a new asymptotic framework that allows for such weak factors. In an application to three high-dimensional time series panels of European macroeconomic data we detect a sizeable weak common component share in several key macroeconomic indicators.
2024-10-28T10:07:06Z
Philipp Gersing

http://arxiv.org/abs/2512.03983v2
Statistical hypothesis testing for differences between layers in dynamic multiplex networks
2026-06-06T15:56:43Z
With the emergence of dynamic multiplex networks, corresponding to graphs where multiple types of edges evolve over time, a key inferential task is to determine whether the layers associated with different edge types differ in their connectivity. In this work, we introduce a hypothesis testing framework, under a latent space network model, for assessing whether the layers share a common latent representation. The method we propose extends previous literature related to the problem of pairwise testing for random graphs and enables global testing of differences between layers in multiplex graphs. While we introduce the method as a test for differences between layers, it can easily be adapted to test for differences between time points. We construct a test statistic based on a spectral embedding of an unfolded representation of the graph adjacency matrices and demonstrate its ability to detect differences across layers in the asymptotic regime where the number of nodes in each graph tends to infinity. The finite-sample properties of the test are empirically demonstrated by assessing its performance on both simulated data and a biological dataset describing the neural activity of larval Drosophila.
2025-12-03T17:14:33Z
12 pages, 3 figures
Maximilian Baum, Francesco Sanna Passino, Axel Gandy