http://arxiv.org/abs/2605.18339v1 Compositional Periodic Spline Approximation for Circular Density Data in Bayes Spaces 2026-05-18T12:52:16Z

This paper proposes a novel framework for the approximation and analysis of circular density data using compositional periodic splines within Bayes spaces with the Hilbert space structure. By applying the centered log-ratio transformation, densities are represented in a subspace of the standard $L^2$ space of real-valued functions, which enables the use of functional data analysis tools while preserving the relative nature of distributions and their periodic structure. A coefficient-based construction of periodic splines with a zero-integral constraint is developed, together with matrix formulations for both smoothing splines and penalized splines, allowing efficient estimation and implementation. The methodology is applied to long-term wind direction data, where it provides smooth and interpretable density estimates and supports further statistical analysis, including functional regression. The results demonstrate the practical relevance of the proposed approach and its potential for extensions to more complex density-valued data.
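As a rough illustration of the centered log-ratio (clr) step mentioned in the abstract, the sketch below applies a discretized clr transform to a density on a uniform circular grid and maps it back. It is a minimal sketch under stated assumptions, not the authors' spline construction: the grid size, the example density, and the helper names `clr` and `clr_inverse` are all hypothetical.

```python
import numpy as np

def clr(density, dx):
    """Discretized centered log-ratio transform on a uniform grid."""
    log_f = np.log(density)
    # Subtract the average of log f over the domain (integral / domain length).
    return log_f - np.sum(log_f * dx) / (dx * log_f.size)

def clr_inverse(z, dx):
    """Map a zero-integral function back to a density integrating to one."""
    f = np.exp(z)
    return f / np.sum(f * dx)

# Hypothetical example: a unimodal density on [0, 2*pi) with illustrative parameters.
n_grid = 360
theta = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
dx = 2.0 * np.pi / n_grid
raw = np.exp(2.0 * np.cos(theta - np.pi / 3.0))
density = raw / np.sum(raw * dx)

z = clr(density, dx)            # lives in the zero-integral subspace of L^2
recovered = clr_inverse(z, dx)  # round trip back to the density
```

The transformed function `z` integrates to zero on the grid, which is the constraint the abstract's periodic splines are built to satisfy.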

2026-05-18T12:52:16Z Jitka Machalová Jana Heckenbergerová Karel Hron

http://arxiv.org/abs/2605.09782v2 Near-Linear Time Generalized Sinkhorn Algorithms for Bounded Genus Graphs 2026-05-18T11:42:00Z

We present GenusSink, a new class of approximate generalized Sinkhorn algorithms with shortest-path-distance costs for bounded genus (e.g. planar) graphs, providing near-linear time (1) pre-processing, (2) iteration steps, and (3) final transport plan matrix querying, together with near-linear memory. Graphs handled by GenusSink include in particular planar graphs and bounded-genus meshes approximating 3D objects. GenusSink addresses the total quadratic time complexity of its brute-force counterpart by leveraging separator-based decomposition of graphs, computational geometry techniques, and new results on fast matrix-vector multiplications with generalized distance matrices, using, in particular, Fourier analysis and low displacement rank theory. It is inspired by recent breakthroughs in graph theory on approximating bounded genus metrics with small treewidth metrics \citep{minor-free-paper}. The graph-centric approach enables us to target optimal transport problems in which the distributions are defined on manifolds approximated by weighted graphs and the cost functions are given by geodesic distances. We conduct a rigorous theoretical analysis of GenusSink, provide practical implementations that leverage the \textit{separation graph field integrators} (S-GFIs) data structures newly introduced in this paper, and present empirical verification. GenusSink provides orders of magnitude more accurate computations than other efficient Sinkhorn algorithms, while still guaranteeing significant computational improvements over the baseline. As a by-product of the developed methods, we show that GenusSink is \textbf{numerically equivalent} to the brute-force geodesic Sinkhorn algorithm on $n$-vertex graphs with treewidth $O(\log \log (n))$ (e.g. on trees).
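For orientation, the brute-force baseline that the abstract contrasts against can be sketched as dense Sinkhorn iterations with a shortest-path cost matrix. This is not GenusSink itself: the path graph, the regularization value `eps`, the marginals, and the function name `sinkhorn` are illustrative assumptions. The dense `n x n` kernel and its matrix-vector products are where the quadratic cost arises.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=1.0, n_iter=2000):
    """Dense Sinkhorn iterations for entropy-regularized optimal transport."""
    K = np.exp(-cost / eps)        # dense Gibbs kernel (quadratic in n)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)            # rescale to match the row marginals
        v = b / (K.T @ u)          # rescale to match the column marginals
    return u[:, None] * K * v[None, :]

# Hypothetical example: a path graph with unit edge weights,
# so the shortest-path distance between vertices i and j is |i - j|.
n = 6
idx = np.arange(n)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)
a = np.full(n, 1.0 / n)
b = np.array([0.3, 0.1, 0.1, 0.1, 0.1, 0.3])

plan = sinkhorn(a, b, cost)        # approximate transport plan
```

Each iteration costs two dense matrix-vector products with `K`, which is the step a fast multiplication scheme for distance-based matrices would replace.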

2026-05-10T22:00:42Z Krzysztof Choromanski Derek Long Ananya Parashar Dwaipayan Saha

http://arxiv.org/abs/2605.18206v1 A tool to determine the degrees of freedom in tree-structured varying coefficient models 2026-05-18T10:45:31Z

The tree-structured varying coefficient (TSVC) model is a flexible approach for generalized regression, where the linear effects of the covariates are allowed to vary with the values of effect modifiers. Relevant effect modifiers and interactions are identified using recursive partitioning. In TSVC models, analogously to other semi- and nonparametric regression approaches, one needs to account for the cost of data-driven model building when deriving the model degrees of freedom (DoF). To address this issue, we develop an easy-to-apply formula to approximate the DoF of a TSVC model. This formula is employed for model selection based on the Bayesian information criterion (BIC) and compared to the naive solution, setting the DoF to the number of free model parameters, in a simulation study. To illustrate the proposed DoF method, TSVC models using BIC-based selection were fitted to data from the Survey of Health, Ageing, and Retirement in Europe. Results indicated that calculation of the DoF using the proposed formula resulted in more accurate selection results with improved predictive ability.

2026-05-18T10:45:31Z Nikolai Spuck Moritz Berger

http://arxiv.org/abs/2605.18167v1 1-truncated C-vine copula mixed models for network meta-analysis of multiple diagnostic tests 2026-05-18T10:09:58Z

As meta-analysis of multiple diagnostic tests impacts clinical decision making and patient health, there is growing interest in statistical models that synthesize evidence from studies comparing multiple diagnostic tests. To compare the accuracy of multiple diagnostic tests in a single study, three designs are commonly used: (i) the multiple test comparison design; (ii) the randomized design; and (iii) the non-comparative design. Generalized linear mixed models (GLMMs) are currently the recommended approach for jointly meta-analyzing data from all three designs, enabling simultaneous inference. In this context, 1-truncated C-vine copula mixed models are proposed as a flexible and powerful alternative. These models generalize the GLMM framework by allowing for arbitrary univariate distributions of the random effects and capturing tail dependencies and asymmetries. We demonstrate the utility of our methods with an extensive simulation study and by re-analysing a case study on the network meta-analysis of diagnostic tests for deep vein thrombosis. Findings indicate that 1-truncated C-vine copula mixed models can offer improvements over GLMMs, supporting their adoption for network meta-analysis of multiple diagnostic tests.

2026-05-18T10:09:58Z Aristidis K. Nikoloulopoulos

http://arxiv.org/abs/2605.18134v1 Optimal Sampling for Kernel Quadrature on Unbounded Domains 2026-05-18T09:39:11Z

Kernel quadrature is widely used to approximate integrals of smooth functions, with worst-case error typically decaying at the minimax rate $n^{-\alpha/d}$ for smoothness $\alpha$ in dimension $d$. Existing rate-optimal methods often depend on deterministic point sets tailored to a specific kernel, making them sensitive to misspecification and less robust in practice. In this work, we study randomized quadrature methods with a focus on robustness rather than kernel-specific optimality. We construct an explicit, $n$-dependent sampling distribution that achieves minimax rates for worst-case error over smoothness classes without requiring knowledge of the kernel. This kernel-agnostic design improves robustness while retaining optimal rates. Our analysis includes unbounded sampling measures such as Gaussian and Student-$t$ distributions, extending beyond compact domains. The results provide both theoretical guarantees and a practical recipe for robust, rate-optimal randomized quadrature.

2026-05-18T09:39:11Z Edoardo Bandoni CEREMADE Christian Robert CEREMADE Julien Stoehr CEREMADE

http://arxiv.org/abs/2605.14692v2 Asymptotic Anytime-Valid Inference for U-statistics 2026-05-18T09:05:31Z

We study asymptotic anytime-valid confidence sequences for degree-two U-statistics under continuous monitoring. In the nondegenerate case, Hoeffding's projection reduces the problem to a time-uniform central limit theory for the partial sums of the first-order projection, while the canonical remainder is shown to be negligible under mild moment assumptions. A leave-one-out jackknife estimator then yields a fully data-driven procedure, leading to confidence sequences with an asymptotic coverage guarantee for the parameter of interest. In the degenerate case, we show that the U-statistic is approximated by a centered quadratic Gaussian chaos rather than by a simple Gaussian, which poses significant challenges for sequential inference. To address this issue, we develop the Spectrally Allocated Gaussian-chaos Excursion (SAGE) boundary, and then provide plug-in implementations based on truncated spectrum estimation with consistency guarantees. The resulting widths can attain the expected time-uniform optimal rates: $\sqrt{\log\log n/n}$ in the nondegenerate regime and $\log\log n/n$ in the degenerate regime. Several widely used U-statistics are discussed within the proposed framework, and numerical experiments further support the validity of the derived theory.
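To make the object of study concrete, here is a minimal sketch of a degree-two U-statistic, using the textbook kernel $h(s,t) = (s-t)^2/2$ whose U-statistic is the unbiased sample variance. It illustrates only the statistic, not the confidence sequences or the SAGE boundary; the sample, the seed, and the function names are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def u_statistic(x, kernel):
    """Average of a symmetric kernel over all unordered pairs i < j."""
    pairs = list(combinations(range(len(x)), 2))
    return sum(kernel(x[i], x[j]) for i, j in pairs) / len(pairs)

def variance_kernel(s, t):
    """Symmetric degree-two kernel whose U-statistic is the sample variance."""
    return 0.5 * (s - t) ** 2

# Hypothetical data: 50 standard normal draws with a fixed seed.
rng = np.random.default_rng(0)
x = rng.normal(size=50)

u_n = u_statistic(x, variance_kernel)
```

This kernel is nondegenerate for non-constant data with a finite fourth moment, which corresponds to the first of the two regimes the abstract distinguishes.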

2026-05-14T11:09:31Z Leheng Cai Qirui Hu Weijia Li

http://arxiv.org/abs/2604.01160v2 Machine learning methods for finite population parameter estimation in survey sampling 2026-05-18T08:23:04Z

This pedagogical review examines the use of machine learning methods in finite-population inference for survey sampling, with an emphasis on design-based validity and statistical inference. While flexible prediction tools offer substantial gains in estimation accuracy, they also introduce important challenges, primarily due to the dependence between the fitted predictors and the sample. We focus on settings in which such predictions enter survey estimation through model-assisted estimation, item nonresponse imputation, and unit nonresponse adjustment. For model-assisted estimation and item nonresponse, we show how cross-fitting and Neyman-orthogonal estimating equations can adapt ideas from double/debiased machine learning to survey data, allowing the use of high-dimensional or nonparametric learners while preserving root-n consistency and asymptotic normality under suitable conditions. In contrast, for unit nonresponse, standard inverse-probability weighting remains outcome-agnostic and operationally attractive, but this same feature makes doubly robust and orthogonal constructions harder to deploy in official statistics. We also briefly discuss related developments in small area estimation and probability/nonprobability data integration. Overall, the paper highlights both the promise of machine learning and the fundamental inferential challenges it raises for survey practice.

2026-04-01T17:12:37Z Mehdi Dagdoug David Haziza

http://arxiv.org/abs/2605.18030v1 A robust nonparametric test for spatial isotropy in lattice data 2026-05-18T08:21:10Z

This paper proposes a robust test for assessing isotropy based on the variogram of spatial data on a two-dimensional regular grid. The test is based on the non-robust subsampling test for isotropy of Guan et al. (2004), which uses the idea of comparing variogram estimates in different directions at the same distance. The robust test employs robust variogram estimators which are based on estimators of univariate or multivariate scatter and perform well in the presence of isolated or block outliers. Additionally, a different resampling method, called block permutation, is proposed. Compared with the subsampling test, the block permutation test maintains the significance level even for strong dependencies in the data and is robust to outliers. The methods are illustrated by an application to Landsat 8 satellite data, where outlier blocks may occur due to, for example, clouds.

2026-05-18T08:21:10Z 33 pages, 11 figures, 7 tables Jana Gierse Roland Fried

http://arxiv.org/abs/2501.08492v2 Bayesian Sphere-on-Sphere Regression with Optimal Transport Maps 2026-05-18T07:26:33Z

Spherical regression, in which both covariates and responses lie on the sphere, arises in many scientific applications and has attracted considerable methodological attention in recent years. Despite this progress, constructing flexible and expressive regression models between spherical domains remains challenging, particularly because a single global mapping is often insufficient to capture complex relationships across the entire sphere. A natural strategy is therefore to partition the spherical domain and allow distinct mappings within each region, though this introduces the additional challenge of modeling the partition structure itself. To address these issues, we propose an approach based on optimal transport to model spherical partitions, combined with parametric mappings defined locally within each region. We adopt a Bayesian framework to jointly model both the partitioning and the associated regression maps. This framework enables the identification of heterogeneous regions on the sphere while providing principled uncertainty quantification. Through real-data applications, we demonstrate that the proposed method achieves strong predictive performance, yields meaningful uncertainty estimates, and reveals interpretable clustering structure in spherical data.

2025-01-14T23:39:50Z Tin Lok James Ng Kwok-Kun Kwong Jiakun Liu Andrew Zammit-Mangion

http://arxiv.org/abs/2512.22098v4 Exact inference via quasi-conjugacy in two-parameter Poisson-Dirichlet hidden Markov models 2026-05-18T06:48:39Z

We introduce a nonparametric model for inferring time-evolving, unobserved probability distributions from discrete-time data consisting of unlabelled partitions. The latent process is a two-parameter Poisson-Dirichlet diffusion, and observations arise via exchangeable sampling. Applications include social and genetic data where only aggregate clustering summaries are observed. To address the intractable likelihood, we develop a tractable inferential framework that avoids label enumeration and direct simulation of the latent state. We exploit a duality between the diffusion and a pure-death process on partitions, together with coagulation operators that encode the effect of new data. These yield closed-form, recursive updates for forward and backward inference. We compute exact posterior distributions of the latent state at arbitrary times and predictive distributions of future or interpolated partitions. This enables online and offline inference and forecasting with full uncertainty quantification, bypassing MCMC and sequential Monte Carlo. Compared to particle filtering, our method achieves higher accuracy, lower variance, and substantial computational gains. We illustrate the methodology with synthetic experiments and a social network application, recovering interpretable patterns in time-varying heterozygosity.

2025-12-26T17:54:58Z Final accepted version. To appear in JASA Marco Dalla Pria Matteo Ruggiero Dario Spanò

http://arxiv.org/abs/2605.17934v1 Conditional Predictive Inference for General Structured Data with Group Symmetries 2026-05-18T06:41:20Z

We study distribution-free predictive inference for data with group symmetries, aiming to establish near-conditional coverage guarantees beyond exchangeability for structured data. While many predictive inference methods achieve a target coverage level, most provide marginal coverage. In practice, conditional predictive inference is often preferred, as it quantifies uncertainty for black-box predictions given observed attributes, thereby accommodating heterogeneity. Although many efforts have pursued efficient conditional coverage, existing methods rely on the i.i.d. or exchangeable assumption, often violated in structured settings such as networks, clusters, and imaging data. Recently, SymmPI introduced a unified approach to predictive inference under group symmetries beyond exchangeability; nevertheless, its guarantees remain marginal and do not account for population heterogeneity. To bridge this gap, we introduce C-SymmPI, a framework that achieves near-conditional coverage under general data structures with group symmetries, extending beyond exchangeability to cover networks, cluster-level data, and related structures. Inspired by relaxed multi-accuracy, our approach reformulates conditional coverage as miscoverage error over a user-specified function class. We establish theoretical guarantees under distributional invariance and distribution shift, and derive convergence rates for linear and RKHS function classes, recovering state-of-the-art results in the exchangeable setting as special cases. For computational efficiency, we develop two variants: a projection-based algorithm for high-dimensional observations, and a sampling-based algorithm for large or infinite groups. We demonstrate effectiveness on hierarchical and network data. Empirical results show that C-SymmPI delivers more informative and stable conditional coverage with improved accuracy compared to existing methods.

2026-05-18T06:41:20Z Yichen Shen Mengxin Yu

http://arxiv.org/abs/2605.17920v1 Multivariate reconciliation for hierarchical time series 2026-05-18T06:29:24Z

Some time series can be hierarchically organized into levels based on certain characteristics, such as geography or other attributes of interest. These series are referred to as hierarchical time series. Typically, forecasts are generated at all levels and should be coherent, meaning that they satisfy the same aggregation constraints as the observed data. Various approaches have been proposed to guarantee this coherence by using a set of base forecasts. The process through which these forecasts are adjusted to become coherent is known as forecast reconciliation. Similar to the univariate case, multivariate time series can also be structured hierarchically. However, all existing approaches are limited to a single variable. As a result, ensuring coherent forecasts requires reconciling each variable separately, a process that does not account for correlations among multiple variables. To address this limitation, this paper proposes a multivariate reconciliation methodology that ensures coherent forecasts and incorporates relationships among variables. The proposed methodology was tested through numerical simulations, considering distinct scenarios within the series hierarchy and across multiple variables. Additionally, some base forecasting models were evaluated. The methodology was also applied to real employment data of admissions and dismissals in Brazil. The results demonstrated that multivariate reconciliation yielded more accurate outcomes than the other methods considered, both in simulated data and in practical applications.
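The notion of coherence and reconciliation described above can be illustrated with the standard single-variable OLS reconciliation, which projects base forecasts onto the space spanned by a summing matrix. This is a sketch of that well-known univariate baseline, not the multivariate method proposed in the paper; the two-series hierarchy and the base forecast values are hypothetical.

```python
import numpy as np

# Hypothetical hierarchy: total = A + B. Rows: total, A, B; columns: A, B.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Illustrative incoherent base forecasts: 10 != 4 + 5.
base = np.array([10.0, 4.0, 5.0])

# OLS reconciliation: reconciled = S (S'S)^{-1} S' base.
P = np.linalg.solve(S.T @ S, S.T)   # maps base forecasts to bottom-level forecasts
bottom = P @ base
reconciled = S @ bottom             # coherent forecasts at all levels
```

After reconciliation the first entry equals the sum of the other two, so the forecasts satisfy the same aggregation constraint as the data. Each variable would be treated this way separately in the single-variable setting the abstract describes.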

2026-05-18T06:29:24Z 22 pages, 7 figures, 8 tables Ana Caroline Pinheiro Rodrigo de Souza Bulhões Rob J. Hyndman Paulo Canas Rodrigues

http://arxiv.org/abs/2605.17910v1 Double/Debiased Machine Learning for Continuous Treatment Effects in Panel Data with Endogeneity 2026-05-18T06:16:18Z

We propose a double/debiased machine learning framework to estimate average derivative effects in nonparametric panel models with two-way fixed effects. It extends instrumental variable methods to panel settings, handles continuous treatments and various forms of endogeneity, and introduces a cross-fitting scheme to restore independence after eliminating time fixed effects. A penalized GMM debiasing term enables automatic debiased machine learning with endogeneity. Our estimators for contemporaneous, dynamic, and aggregated effects are consistent and asymptotically normal with a valid variance estimator. Simulations show reduced regularization bias and accurate confidence intervals. An application to ECLS-K data reveals rich dynamics in the effect of family SES on childhood BMI.

2026-05-18T06:16:18Z Peikai Wu Kuan Sun Zhiguo Xiao

http://arxiv.org/abs/2605.17864v1 Wavelet Based Time Series Models with Time-Varying Thresholds 2026-05-18T05:15:52Z

This paper develops a threshold model with a time-varying threshold, represented using a wavelet series expansion. The model adequately captures irregular and abrupt variations, as well as smooth changes in the threshold parameter, allowing greater flexibility than Fourier-based approaches. Simulation experiments and real-data applications are used to evaluate the model's performance.

2026-05-18T05:15:52Z Rhea Davis N. Balakrishna

http://arxiv.org/abs/2605.17778v1 Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models 2026-05-18T02:56:57Z

Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.
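One familiar sense in which ridge regression acts through the spectrum is that its solution rescales each singular direction of the design matrix by $s_i/(s_i^2+\lambda)$. The sketch below checks this identity numerically. The abstract does not define the paper's class of spectral shrinkage estimators, so this is only an assumed illustration of the general idea; the data, the seed, and the penalty `lam` are hypothetical.

```python
import numpy as np

# Hypothetical regression data with a fixed seed.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
lam = 2.0  # illustrative ridge penalty

# Ridge estimator via the normal equations: (X'X + lam I)^{-1} X'y.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Same estimator written as shrinkage along the singular directions of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s / (s**2 + lam)           # per-direction shrinkage factors
beta_spectral = Vt.T @ (shrink * (U.T @ y))
```

Other choices of the per-direction factors `shrink` give other estimators acting on the same singular directions, which is the kind of comparison the abstract refers to.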

2026-05-18T02:56:57Z 103 pages, 8 figures Radu Lecoiu Debarghya Mukherjee Pragya Sur