http://arxiv.org/abs/2606.11235v1 Few-Shot Resampling for Scalable Statistically-Sound Data Mining 2026-05-29T09:00:26Z

A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
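For context, the conventional resampling baseline that the abstract describes as costly can be sketched as below. This is a generic Monte Carlo permutation test and not FewRS itself; the function name `permutation_p_value`, the difference-in-means statistic, and the default of 1000 resamples are illustrative assumptions, not details from the paper.

```python
import numpy as np

def permutation_p_value(x, y, n_resamples=1000, seed=0):
    """Monte Carlo permutation p-value for a difference in means.

    Generic baseline only (hypothetical helper, not the FewRS procedure).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_resamples):
        # Each resampled dataset is a random relabeling of the pooled data.
        perm = rng.permutation(pooled)
        stat = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        if stat >= observed:
            exceed += 1
    # The +1 correction counts the observed dataset itself, so p is never zero.
    return (exceed + 1) / (n_resamples + 1)
```

The cost is one full analysis per resampled dataset, which is the per-resample expense that the abstract says FewRS reduces by needing far fewer resamples.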

2026-05-29T09:00:26Z Accepted to KDD 2026 Leonardo Pellegrina Fabio Vandin 10.1145/3770855.3817752 http://arxiv.org/abs/2605.17126v2 Multi-task Linear Regression without Eigenvalue Lower Bounds: Adaptivity, Robustness, and Safety 2026-05-29T08:34:47Z

We study the multi-task linear regression problem in the presence of contaminated tasks. We address the setting where the unknown parameters of a majority of tasks are close in the $\ell_2$-norm, while a fraction of tasks are arbitrary outliers. Existing theoretical frameworks for this problem rely heavily on the assumption that the empirical second moment of each task has a minimum eigenvalue bounded away from zero (of order $\Omega(1)$). Crucially, this assumption fails in many high-dimensional scenarios, rendering prior guarantees vacuous. To overcome this limitation, we propose an estimator based on matrix-weighted norm regularization. We also introduce a relative balancedness condition, quantified by a balancedness constant, that compares each task's second moment with the average inlier geometry and relaxes the need for taskwise second-moment lower bounds. In favorable regimes with moderate balancedness, our prediction MSE bounds match the rate of Duan and Wang (2023) under substantially weaker spectral assumptions; the resulting task-overall MSE is minimax optimal up to logarithmic factors. Furthermore, we demonstrate that our estimator enjoys a safety guarantee: when the relevant balancedness constant is large or infinite, or when tasks are unrelated, the method performs no worse than independent task learning.

2026-05-16T19:06:54Z Accepted at ICML 2026 Seok-Jin Kim http://arxiv.org/abs/2508.18624v2 Unified theory of testing relevant hypotheses in functional time series 2026-05-29T01:55:32Z

In this paper, we develop a {\em unified} framework for testing relevant hypotheses in functional time series. The proposed approach accommodates one-sample, two-sample, and change point problems for contaminated observations under arbitrary sampling schemes. Combining B-spline estimation with self-normalization, we construct nuisance-parameter-free tests that bypass auxiliary estimation of long-run covariance functions and measurement-error variance functions. We establish asymptotic validity by exploiting a sequential Gaussian approximation for dependent random vectors of moderately high dimension, which leads to a pivotal limiting distribution. We also provide sufficient conditions for the non-degeneracy of the self-normalizer and establish consistent decision rules. A key theoretical finding is that the proposed tests detect $n^{-1/2}$-local alternatives under arbitrary sampling frequencies. This uncovers a sparse-to-dense phase transition distinct from those typically observed in functional data analysis: while the sampling frequency affects the asymptotic variance, the detection rate remains $n^{-1/2}$, even in sparsely sampled regimes. We further study multiple change point alternatives and extend the theory to settings where consistent change point estimates are available. We also discuss the choice of self-normalizers, including the recently developed range-adjusted self-normalizer. Extensive simulations support the theoretical results, and applications to the AU.SHF implied volatility and traffic volume datasets demonstrate the practical utility of the proposed methods.

2025-08-26T03:01:48Z Leheng Cai Qirui Hu http://arxiv.org/abs/2605.30718v1 Moment-Based Inference for Regression with Latent Dirichlet Covariates 2026-05-29T01:24:17Z

Topic models are often used as dimension-reduction tools before regression, with estimated document-level topic shares treated as observed covariates. This plug-in workflow creates two inferential difficulties: valid inference requires a regular first-stage-to-second-stage expansion that propagates topic-estimation uncertainty, and, at fixed document length, a document's topic mixture cannot be consistently recovered from its own words even when the population topic matrix is known. Corrected spectral moment methods for latent Dirichlet allocation (LDA) offer a starting point: when the total Dirichlet concentration is known, low-order word moments can be corrected to yield operators diagonal in the latent topic basis. We extend this to downstream regression. Under a finite LDA model with response residuals orthogonal to the low-order token moments used for identification, response-weighted word moments admit the same correction, and the resulting supervised operator identifies the regression coefficient $\beta$ directly, without estimating document-level topic shares. The main obstacle is that the correction depends on the unknown total concentration $\alpha_0$. We show that, for $k\ge3$ topics and under a generic finite-probe condition, $\alpha_0$ is identified by commutativity: at the true value a family of corrected word-moment operators commute, whereas away from it they generically do not. This yields a feasible estimator and lets uncertainty in $\hat{\alpha}_0$ propagate into inference for $\beta$. The estimator is asymptotically linear as the number of documents grows with fixed document length, with sandwich standard errors from document-level moment contributions. Simulations show near-nominal coverage where plug-in topic-share regressions can undercover, and an application to top economics journals illustrates contrast inference for latent topic effects.

2026-05-29T01:24:17Z Ziyu Jiang http://arxiv.org/abs/2501.12500v3 Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis 2026-05-29T00:42:04Z

Understanding climate dynamics requires going beyond correlations in observational data to uncover the underlying causal process. Latent drivers such as atmospheric processes play a central role in temporal dynamics, while direct causal influences also exist among geographically proximate observed variables. Traditional Causal Representation Learning (CRL) typically focuses on latent factors but overlooks such observable-to-observable causal relations, which limits its applicability to climate analysis. In this paper, we introduce a unified framework that jointly uncovers (i) causal relations among observed variables and (ii) latent driving forces together with their interactions. We establish conditions under which both the hidden dynamic process and the causal structure among observed variables are simultaneously identifiable from time-series data, and our guarantees continue to hold in the nonparametric setting through contextual information that recovers latent variables and causal relations. Building on these insights, we propose CaDRe (Causal Discovery and Representation learning), a time-series generative model with structural constraints that integrates CRL and causal discovery. Experiments on synthetic datasets validate our theoretical results. On real-world climate datasets, CaDRe delivers competitive forecasting accuracy and recovers visualized causal graphs aligned with domain expertise, thereby offering interpretable insights into climate systems. Code is available at https://github.com/MinghaoFu/CaDRe.

2025-01-21T21:04:08Z Accepted by ICML 2026 Minghao Fu Biwei Huang Zijian Li Yujia Zheng Ignavier Ng Guangyi Chen Yingyao Hu Kun Zhang http://arxiv.org/abs/2605.30674v1 Spectral subsampling MCMC for Lévy-driven continuous-time ARMA models with expensive likelihood contributions 2026-05-29T00:07:45Z

Subsampling-based Markov chain Monte Carlo (MCMC) algorithms aim to accelerate Bayesian inference by evaluating the likelihood using only a subset of the data at each iteration. However, in many standard tall-data applications, individual likelihood contributions are inexpensive to evaluate and the resulting reductions in actual computing time are often substantially smaller than the nominal reduction in data size due to computational overhead. We study a different computational regime arising in frequency-domain inference for continuous-time processes observed at equally spaced discrete time points. This gives rise to aliasing, whereby each contribution to the Whittle likelihood requires summation over shifted frequency components, unlike standard discrete-time spectral settings where spectral evaluations do not require such summation. We demonstrate that this structure makes subsampling MCMC, a subsampling-based MCMC approach that estimates the log-likelihood using data subsampling and efficient control variates, particularly effective for reducing computational cost. We illustrate the approach for Bayesian frequency-domain inference in discretely observed continuous-time autoregressive moving average models driven by finite second-moment Lévy processes.

2026-05-29T00:07:45Z Thomas Goodwin Matias Quiroz Robert Kohn Feng Li http://arxiv.org/abs/2605.30658v1 Consistent Bayesian Local Spatial Feature Selection with Application to Spatial Multimodal Omics 2026-05-28T23:36:49Z

Motivated by a high-dimensional regression problem in spatial multimodal omics (SMO), we propose a Bayesian framework for local spatial feature selection, where a random domain partition prior is introduced to divide the spatial domain into several contiguous clusters with flexible shapes and an unknown number of clusters, conditional on which a local feature selection prior is imposed within each cluster. The notion of "feature" is general and may include both covariates and functional bases, allowing the framework to perform both local variable selection and local basis selection, the latter being essential for adaptively approximating spatially varying functions with localized characteristics. We derive coupled hyperparameter conditions linking domain partition and local feature selection priors, under which the consistency theory and posterior contraction rates of both the domain partition and feature selection are established. We develop an efficient informed reversible jump Markov chain Monte Carlo algorithm to address the computational challenges encountered in joint posterior sampling of domain partitions and selected features. Simulation studies demonstrate the effectiveness of the proposed model and algorithm, highlighting its advantages over existing methods. The application of our model to an SMO dataset reveals biologically meaningful spatial patterns within breast cancer tissue.

2026-05-28T23:36:49Z Kun Huang Xiyu Peng Huiyan Sang Ligang Lu http://arxiv.org/abs/2501.02672v3 Re-examining Granger Causality with Causal Bayesian Networks and Reichenbachs Principles 2026-05-28T22:11:48Z

Characterising cause-effect relationships in complex systems is fundamental to understanding their underlying mechanisms. Granger causality (GC) remains a widely used computational tool for identifying causal relationships in time series data. However, like other causal discovery methods, GC has limitations and has been criticised for lacking a rigorous causal foundation. In this work, we address this criticism by reinterpreting GC through the lens of Reichenbach's principles and causal Bayesian networks. We implement this reinterpretation as an algorithm we call causalized Granger causality (c-GC). We demonstrate, both theoretically and graphically, that this reformulation endows GC with a robust causal interpretation under specific assumptions. c-GC yields satisfactory results on synthetic data, offering a more principled framework for causal discovery in observational datasets.
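As background for the abstract above, the following is a minimal sketch of the classical bivariate Granger causality F-statistic, which compares an autoregression of the effect on its own lags with one that also includes lags of the candidate cause. It is not the c-GC algorithm; the helper names `lagged` and `granger_f_stat` and the default of one lag are assumptions made for illustration.

```python
import numpy as np

def lagged(series, lags):
    """Columns are series[t-1], ..., series[t-lags] for t = lags, ..., n-1."""
    n = len(series)
    return np.column_stack([series[lags - k: n - k] for k in range(1, lags + 1)])

def granger_f_stat(cause, effect, lags=1):
    """Classical Granger F-statistic (hypothetical helper, not c-GC)."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[lags:]
    ones = np.ones((len(target), 1))
    # Restricted model: the effect explained by its own past only.
    restricted = np.hstack([ones, lagged(effect, lags)])
    # Unrestricted model: additionally uses the past of the candidate cause.
    unrestricted = np.hstack([restricted, lagged(cause, lags)])

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return resid @ resid

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)
```

A large statistic says only that the past of `cause` improves linear prediction of `effect`; whether that warrants a causal reading is the question the abstract takes up.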

2025-01-05T21:49:19Z S. A. Adedayo http://arxiv.org/abs/2605.30609v1 Rectified Linear Unit Regression 2026-05-28T22:02:55Z

This paper develops a regression framework for the direct estimation of integrated functionals of conditional outcome distributions. The proposed method, termed rectified linear unit (ReLU) regression, projects the ReLU-transformed outcome onto covariates and admits a closed-form estimator. Its population regression function coincides with the integrated conditional distribution function of the outcome, and its convex conjugate, obtained via the Legendre-Fenchel transformation, recovers the integrated conditional quantile function. Both the regression and its conjugate require only mild distributional assumptions and accommodate non-continuous outcomes. We establish the uniform asymptotic distribution of the estimator and develop inference for the conjugate functional via the delta method for Hadamard directionally differentiable maps. Building on these results, we establish identification and inference for average quantile treatment effects over arbitrary subintervals of probability levels. This broadens the set of distributional parameters available to empirical work.
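The link between a ReLU-transformed outcome and the integrated conditional distribution function rests on the identity $E[(t - Y)_+ \mid X] = \int_{-\infty}^{t} F_{Y|X}(s)\,ds$. A minimal sketch of a closed-form least-squares fit follows. It assumes the transform $(t - Y)_+$ and a linear projection with an intercept; the function name `relu_regression` and the threshold grid are hypothetical, and the paper's exact specification may differ.

```python
import numpy as np

def relu_regression(X, y, thresholds):
    """OLS of max(t - y, 0) on [1, X] for each threshold t.

    Illustrative sketch under the assumptions stated above; returns an
    array of shape (n_features + 1, n_thresholds).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    # Rows index observations, columns index thresholds.
    targets = np.maximum(thresholds[None, :] - y[:, None], 0.0)
    coefs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coefs
```

Each column of the result holds the fitted coefficients at one threshold, so `design @ coefs` estimates the integrated conditional distribution function on the grid.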

2026-05-28T22:02:55Z Tatsushi Oka http://arxiv.org/abs/2605.30607v1 Orthogonalized Kernel Regression for Spatial and Spatio-Temporal Residual Risk: Application to School Shootings in the Contiguous United States 2026-05-28T21:59:53Z

Distinguishing background heterogeneity from excess risk is a central challenge in case-control event data when both covariates and residual spatial or spatio-temporal structure matter. We develop a covariate-adjusted kernel regression framework that embeds an orthogonalized residual risk surface within a semiparametric binary model, and extend the approach from purely spatial to explicit spatio-temporal analysis. We apply the method to 959 gun violence incidents at public schools in the contiguous United States from 2000 to 2024, using incidents from the K-12 School Shooting Database linked to official school records for the corresponding year. The fitted models identify stable school-level associations, including markedly higher risk for larger schools and for middle and high schools, while also revealing substantial residual structure beyond the background distribution of schools. After adjustment for covariates, excess risk is found to remain concentrated in a persistent central-eastern corridor of the United States, with the strongest evidence appearing in recent years. More broadly, the analysis shows how residual risk surfaces can sharpen inference by separating background heterogeneity from anomalous structure in case-control event processes evolving over space and time.

2026-05-28T21:59:53Z 31 pages, 4 figures Tilman M. Davies Michael R. Desjardins Alexander Hohl Guangzhen Wu http://arxiv.org/abs/2605.30577v1 Dynamic Co-Expression Network Estimation via Multivariate Mixed-Effects Models 2026-05-28T21:10:12Z

High-throughput sequencing technologies have enabled the collection of large-scale longitudinal -omics data, providing new opportunities for studying co-expression networks among molecular nodes such as genes and proteins. However, the high dimensionality and temporal dependence inherent in such data require specialized statistical methods. We propose a novel approach to infer dynamic co-expression networks among features over time (DCENt), where each node (feature) is modeled with a mixed-effects model, and dependencies among nodes are captured through correlated random effects. We develop two innovative penalized algorithms that harness state-of-the-art threshold covariance estimators to estimate the random-effects covariance structure. Simulation studies show improved performance over existing approaches in terms of both mean squared error and mean absolute error. We further apply the methods to data from the CARDIA study to investigate how the protein co-expression networks evolve over time as well as the association between protein trajectory patterns.

2026-05-28T21:10:12Z Samuel Ozminkowski Lifang Hou David R Jacobs Hongmei Jiang http://arxiv.org/abs/2605.30517v1 Restricted mean time lost for survival and competing risks data using mets in R 2026-05-28T19:54:46Z

This paper introduces software implemented in the mets R-package for calculating non-parametric and regression estimates of Restricted Mean Survival Time (RMST) and Restricted Mean Time Lost (RMTL), including RMTL due to specific causes. A unique feature is the ability to compute the non-parametric estimates of RMST and RMTL, as well as their standard errors, for all time horizons simultaneously. Regression modeling in mets is based on Inverse Probability of Censoring Weighting (IPCW) methods. The package implements different versions of IPCW adjusted estimating equations. A critical technical contribution is the provision of influence functions for all models, which enables the computation of standard errors and allows the estimates to be used as building blocks for more complex statistics, such as the while-alive estimate in recurrent events settings. To expand capabilities in causal inference, the mets package also implements methods for standardization estimates (G-computation) and the estimation of Average Treatment Effects (ATE) for both RMST and RMTL in the competing risks setting. Importantly, the computations scale linearly with the number of observations, making the software efficient for use with large datasets.
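The two quantities named above have simple definitions: RMST at horizon $\tau$ is the area under the survival curve on $[0, \tau]$, and RMTL is $\tau$ minus RMST. The sketch below computes both from a Kaplan-Meier curve for a single horizon without competing risks. It is written in Python for illustration and does not use or mirror the mets API; the function name `km_rmst` is hypothetical.

```python
import numpy as np

def km_rmst(time, event, tau):
    """Return (RMST, RMTL) at horizon tau from a Kaplan-Meier curve.

    Illustrative sketch: one horizon, no competing risks, no standard errors.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event & (time <= tau)])
    surv = 1.0   # current Kaplan-Meier survival probability
    prev = 0.0   # left end of the current step
    rmst = 0.0
    for t in event_times:
        # Area of the step on [prev, t) at the current survival level.
        rmst += surv * (t - prev)
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        surv *= 1.0 - deaths / at_risk
        prev = t
    # Final step from the last event time up to tau.
    rmst += surv * (tau - prev)
    return rmst, tau - rmst
```

For example, four uncensored event times 1, 2, 3, 4 with `tau=4` give a step curve with levels 1, 0.75, 0.5, 0.25, so the RMST is 2.5 and the RMTL is 1.5.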

2026-05-28T19:54:46Z Thomas Harder Scheike Klaus Kähler Holst http://arxiv.org/abs/2605.30516v1 Benchmark of Likelihood-Free Inference Methods based on Neural and Optimal Transport Approaches 2026-05-28T19:53:30Z

Simulation-based inference (SBI) has become an increasingly important framework for parameter estimation in models for which simulation is feasible, including cases where likelihood evaluation is unavailable or costly. While recent work has introduced benchmark frameworks to compare likelihood-free methods, these studies often do not account for structural features such as heavy tails or discreteness. In this article, we investigate how the performance of likelihood-free inference methods depends on these structural properties. We consider four approaches (MLE, NBE, EOT, and AW--NBE) and evaluate them using simulations. This study highlights the importance of carefully selecting evaluation tools in the presence of extremes and discrete data.

2026-05-28T19:53:30Z Samira Aka Marie Kratz Philippe Naveau http://arxiv.org/abs/2603.29972v2 Do covariates explain why these groups differ? The choice of reference group can reverse conclusions in the Oaxaca-Blinder decomposition 2026-05-28T19:24:56Z

Scientists often want to explain why an outcome is different in two groups. For instance, differences in patient mortality rates across two hospitals could be due to differences in the patients themselves (covariates) or differences in medical care (outcomes given covariates). The Oaxaca--Blinder decomposition (OBD) is a standard tool to tease apart these factors. It is well known that the OBD requires choosing one of the groups as a reference, and the numerical answer can vary with the reference. To the best of our knowledge, there has been no systematic investigation into whether the choice of OBD reference can yield different substantive conclusions and how common this issue is. In the present paper, we give existence proofs in real and simulated data that the OBD references can in fact yield substantively different conclusions. Our empirical exercises find that this sensitivity is more common when the OBD is extended to more complex regression models, including a pretrained transformer. Our theoretical and empirical results together establish that these conclusion reversals are not entirely driven by model misspecification, small data, or adversarial parameter choices. Our results suggest that practitioners should always report both directions of the OBD; that modern machine learning and large datasets do not automatically resolve the conclusion reversal problem; and that further work on this problem is needed.
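The reference dependence discussed above can be seen in the standard two-fold decomposition for linear models, where the gap in mean outcomes splits into an explained part, $(\bar{x}_A - \bar{x}_B)^\top \beta_{\text{ref}}$, and an unexplained remainder. The sketch below computes both directions with ordinary least squares. The function names and the dictionary keys `ref_a` and `ref_b` are hypothetical, and the sketch covers only the linear case, not the extended models studied in the paper.

```python
import numpy as np

def ols(X, y):
    """OLS with intercept; returns coefficients and mean design row."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta, design.mean(axis=0)

def oaxaca_blinder(Xa, ya, Xb, yb):
    """Two-fold decomposition of mean(ya) - mean(yb) under both references."""
    ya = np.asarray(ya, dtype=float)
    yb = np.asarray(yb, dtype=float)
    beta_a, xbar_a = ols(np.asarray(Xa, dtype=float), ya)
    beta_b, xbar_b = ols(np.asarray(Xb, dtype=float), yb)
    gap = ya.mean() - yb.mean()
    results = {}
    # With reference coefficients from one group, the unexplained part is
    # evaluated at the other group's mean covariates, so the parts sum to the gap.
    for name, beta_ref, xbar_other in (("ref_a", beta_a, xbar_b),
                                       ("ref_b", beta_b, xbar_a)):
        explained = (xbar_a - xbar_b) @ beta_ref
        unexplained = xbar_other @ (beta_a - beta_b)
        results[name] = (explained, unexplained)
    return gap, results
```

With `Xa = [0, 1, 2, 3]`, `ya = 2 * Xa`, `Xb = [1, 2, 3, 4]`, and `yb = 0.5 * Xb`, the gap is 1.75 in both cases, but the explained share is -2 with group A as reference and -0.5 with group B as reference, which illustrates why reporting both directions is informative.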

2026-03-31T16:34:48Z 28 pages, 4 figures Manuel Quintero Advik Shreekumar William T. Stephenson Tamara Broderick http://arxiv.org/abs/2605.30492v1 Shrinkage-Constrained Functional Calibration for Complex Computer Models 2026-05-28T19:16:36Z

We propose a new Bayesian model calibration formalism, which we term integrated bias with full uncertainty (IBFU), as an alternative to the Kennedy-O'Hagan (KOH) framework. In KOH, calibration parameters are modeled as fixed but unknown distributions with relatively weak prior constraints, and their posteriors are inferred jointly with an additive discrepancy Gaussian Process (GP). This formulation often provides limited regularization and leads to confounding pathologies when applied to inexact models with sparse, noisy measurements. By contrast, we represent each calibration parameter as the sum of a fixed best estimate value and a parameter correction represented by an independent GP over the input space, equipped with strong shrinkage priors. Any residual discrepancy that cannot be addressed via parameter correction is captured by an additive discrepancy GP operating on the simulator, similar to KOH. We then impose orthogonality constraints to mitigate confounding between the simulator and modeled additive discrepancy and collinearity between model parameters. Imposing strong complexity shrinkage via conservative hyperpriors forces the mean parameter correction to remain flat across the domain, resulting in predictions that essentially converge with the KOH formulation. However, upon relaxing complexity shrinkage, should the data provide evidence that the effective calibration parameter varies across the domain, the mean parameter correction is allowed to become a function of the domain in a controlled, structured manner. In this sense, our approach is more universal: it effectively nests KOH as a special case while extending it to input-dependent calibration, and it is more tightly constrained, because it anchors the true values around the best estimates and the shrinkage prior actively regularizes the calibration parameters.

2026-05-28T19:16:36Z Template for submission to CMAME (Elsevier) Liam Myhill Enrique Martinez Sez Russcher