http://arxiv.org/abs/2506.04082v4 Adaptive tuning of Hamiltonian Monte Carlo methods 2026-03-03T11:07:53Z

With the recently increased interest in probabilistic models, the efficiency of an underlying sampler becomes a crucial consideration. Hamiltonian Monte Carlo (HMC) is one popular option for models of this kind. Performance of the method, however, strongly relies on a choice of parameters associated with an integration for Hamiltonian equations. To date, such a choice remains mainly heuristic or adds computational cost. We propose a novel computationally inexpensive and flexible approach (we call it Adaptive Tuning or ATune) that, by combining a theoretical analysis of the multivariate Gaussian model with simulation data generated during a burn-in stage of an HMC simulation, detects a system-specific splitting integrator with a set of reliable sampler hyperparameters, including their credible randomization intervals, to be readily used in a production simulation. The method automatically eliminates those values of simulation parameters which could cause undesired extreme scenarios, such as resonance artifacts, low accuracy or poor sampling. The new approach is implemented in the in-house software package HaiCS, with no computational overhead introduced in a production simulation, and can be easily incorporated into any package for Bayesian inference with HMC. The tests on popular statistical models reveal the superiority of adaptively tuned standard and generalized HMC (GHMC) methods in terms of stability, performance and accuracy over conventional HMC tuned heuristically and coupled with the well-established integrators. We also claim that GHMC is preferable for achieving high sampling performance. The efficiency of the new methodology is assessed in comparison with state-of-the-art samplers, e.g. NUTS, in real-world applications, such as endocrine therapy resistance in cancer, modeling of cell-cell adhesion dynamics and influenza A epidemic outbreak.
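
To make concrete why the integrator parameters matter, here is a minimal sketch of the standard leapfrog (velocity Verlet) integrator on a one-dimensional standard Gaussian target. It is not the ATune procedure or the HaiCS implementation; the names `leapfrog`, `hamiltonian`, and `grad_u` and the chosen step size and trajectory length are illustrative assumptions. The energy error it reports is what enters the HMC acceptance probability min(1, exp(-ΔH)).

```python
def leapfrog(q, p, grad_u, step, n_steps):
    """Integrate Hamilton's equations for H(q, p) = U(q) + p**2 / 2."""
    p = p - 0.5 * step * grad_u(q)      # initial half step in momentum
    for i in range(n_steps):
        q = q + step * p                # full step in position
        if i < n_steps - 1:
            p = p - step * grad_u(q)    # full momentum step between position steps
    p = p - 0.5 * step * grad_u(q)      # final half step in momentum
    return q, p


def hamiltonian(q, p):
    """Energy for a standard Gaussian target, U(q) = q**2 / 2 (illustrative choice)."""
    return 0.5 * q * q + 0.5 * p * p


def grad_u(q):
    """Gradient of U(q) = q**2 / 2."""
    return q


q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, grad_u, step=0.1, n_steps=10)
energy_error = hamiltonian(q1, p1) - hamiltonian(q0, p0)

# Reversibility: flip the momentum, integrate again, and the start is recovered.
q_back, p_back = leapfrog(q1, -p1, grad_u, step=0.1, n_steps=10)

print(f"energy error: {energy_error:.2e}")
print(f"round trip: q={q_back:.6f}, p={-p_back:.6f}")
```

Increasing `step` in this sketch makes the energy error grow, which is the kind of sensitivity a tuning method has to manage.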

2025-06-04T15:44:32Z Elena Akhmatskaya Lorenzo Nagar Jose Antonio Carrillo Leonardo Gavira Balmacz Hristo Inouzhe Martín Parga Pazos María Xosé Rodríguez Álvarez http://arxiv.org/abs/2603.02593v1 Composite Wavelet Matrix-Based Transforms and Applications 2026-03-03T04:39:20Z

Orthogonal wavelet transforms are a cornerstone of modern signal and image denoising because they combine multiscale representation, energy preservation, and perfect reconstruction. In this paper, we show that these advantages can be retained and substantially enhanced by moving beyond classical single-basis wavelet filterbanks to a broader class of composite wavelet-like matrices. By combining orthogonal wavelet matrices through products, Kronecker products, and block-diagonal constructions, we obtain new unitary transforms that generally fall outside the strict wavelet filterbank class, yet remain fully invertible and numerically stable. The central finding is that such composite transforms induce stronger concentration of signal energy into fewer coefficients than conventional wavelets. This increased sparsity, quantified using Lorenz curve diagnostics, directly translates into improved denoising under identical thresholding rules. Extensive simulations on Donoho-Johnstone benchmark signals, complex-valued unitary examples, and adaptive block constructions demonstrate consistent reductions in mean-squared error relative to single-basis transforms. Applications to atmospheric turbulence measurements and image denoising of the Barbara benchmark further confirm that composite transforms better preserve salient structures while suppressing noise.
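
The three constructions named above (products, Kronecker products, block-diagonal stacking) all preserve orthogonality, which is what keeps the composite transform invertible and energy preserving. The sketch below checks this for real Haar matrices using only the standard library. The helper names and the particular combinations are assumptions made for illustration, not the paper's transforms, and the sketch does not reproduce the sparsity or denoising results.

```python
import math


def haar(n):
    """Orthonormal Haar matrix of size n x n, with n a power of two."""
    if n == 1:
        return [[1.0]]
    h = haar(n // 2)
    s = 1.0 / math.sqrt(2.0)
    # Averaging rows: kron(h, [1, 1]) / sqrt(2)
    top = [[s * v for v in row for _ in (0, 1)] for row in h]
    # Differencing rows: kron(I, [1, -1]) / sqrt(2)
    bottom = []
    for i in range(n // 2):
        row = [0.0] * n
        row[2 * i], row[2 * i + 1] = s, -s
        bottom.append(row)
    return top + bottom


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def kron(a, b):
    """Kronecker product of two matrices."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def block_diag(a, b):
    """Block-diagonal stacking of two square matrices."""
    n, m = len(a), len(b)
    return [ra + [0.0] * m for ra in a] + [[0.0] * n + rb for rb in b]


def is_orthogonal(w, tol=1e-12):
    g = matmul(w, transpose(w))
    return all(abs(g[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(len(w)) for j in range(len(w)))


kron_w = kron(haar(2), haar(4))            # 8 x 8 Kronecker composite
block_w = block_diag(haar(4), haar(4))     # 8 x 8 block-diagonal composite
prod_w = matmul(kron_w, block_w)           # product of the two composites

signal = [4.0, 4.5, 5.0, 3.0, 0.5, 0.0, -1.0, 2.0]
coeffs = [sum(w * x for w, x in zip(row, signal)) for row in prod_w]
energy_in = sum(x * x for x in signal)
energy_out = sum(c * c for c in coeffs)
print(is_orthogonal(prod_w), abs(energy_in - energy_out) < 1e-9)
```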

2026-03-03T04:39:20Z 30 pages, 9 figures, 6 tables Radhika Kulkarni Brani Vidakovic http://arxiv.org/abs/2603.02574v1 An Augmented Rating System for Test cricket: adapting Glicko's model 2026-03-03T03:45:50Z

The ICC's current ranking system does not adequately account for key contextual factors such as home advantage, toss impact and scheduling imbalances, leading to inconsistencies in team evaluation in Test cricket. This study develops an enhanced rating framework by adapting and extending Glicko's model to incorporate these influences alongside Margin of Victory, an important indicator of dominance in a contest. This enables a more dynamic and probabilistically grounded assessment of team performance. Using past match data, the model demonstrates improved expected score estimation and predictive accuracy. Robustness of the resulting ratings is demonstrated through bootstrap resampling, confirming stability with respect to match scheduling. Overall, the framework provides a fairer and more statistically consistent approach to ranking Test teams.
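
For orientation, the expected score in the original Glicko system can be sketched as follows. This is only the baseline formula that the study adapts; the home advantage, toss, scheduling, and Margin of Victory adjustments are not specified in the abstract and are not reproduced here. The function names and the example ratings are illustrative assumptions.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def g(rd):
    """Attenuation factor: a larger opponent rating deviation shrinks the rating gap."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def expected_score(r, r_opp, rd_opp):
    """Baseline Glicko expected score of a team rated r against an opponent (r_opp, rd_opp)."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))


# Hypothetical ratings, chosen only to show the behaviour of the formula.
even = expected_score(1500.0, 1500.0, 50.0)
favourite_certain = expected_score(1700.0, 1500.0, 30.0)
favourite_uncertain = expected_score(1700.0, 1500.0, 300.0)
print(even, favourite_certain, favourite_uncertain)
```

Equal ratings give an expected score of one half, and greater uncertainty about the opponent pulls the expectation toward one half.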

2026-03-03T03:45:50Z 23 pages, 17 tables, 1 figure Rhitankar Bandyopadhyay Diganta Mukherjee http://arxiv.org/abs/2603.02467v1 CCMnet: A Software Package for Network Generation with Congruence Class Models 2026-03-02T23:17:37Z

We introduce CCMnet, an R package designed to generate network ensembles that accurately reflect the uncertainty inherent in empirical data. While traditional network modeling often results in ensembles with fixed property values or model-determined levels of variability, CCMnet enables a continuous spectrum of variability for network properties, including edge counts, degree distribution, and mixing patterns. By defining probability distributions directly over congruence classes of networks, the package allows researchers to specify the uncertainty in network properties across the generated ensemble to match a specific sampling design or empirical distribution. Furthermore, this formulation provides a principled framework that encompasses several classic models (e.g., Erdős--Rényi model, stochastic block models, and certain exponential random graph models) that implicitly share this structural basis, while offering the flexibility to specify arbitrary, even non-parametric, distributions for network properties. CCMnet implements a Markov chain Monte Carlo (MCMC) framework to sample from these models. The utility of the package is illustrated by generating posterior predictive network ensembles representing school friendship networks.

2026-03-02T23:17:37Z 27 pages, 9 figures, 2 tables Ravi Goyal Victor De Gruttola Natasha K. Martin Lior Rennert Jukka-Pekka Onnela http://arxiv.org/abs/2602.19988v2 Change point analysis of high-dimensional data using random projections 2026-03-02T22:43:02Z

This paper develops a novel change point identification method for high-dimensional data using random projections. By projecting high-dimensional time series into a one-dimensional space, we are able to leverage the rich literature for univariate time series. We propose applying random projections multiple times and then combining the univariate test results using existing multiple comparison methods. Simulation results suggest that the proposed method tends to have better size and power, with more accurate location estimation. At the same time, random projections may introduce variability in the estimated locations. To enhance stability in practice, we recommend repeating the procedure, and using the mode of the estimated locations as a guide for the final change point estimate. An application to an Australian temperature dataset is presented. This study, though limited to the single change point setting, demonstrates the usefulness of random projections in change point analysis.
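
The project, estimate, and aggregate loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: it uses an unweighted CUSUM statistic for the univariate step, Gaussian random directions, and synthetic data with a mean shift, and it omits the hypothesis tests and multiple comparison step entirely. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def cusum_location(x):
    """Return k maximizing |S_k - (k/n) S_n|, the estimated number of pre-change points."""
    n = len(x)
    total = sum(x)
    best_k, best_val, partial = 1, -1.0, 0.0
    for k in range(1, n):
        partial += x[k - 1]
        val = abs(partial - k * total / n)
        if val > best_val:
            best_k, best_val = k, val
    return best_k


def random_direction(dim, rng):
    """Draw a unit vector uniformly on the sphere via normalized Gaussians."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in a))
    return [v / norm for v in a]


def projected_change_point(series, n_proj, rng):
    """Estimate one change point as the mode of CUSUM locations over random projections."""
    dim = len(series[0])
    locations = []
    for _ in range(n_proj):
        a = random_direction(dim, rng)
        projected = [sum(ai * xi for ai, xi in zip(a, row)) for row in series]
        locations.append(cusum_location(projected))
    return Counter(locations).most_common(1)[0][0]


# Synthetic example: 200 observations in 10 dimensions, mean shift after 120 points.
rng = random.Random(0)
n, dim, true_tau = 200, 10, 120
series = [[rng.gauss(0.0, 1.0) + (5.0 if t >= true_tau else 0.0) for _ in range(dim)]
          for t in range(n)]
estimate = projected_change_point(series, n_proj=25, rng=rng)
print("estimated change point:", estimate)
```

Individual projections can land far from the true location when the random direction is nearly orthogonal to the shift, which is why the sketch aggregates with the mode rather than trusting a single projection.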

2026-02-23T15:54:12Z Yi Xu Yeonwoo Rho http://arxiv.org/abs/2603.02437v1 Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models 2026-03-02T22:19:49Z

Analysts routinely use Bayesian hierarchical models to understand natural processes. The no-U-turn sampler (NUTS) is the most widely used algorithm to sample high-dimensional, continuously differentiable models. But NUTS is slowed by high correlations, especially in high dimensions, limiting the complexity of applied analyses. Here we introduce Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix ($Q$). We use Template Model Builder (TMB) to efficiently compute $Q$ from the mode of the Laplace approximation to the marginal posterior, then pass the preconditioned posterior to NUTS through the Bayesian software Stan for sampling. We apply SNUTS to seventeen diverse case studies to demonstrate that preconditioning with $Q$ converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners. SNUTS also outperforms preconditioning with the inverse of the covariance estimated with Pathfinder variational inference. SNUTS does not improve sampling efficiency for models with the highly varying curvature found in funnels, wide tails, or multiple modes. SNUTS is most advantageous, and can be scaled beyond $10^4$ parameters, in the presence of high dimensionality, sparseness, and high correlations, all of which are widespread in applied statistics. An open-source implementation of SNUTS is provided in the R package SparseNUTS.
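
The idea of preconditioning with a sparse precision matrix can be illustrated on a tridiagonal $Q$, whose Cholesky factor $L$ (with $Q = L L^\top$) is bidiagonal and therefore just as sparse. If $x$ has precision $Q$, then $z = L^\top x$ has identity covariance, and mapping back costs one linear-time triangular solve. The sketch below is a generic linear algebra illustration with hypothetical function names and a made-up $Q$; it does not use TMB, Stan, or the SparseNUTS package, and it does not reproduce how SNUTS obtains $Q$.

```python
import math


def tridiagonal_cholesky(diag, off):
    """Factor a symmetric positive definite tridiagonal Q as L L^T.

    Returns the diagonal and subdiagonal of the lower bidiagonal factor L.
    """
    n = len(diag)
    l_diag = [0.0] * n
    l_sub = [0.0] * (n - 1)
    l_diag[0] = math.sqrt(diag[0])
    for i in range(1, n):
        l_sub[i - 1] = off[i - 1] / l_diag[i - 1]
        l_diag[i] = math.sqrt(diag[i] - l_sub[i - 1] ** 2)
    return l_diag, l_sub


def to_whitened(x, l_diag, l_sub):
    """Compute z = L^T x, the decorrelated and rescaled coordinates."""
    n = len(x)
    return [l_diag[i] * x[i] + (l_sub[i] * x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def to_original(z, l_diag, l_sub):
    """Solve L^T x = z by back substitution in linear time."""
    n = len(z)
    x = [0.0] * n
    x[n - 1] = z[n - 1] / l_diag[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (z[i] - l_sub[i] * x[i + 1]) / l_diag[i]
    return x


# Illustrative precision matrix of a strongly correlated chain (assumed values).
q_diag = [2.0] * 5
q_off = [-0.9] * 4
l_diag, l_sub = tridiagonal_cholesky(q_diag, q_off)

x = [0.3, -1.2, 0.8, 2.0, -0.5]
z = to_whitened(x, l_diag, l_sub)
x_back = to_original(z, l_diag, l_sub)
print(max(abs(a - b) for a, b in zip(x, x_back)))
```

A sampler would then work in the `z` coordinates, where the Gaussian part of the target is isotropic, and map draws back with `to_original`.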

2026-03-02T22:19:49Z 26 pages, 12 figures including appendices Cole C. Monnahan NOAA Fisheries Kasper Kristensen Technical University of Denmark James T. Thorson NOAA Fisheries Bob Carpenter Flatiron Institute http://arxiv.org/abs/2404.08480v2 Using ChatGPT for Data Science Analyses 2026-03-02T17:58:41Z

As a result of recent advancements in generative AI, the field of data science is prone to various changes. The way practitioners construct their data science workflows is now irreversibly shaped by recent advancements, particularly by tools like OpenAI's Data Analysis plugin. While it offers powerful support as a quantitative co-pilot, its limitations demand careful consideration in empirical analysis. This paper assesses the potential of ChatGPT for data science analyses, illustrating its capabilities for data exploration and visualization, as well as for commonly used supervised and unsupervised modeling tasks. While we focus here on how the Data Analysis plugin can serve as co-pilot for Data Science workflows, its broader potential for automation is implicit throughout.

2024-04-12T13:57:30Z 19 pages with figures and appendix Harvard Data Science Review, 8(1) (2026) Ozan Evkaya Miguel de Carvalho 10.1162/99608f92.c9429f07 http://arxiv.org/abs/2407.08086v2 The GeometricKernels Package: Heat and Matérn Kernels for Geometric Learning on Manifolds, Meshes, and Graphs 2026-03-02T15:50:42Z

Kernels are a fundamental technical primitive in machine learning. In recent years, kernel-based methods such as Gaussian processes have become increasingly important in applications where quantifying uncertainty is of key interest. In settings that involve structured data defined on graphs, meshes, manifolds, or other related spaces, defining kernels with good uncertainty-quantification behavior, and computing their value numerically, is less straightforward than in the Euclidean setting. To address this difficulty, we present GeometricKernels, a Python software package which implements the geometric analogs of classical Euclidean squared exponential (also known as heat) and Matérn kernels, which are widely used in settings where uncertainty is of key interest. As a byproduct, we obtain the ability to compute Fourier-feature-type expansions, which are widely used in their own right, on a wide set of geometric spaces. Our implementation supports automatic differentiation in every major current framework simultaneously via a backend-agnostic design. In this companion paper to the package and its documentation, we outline the capabilities of the package and present an illustrated example of its interface. We also include a brief overview of the theory the package is built upon and provide some historic context in the appendix.

2024-07-10T23:09:23Z Journal of Machine Learning Research, 2025 Peter Mostowsky Vincent Dutordoir Iskander Azangulov Noémie Jaquier Michael John Hutchinson Aditya Ravuri Leonel Rozo Alexander Terenin Viacheslav Borovitskiy http://arxiv.org/abs/2510.22835v3 Clustering by Denoising: Latent plug-and-play diffusion for single-cell data 2026-03-02T15:34:01Z

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.

2025-10-26T21:03:56Z Dominik Meier Shixing Yu Sagnik Nandy Promit Ghosal Kyra Gan http://arxiv.org/abs/2406.02701v3 MPCR: Multi-Precision Computations Package in R 2026-03-02T09:49:19Z

In the early days of computing, severe memory constraints made it necessary to use lower floating-point precision. As hardware capabilities have advanced, modern systems, particularly in computational statistics and scientific computing, have widely adopted 64-bit precision to reduce numerical errors and support complex calculations. However, in some applications, double-precision accuracy exceeds practical requirements, prompting interest in lower-precision alternatives that decrease computational complexity while maintaining adequate accuracy. This trend has accelerated with the advent of hardware optimized for low-precision computations, such as the Tensor Cores in recent NVIDIA GPUs. Although lower precision can introduce numerical and accuracy challenges, many applications demonstrate robustness under these conditions. Consequently, new multi-precision algorithms have been developed to balance accuracy and computational cost. To facilitate the adoption of these approaches in statistical computing, this article introduces MPCR, a new R package that supports arithmetic operations at 16-, 32-, and 64-bit precision. Written in C++ and integrated with Rcpp, MPCR delivers highly optimized multi-precision computations on both CPU and GPU, enabling seamless low-precision operations. Several examples demonstrate the benefits of MPCR across both performance and accuracy.

2024-06-04T18:28:11Z Mary Lai O. Salvana Sameh Abdulah Minwoo Kim David Helmy Ying Sun Marc G. Genton http://arxiv.org/abs/2504.08214v2 An Optimal Transport-Based Generative Model for Bayesian Posterior Sampling 2026-03-01T20:43:38Z

We investigate the problem of sampling from posterior distributions with intractable normalizing constants in Bayesian inference. Our solution is a new generative modeling approach based on optimal transport (OT) that learns a deterministic map from a reference distribution to the target posterior through constrained optimization. The method uses structural constraints from OT theory to ensure uniqueness of the solution and allows efficient generation of many independent, high-quality posterior samples. The framework supports both continuous and mixed discrete-continuous parameter spaces, with specific adaptations for latent variable models and near-Gaussian posteriors. Beyond computational benefits, it also enables new inferential tools based on OT-derived multivariate ranks and quantiles for Bayesian exploratory analysis and visualization. We demonstrate the effectiveness of our approach through multiple simulation studies and a real-world data analysis.

2025-04-11T02:42:04Z Ke Li Wei Han Yuexi Wang Yun Yang http://arxiv.org/abs/2603.01230v1 Stochastic Neural Networks for Causal Inference with Missing Confounders 2026-03-01T19:02:40Z

Unmeasured confounding is a fundamental obstacle to causal inference from observational data. Latent-variable methods address this challenge by imputing unobserved confounders, yet many lack explicit model-based identification guarantees and are difficult to extend to richer causal structures. We propose Confounder Imputation with Stochastic Neural Networks (CI-StoNet), which parameterizes the conditional structure of a causal directed acyclic graph using a stochastic neural network and imputes latent confounders via adaptive stochastic-gradient Hamiltonian Monte Carlo. Under SUTVA and overlap, and assuming that the structural components of the data-generating process are well approximated by a capacity-controlled sparse deep neural network class, we establish model identification and consistent estimation of the mean potential outcome under a fixed intervention within this class. Although the latent confounder is identifiable only up to reparameterizations that preserve the joint treatment-outcome distribution, the causal estimand is invariant across this observationally equivalent class. We further characterize the effect of overlap on estimation accuracy. Empirical results on simulated and benchmark datasets demonstrate accurate performance, and the framework extends naturally to proxy-variable and multiple-cause settings with overlap diagnostics and bootstrap-based uncertainty quantification.

2026-03-01T19:02:40Z Accepted at the International Conference on Learning Representations (ICLR) 2026 In Proceedings of the International Conference on Learning Representations (ICLR), 2026 Yaxin Fang Faming Liang http://arxiv.org/abs/2603.01184v1 Scaling of learning time for high dimensional inputs 2026-03-01T16:51:18Z

Representation learning from complex data typically involves models with a large number of parameters, which in turn require large amounts of data samples. In neural network models, model complexity grows with the number of inputs to each neuron, with a trade-off between model expressivity and learning time. A precise characterization of this trade-off would help explain the connectivity and learning times observed in artificial and biological networks. We present a theoretical analysis of how learning time depends on input dimensionality for a Hebbian learning model performing independent component analysis. Based on the geometry of high-dimensional spaces, we show that the learning dynamics reduce to a unidimensional problem, with learning times dependent only on initial conditions. For higher input dimensions, initial parameters have smaller learning gradients and larger learning times. We find that learning times have supralinear scaling, becoming quickly prohibitive for high input dimensions. These results reveal a fundamental limitation for learning in high dimensions and help elucidate how the optimal design of neural networks depends on data complexity. Our approach outlines a new framework for analyzing learning dynamics and model complexity in neural network models.

2026-03-01T16:51:18Z 14 pages, 5 figures Carlos Stein Brito http://arxiv.org/abs/2603.01085v1 Recovery-Informed Forecasting Strategy Enhancement 2026-03-01T12:57:04Z

We propose a three-stage framework named Recovery-Informed Strategy Enhancement (RISE) to forecast the recovery of Chinese outbound tourism following the coronavirus disease 2019 pandemic. The framework decomposes the forecasts into three parts: the initial forecasts, the terminal forecasts and the recovery curve forecasts that connect the two points. We integrate multiple sources of information and employ forecast combination techniques in all stages, enhancing both the accuracy and robustness of recovery forecasts. Compared with conventional forecasting approaches, our framework provides a structured and transparent pipeline to integrate model-based forecasts with expert-informed judgment under structural breaks and high uncertainty. Our findings demonstrate the effectiveness of this framework, offering an adaptable tool for recovery trajectory forecasting in post-crisis contexts.

2026-03-01T12:57:04Z Feng Li Taozhu Ruan http://arxiv.org/abs/2410.08939v2 Linear-cost unbiased posterior estimates for crossed effects and matrix factorization models via couplings 2026-03-01T11:32:27Z

We design and analyze unbiased Markov chain Monte Carlo (MCMC) schemes based on couplings of blocked Gibbs samplers (BGSs), whose total computational costs scale linearly with the number of parameters and data points. Our methodology is designed for and applicable to high-dimensional BGS with conditionally independent blocks, which are often encountered in Bayesian modeling. We provide bounds on the expected number of iterations needed for coalescence for Gaussian targets, as well as on the tails of the coalescence times distribution. These imply that practical two-step coupling strategies achieve coalescence times that match the relaxation times of the original BGS scheme up to logarithmic factors. To illustrate the practical relevance of our methodology, we apply it to high-dimensional crossed random effect and probabilistic matrix factorization models, for which we develop a novel BGS scheme with improved convergence speed. Our methodology provides unbiased posterior estimates at linear cost (usually requiring only a few BGS iterations for problems with thousands of parameters), matching state-of-the-art procedures for both frequentist and Bayesian estimation of those models.

2024-10-11T16:05:01Z 48 pages, 10 figures, 1 table Paolo Maria Ceriani Department of Decision Sciences, Bocconi University, Milan, Italy Andrea Pandolfi Department of Decision Sciences, Bocconi University, Milan, Italy Giacomo Zanella Department of Decision Sciences, Bocconi University, Milan, Italy Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy