http://arxiv.org/abs/2604.00763v2 Non-ignorable fuzziness in granular counts: the case of RNA-seq data 2026-04-27T11:21:59Z

RNA-seq count data are often affected by read-to-gene alignment ambiguity, especially in high-dimensional transcriptomics. This type of ambiguity can be conveniently expressed through granular counts, namely fuzzy-valued observations of latent discrete quantities. We study a class of fuzzy-reporting mechanisms and show that, when reporting exploits graded membership, ignorability fails generically, leading to a coarsening-not-at-random structure. A hierarchical model is then introduced as a tractable instance of this construction and illustrated using RNA-seq data.

2026-04-01T11:26:59Z 10 pages, 1 figure, 0 tables. Note: The compressed source folder contains the Supplementary Materials Statistics & Probability Letters, Elsevier, 2026 Antonio Calcagnì Arianna Consiglio Przemyslaw Grzegorzewski Corrado Mencar 10.1016/j.spl.2026.110808 http://arxiv.org/abs/2604.24116v1 Closing the Loop: A Software Framework for AI to Support Business Decision Making 2026-04-27T07:11:57Z

Create an idea, prototype it, evaluate if users like it, then learn. It is the circle of business. If AI can operate in all parts of the circle, it will enable rapid iteration and learning speeds for businesses. Experiment platforms that deploy experiments to evaluate return on investment for businesses are abundant, but systems that help businesses learn personalization, mechanisms, and what to ideate next are rare. The technologies that do exist cannot be well orchestrated in a single software interface that can be safely and efficiently leveraged by an AI agent. These challenges make it difficult to teach an AI agent how to learn within a robust experimentation framework, and difficult for an AI agent to operate and iterate for the business. We offer a two-part solution: one half that is rooted in mathematical reductions to contain complexity, and one half that is rooted in software design to optimize for orchestration, software safety, and multiplicity. Our solution, a software framework, moves beyond the simple treatment effect computed as a difference in means. To create a better understanding of a business and its customers, we enrich causal analysis with heterogeneous effects, policy algorithms, mediation analysis, and forecasts of effects. To have an AI complete the iteration cycle faster, we further enrich the analysis with variance reduction and anytime valid inference. The enrichments are made compatible across different types of experiments, and are presented in a single software interface that is usable in an AI agent. We evaluate the approach on various objectives in experiment analysis, and show that the framework improves code correctness, reduces lines of code, and is more performant than a baseline analysis constructed by a vanilla agent.

2026-04-27T07:11:57Z Jeffrey Wong Antoine Creux http://arxiv.org/abs/2507.20058v4 Modeling Parkinson's Disease Progression Using Longitudinal Voice Biomarkers: A Comparative Study of Statistical and Neural Mixed-Effects Models 2026-04-27T06:59:28Z

Longitudinal voice biomarkers provide a non-invasive source of information for monitoring Parkinson's disease progression, but their statistical analysis is difficult because repeated measurements from the same subject are correlated, clinical cohorts are often small, and disease trajectories can vary substantially across individuals. This study evaluates statistical and neural mixed-effects approaches for modeling Parkinson's disease progression from telemonitoring voice data. Using the Oxford Parkinson's telemonitoring dataset (N=42), we compare Neural Mixed Effects (NME) models, Generalized Neural Network Mixed Models (GNMMs), and semi-parametric Generalized Additive Mixed Models (GAMMs) under the same longitudinal prediction setting. The results show that neural mixed-effects models provide flexible nonlinear representations but can overfit severely in this small-sample setting, whereas GAMMs achieve stronger predictive performance and retain interpretable smooth effects and subject-level structure. In particular, the GAMM-based approach attains the lowest prediction error (MSE 6.56), while the neural baselines have substantially larger errors (MSE > 90). These findings support the use of interpretable statistical mixed-effects models for small longitudinal telemonitoring studies and suggest that larger and more diverse cohorts are needed before highly flexible neural mixed-effects models can be reliably assessed in this application.

2025-07-26T20:56:32Z Published version: Computer Methods and Programs in Biomedicine Update, DOI: 10.1016/j.cmpbup.2026.100242. Version note: https://doi.org/10.5281/zenodo.19804672 Computer Methods and Programs in Biomedicine Update, Volume 9, 2026, Article 100242 Ran Tong Lanruo Wang Tong Wang Wei Yan 10.1016/j.cmpbup.2026.100242 http://arxiv.org/abs/2604.24000v1 Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction 2026-04-27T03:31:57Z

The Laplacian operator transforms the image into its Laplacian field, which usually is sparse and satisfies a stable distribution. On the other hand, an image can be uniquely reconstructed from its Laplacian field by solving a Poisson equation with a proper boundary condition. Such uniqueness is mathematically guaranteed. Thanks to these properties, we propose to use the sparse Laplacian field to represent the image. We first show that the Laplacian field is sparse and satisfies a stable distribution on hundreds of images. Then, we show that the image can be accurately reconstructed from its Laplacian field. For the reconstruction task, we propose a shared-kernel wavelet neural network, which solves the Poisson equation and has three advantages. First, it has fewer than 0.0002M parameters, which is compact enough for most devices. Second, it has linear computational complexity, leading to real-time reconstruction. Third, it achieves higher accuracy than previous methods. Several numerical experiments are conducted to show the effectiveness and efficiency of the sparse Laplacian field and the proposed Poisson solver. The proposed method can be applied in a wide range of applications such as image compression, low-light enhancement, object tracking, etc.
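
The reconstruction claim above can be illustrated in one dimension. The sketch below is not the shared-kernel wavelet network from the abstract: it swaps in a direct tridiagonal (Thomas) solve, and the function names `laplacian_1d` and `reconstruct_1d` are hypothetical. It assumes a discrete second-difference Laplacian and Dirichlet boundary values taken from the two ends of the signal.

```python
def laplacian_1d(u):
    """Discrete Laplacian (second difference) at the interior points of u."""
    return [u[i - 1] - 2.0 * u[i] + u[i + 1] for i in range(1, len(u) - 1)]


def reconstruct_1d(lap, left, right):
    """Recover a signal from its interior Laplacian and two boundary values.

    Solves x[i-1] - 2*x[i] + x[i+1] = lap[i] with the Thomas algorithm.
    Illustrative stand-in only; the abstract's solver is a neural network.
    """
    m = len(lap)
    d = list(lap)
    d[0] -= left     # move the known left boundary value to the right-hand side
    d[-1] -= right   # same for the right boundary value
    a, b, c = 1.0, -2.0, 1.0  # sub-diagonal, diagonal, super-diagonal

    # Forward elimination.
    cp = [0.0] * m
    dp = [0.0] * m
    cp[0] = c / b
    dp[0] = d[0] / b
    for i in range(1, m):
        denom = b - a * cp[i - 1]
        cp[i] = c / denom
        dp[i] = (d[i] - a * dp[i - 1]) / denom

    # Back substitution.
    x = [0.0] * m
    x[-1] = dp[-1]
    for i in range(m - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return [left] + x + [right]


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]  # made-up toy "image row"
field = laplacian_1d(signal)
recovered = reconstruct_1d(field, signal[0], signal[-1])
```

With the boundary values fixed, the tridiagonal system has a unique solution, which is the one-dimensional analogue of the uniqueness the abstract refers to.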

2026-04-27T03:31:57Z Yuanhao Gong Tan Tang Qianyan Liu http://arxiv.org/abs/2604.23961v1 Extended State-dependent Hawkes Process for Limit Order Books: Mathematical Foundation and the Reproduction of Volatility Signature Plots 2026-04-27T02:08:33Z

This paper proposes an Extended State-Dependent Hawkes Process (ExsdHawkes) to model the intricate dynamics of Limit Order Books (LOBs). Our theoretical contribution lies in relaxing traditional constraints by allowing for state disappearances, a phenomenon frequently observed in high-frequency trading. We mathematically prove, using Karush-Kuhn-Tucker (KKT) conditions, that the maximum likelihood estimation remains separable, justifying an efficient two-step procedure. In the empirical section, we apply our model to three months of high-frequency tick data of Mitsubishi UFJ Financial Group (8306). We demonstrate that ExsdHawkes uniquely reproduces the volatility signature plot's characteristic upward slope by capturing the "local super-criticality" triggered during disequilibrium states. Crucially, we identify Marketable Limit Orders (MLO) as the primary catalyst that forces the LOB into these unstable states. Comparative analysis reveals that models lacking physical constraints (e.g., standard SD-Hawkes) suffer from explosive branching ratios and fail to maintain simulation stability. Our findings suggest that physical consistency is not merely a mathematical nicety, but a prerequisite for accurately modeling macro-level volatility. By enforcing the physical geometry to "pause" the residual accumulation during inadmissible periods, ExsdHawkes uniquely maintains statistical integrity where unconstrained models succumb to structural bias.

2026-04-27T02:08:33Z 20 pages, 8 figures. This work was supported by JSPS KAKENHI Grant Number JP20K14366 and CREST, JST Akitoshi Kimura http://arxiv.org/abs/2603.03004v3 eTFCE: Exact Threshold-Free Cluster Enhancement via Fast Cluster Retrieval 2026-04-26T22:01:05Z

Threshold-free cluster enhancement (TFCE) is widely used for cluster-based inference in neuroimaging, but existing implementations typically rely on discretized approximations that may introduce numerical variability. We present eTFCE, an efficient framework that provides a numerically exact evaluation of the TFCE integral using an optimized cluster retrieval algorithm. Across multiple datasets, eTFCE and the standard implementation produce highly consistent inference results. Voxel-wise comparisons reveal a systematic asymmetry: the standard method yields smaller p-values for more voxels, while eTFCE concentrates stronger statistical evidence within a smaller subset. These differences are primarily confined to voxels near the inference boundary and have minimal impact on overall inference. This pattern is consistent with discretization effects in standard implementations, where the TFCE integral is approximated using a finite set of threshold levels, introducing subtle biases in statistical evidence accumulation across thresholds. Furthermore, eTFCE improves computational efficiency (71.3% of runtime on average) and enables unified computation of multiple cluster-based statistics within a single permutation framework. Overall, eTFCE provides an exact, efficient, and extensible approach to nonparametric neuroimaging inference.

2026-03-03T13:56:57Z Revised manuscript with updated analyses and clarifications Xu Chen Wouter D. Weeda Thomas E. Nichols Jelle J. Goeman http://arxiv.org/abs/2604.23834v1 Beyond the mean: Sequence analysis methods for clustering ordinal EMA data 2026-04-26T18:42:47Z

Ecological momentary assessment (EMA) ratings are widely used in studies of behavioral and psychological phenomena to capture real-time data in subjects' real-world environments. Because the data are collected repeatedly over the study period, they provide rich longitudinal rating profiles for each individual. However, the number of observations per subject is often large, while both sample size and sampling intensity can vary substantially across individuals, which complicates the analysis. In some settings, simplified summaries of individual profiles, such as averages computed across the study period, are used for downstream analyses, including regression-style modeling. Although such summaries can be convenient, they may fail to fully capture dynamic temporal patterns present in the complete longitudinal profiles. To address this, we borrow measures from sequence analysis that capture individual-level patterns over time and then apply principal component analysis (PCA) followed by $K$-means clustering to identify unobserved latent groups of individuals with similar profiles. We test our approach using simulated data from a categorical functional regression model and compare its performance with two commonly used methods for detecting unobserved group structures: latent class analysis (LCA) and latent transition analysis (LTA). Using EMA stress observations from a large sample of U.S. adults (Newman et al., 2024, 2025), we identify distinct latent stress profile groups and show that they improve characterization of the impact on cognitive performance.

2026-04-26T18:42:47Z 22 pages, 11 figures, 7 tables Tianyi Wang Anna L. Smith Jillian R. Silva-Jones Wendy Berry Mendes Lauren N. Whitehurst http://arxiv.org/abs/2604.23755v1 Sparse Reduced-rank Regression Methods for Spatially Misaligned Data with Application to Spatial Transcriptomics 2026-04-26T15:11:21Z

Understanding the spatiotemporal dynamics of disease progression in relation to transcriptomic profiles provides key insights into complex conditions such as Alzheimer disease. To enable such investigations, STARmap PLUS technology offers joint profiling of high-resolution spatial transcriptomics and protein detection within the same tissue section. Motivated by data from Zeng et al. (2023), we develop a novel kernel-weighted regression framework that models plaque size as a collective effect of the spatial transcriptomics of neighboring cells, automatically integrating across cell types and tissue samples from different disease states. To further strengthen interpretability and efficiency, we incorporate a sparse low-rank factorization that enables gene selection while borrowing strength across genes, cell types, and time points. The proposed approach is implemented in a fully automated manner with data-driven specification of key model components. Through simulation studies, we demonstrate the robustness of the proposed method and its superiority across a range of specification scenarios. Applied to Alzheimer disease data, the proposed framework uncovers biologically meaningful associations, highlighting its potential for advancing the understanding of disease mechanisms.

2026-04-26T15:11:21Z 35 pages, 4 figures, 2 tables Zitian Wu Susmita Datta Arkaprava Roy http://arxiv.org/abs/2604.23744v1 How temperature regimes near the equinox synchronize spring biological events 2026-04-26T14:47:28Z

Many biological processes, including plant leafout and flowering, occur once cumulative temperatures reach a threshold (the thermal-sum model). In this way, temperatures are thought to coordinate the timing of biological events. But growing evidence suggests that as climates warm, the advancement of spring has slowed (declining sensitivity) and the variance in the timing of spring events has increased (declining synchrony), raising questions about the resilience of temperature-based coordination to anthropogenic climate change. To answer these questions, researchers have complicated the thermal-sum model, introducing additional factors and mechanisms. We consider whether such complexity is necessary. Using results from the theory of stopped random walks, we show that sensitivity and synchrony are exactly as predicted by the basic thermal-sum model. The theory suggests a nonlinear relationship between temperatures and both the timing and synchrony of biological events. In particular, it predicts that as temperatures increase and springtime events shift from the equinox toward the solstice, the events themselves become less coordinated and more variable. We verify these predictions using experimental and real-world data, including 10,000 observations of common lilacs (United States, 1956-2025). We conclude that the theory provides a powerful tool for understanding the thermal-sum model, particularly when considering additional complexity.
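
The basic thermal-sum model described above can be sketched as a stopped random walk: the event occurs on the first day the running temperature sum reaches a threshold. Everything numeric below is an illustrative assumption (Gaussian daily temperatures with a constant mean, a threshold of 100, the names `event_day` and `simulate_event_days`). In particular, a constant mean does not capture the seasonal temperature regime near the equinox, so this sketch shows only the stopping rule and not the abstract's nonlinear synchrony result.

```python
import random
from statistics import mean, pstdev


def event_day(daily_temps, threshold):
    """Return the first day (1-indexed) on which the cumulative sum reaches threshold."""
    total = 0.0
    for day, temp in enumerate(daily_temps, start=1):
        total += temp
        if total >= threshold:
            return day
    raise ValueError("threshold never reached within the supplied days")


def simulate_event_days(mean_temp, sd_temp, threshold,
                        n_years=500, n_days=200, seed=0):
    """Simulate event days over many hypothetical years of i.i.d. Gaussian temperatures."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_years):
        temps = [rng.gauss(mean_temp, sd_temp) for _ in range(n_days)]
        days.append(event_day(temps, threshold))
    return days


# Two hypothetical climates that differ only in mean daily temperature.
cool = simulate_event_days(mean_temp=5.0, sd_temp=2.0, threshold=100.0)
warm = simulate_event_days(mean_temp=10.0, sd_temp=2.0, threshold=100.0)

print("cool: mean day", mean(cool), "sd", pstdev(cool))
print("warm: mean day", mean(warm), "sd", pstdev(warm))
```

Comparing the mean and spread of `cool` and `warm` gives a rough feel for sensitivity (shift in timing per degree) and synchrony (spread in timing) under this simplified setup.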

2026-04-26T14:47:28Z Jonathan Auerbach Andrew Gelman E. M. Wolkovich http://arxiv.org/abs/2605.06686v1 Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices 2026-04-25T22:48:03Z

Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. (2018). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).
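
For readers unfamiliar with the two estimator families named above, here is a minimal sketch of IPW and AIPW for a binary treatment with known propensity scores. It is a textbook simplification, not the paper's refugee-matching evaluation (which involves assignment across locations), and the data, the constant outcome-model predictions, and the function names are all made up for illustration.

```python
def ipw_ate(y, t, e):
    """Inverse probability weighting estimate of the average treatment effect.

    y: outcomes, t: 0/1 treatment indicators, e: propensity scores P(t=1 | x).
    """
    n = len(y)
    return sum(ti * yi / ei - (1 - ti) * yi / (1 - ei)
               for yi, ti, ei in zip(y, t, e)) / n


def aipw_ate(y, t, e, mu1, mu0):
    """Augmented IPW: outcome-model difference plus weighted residual corrections.

    mu1, mu0: outcome-model predictions under treatment and under control.
    """
    n = len(y)
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0):
        total += (m1 - m0)                        # outcome-model contrast
        total += ti * (yi - m1) / ei              # correction from treated units
        total -= (1 - ti) * (yi - m0) / (1 - ei)  # correction from control units
    return total / n


# Toy data (hypothetical): six units, randomized with probability 0.5.
y = [1, 0, 1, 1, 0, 0]
t = [1, 1, 1, 0, 0, 0]
e = [0.5] * 6
mu1 = [2 / 3] * 6   # assumed outcome-model prediction under treatment
mu0 = [1 / 3] * 6   # assumed outcome-model prediction under control

ipw_estimate = ipw_ate(y, t, e)
aipw_estimate = aipw_ate(y, t, e, mu1, mu0)
```

AIPW is consistent if either the propensity model or the outcome model is correct, which is one reason robustness checks often compare it against plain IPW.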

2026-04-25T22:48:03Z 13 pages, 2 figures, 10 tables Kirk Bansak Elisabeth Paulson Dominik Rothenhäusler Jeremy Ferwerda Jens Hainmueller Michael Hotard http://arxiv.org/abs/2510.08893v2 Quantifying Very Extreme Precipitation and Temperature Using Huge Ensembles Generated by Machine Learning-based Climate Model Emulators 2026-04-25T22:26:02Z

Weather extremes produce major impacts on society and ecosystems and are likely to change in likelihood and magnitude with climate change. However, very low probability events are hard to characterize statistically using observations or even climate model output because of short records/runs. For precipitation, consideration of such events arises in quantifying Probable Maximum Precipitation (PMP), namely estimating extreme precipitation magnitudes for designing and assessing critical infrastructure. A recent National Academies report on modernizing PMP estimation proposed using very large climate model-based ensembles to estimate extreme quantiles, possibly through machine learning-based ensemble boosting. Here we assess statistical aspects of such an approach for the contiguous United States using a huge ensemble (10560 years) produced by a state-of-the-art emulator (ACE2) trained on ERA5 reanalysis. The results indicate that one can practically estimate very extreme precipitation and temperature quantiles, provided one uses appropriate statistical extreme value techniques. More specifically, the results provide evidence for (1) the use of threshold-exceedance methods with a sufficiently high threshold (necessary for precipitation) for reliable estimation, (2) the robustness of results to variation in extremes by season and storm type, and (3) the sufficiency of the ensemble for well-constrained statistical uncertainty. Our results also show that the emulator produces extremes outside the range of the ERA5 training data. While encouraging for emulators' potential use for quantifying the climatology of extremes, more investigation is needed to assess whether emulators are fit for this purpose. Our focus is on how to use huge ensembles to estimate very extreme statistics; we expect the results to be relevant for future improved emulators.

2025-10-10T01:12:51Z 28 pages, 11 figures, 5 appendix figures. Published online in Bulletin of the American Meteorological Society on 2026-03-30 Christopher J. Paciorek Daniel Cooley 10.1175/BAMS-D-25-0178.1 http://arxiv.org/abs/2405.16730v2 "Noisier" Noise Contrastive Estimation is (Almost) Maximum Likelihood 2026-04-25T21:17:25Z

Noise Contrastive Estimation (NCE) has fueled major breakthroughs in representation learning and generative modeling. Yet a long-standing challenge remains: accurately estimating ratios between distributions that differ substantially, which significantly limits the applicability of NCE on modern high-dimensional and multimodal datasets. We revisit this problem from a less explored perspective: the magnitude of the noise distribution. Specifically, we show that with a virtually scaled (i.e., artificially increased) noise magnitude, the gradient of the NCE objective can closely align with that of Maximum Likelihood, enabling a trajectory-wise approximation from NCE to MLE, and faster convergence both theoretically and empirically. Building on this insight, we introduce "Noisier" NCE, a simple drop-in modification to vanilla NCE that incurs little to no extra computational cost, while effectively handling density-ratio estimation in challenging regimes where traditional MLE and NCE struggle. Beyond improving classical density-ratio learning, "Noisier" NCE proves broadly applicable: it achieves strong results across image modeling, anomaly detection, and offline black-box optimization. On CIFAR-10 and ImageNet64x64 datasets, it yields 10-step and even 1-step samplers that match or surpass state-of-the-art methods, while cutting training iterations by up to half.

2024-05-27T00:11:53Z ICLR 2026 Peiyu Yu Dinghuai Zhang Hengzhi He Xiaojian Ma Sirui Xie Ruiyao Miao Yifan Lu Yasi Zhang Deqian Kong Ruiqi Gao Jianwen Xie Guang Cheng Ying Nian Wu http://arxiv.org/abs/2604.23438v1 Estimating Causal Attribution of Anthropogenic Forcing on High-Temperature Extremes Using a Latent Gaussian Spatial Model 2026-04-25T20:52:00Z

Climate change has become a significant global concern due to its capacity to cause substantial disruption to daily life by increasing the frequency and intensity of extreme weather events. Given the rising trend of human interventions in the climate system over recent decades, this study aims to quantify the relative contribution of anthropogenic forcing to the increasing likelihood of climate extremes, with a particular emphasis on high-temperature extremes. Our analysis focuses on annual temperature maxima from the IPSL-CM6A model in the CMIP6 experiment. We propose a novel causal inference framework that focuses on differences in return levels derived from annual temperature maxima between the factual and counterfactual worlds. While jointly modeling the annual maxima from the two worlds using a bivariate generalized extreme value distribution, we model the spatially-varying coefficients using a latent Gaussian framework. Specifically, given that the data are available over a $1^\circ \times 1^\circ$ grid, we employ the multivariate intrinsic conditional autoregressive model for the latent layer in the proposed hierarchical model, ensuring proper posterior distributions. We implement a recently developed, highly efficient approximate Bayesian inference technique, 'Max-and-smooth', that uses a Laplace approximation of the likelihood and then performs Gibbs sampling based on the approximate posterior. The results include posterior estimates of the causal effect of anthropogenic forcing on high-temperature extremes, along with the trends in this effect, over the factual world. Furthermore, we estimate credible regions for a significant causal effect to facilitate hotspot detection across the mainland United States.

2026-04-25T20:52:00Z 31 pages, 6 figures, 3 tables, 1 algorithm Ritik Roshan Giri Arnab Hazra http://arxiv.org/abs/2605.06685v1 An audio-to-analysis pipeline with certified transcription for information-theoretic profiling of the piano repertoire 2026-04-25T20:26:38Z

We present an audio-to-analysis pipeline that produces composer-level information-theoretic profiles (reflecting compositional vocabulary as it emerges from aggregated performances) from raw recordings, built on a transcription layer whose accuracy we certify on a standard benchmark (F1 = 0.9791 on the MAESTRO v3.0.0 test set). Applied to 1,238 pieces and 15 MAESTRO composers with at least ten attributed pieces, spanning the Baroque through the early twentieth century, the pipeline derives empirical distributions over harmonic scale degrees and analyzes them through Shannon entropy, asymmetric Kullback-Leibler divergence, and Zipfian rank-frequency modeling. The resulting profiles (i) order composers along an interpretable axis of harmonic predictability, with a narrow entropy range (3.33-3.86 bits) that reveals the marginal-level similarity of tonal vocabularies; (ii) recover known stylistic lineages (Haydn-Beethoven, Liszt-Rachmaninoff, Schubert-Schumann) through the smallest KL divergences in the corpus, with Mendelssohn emerging as a stable outlier within this corpus; and (iii) separate contemporary neoclassical artists (Richter, Frahm, Glass, Arnalds, Jóhannsson) from historical composers on the quality of Zipfian fit to the transition distribution, with mean $R^2 = 0.78$ for neoclassical versus 0.46 for historical (N $\geq$ 10 pieces each). This gap is larger than the spread within either group and is consistent with a minimalist compositional tendency: a compact transition vocabulary used with sharper frequency-rank regularity than historical composers. All estimates are reported with Laplace-smoothed bootstrap 95% confidence intervals.
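
The entropy and divergence measures named above can be sketched on toy data. The five-symbol vocabulary, the two short sequences, and the add-one smoothing constant `alpha` below are hypothetical choices for illustration; the abstract does not specify the pipeline's vocabulary or its exact smoothing scheme.

```python
import math
from collections import Counter


def smoothed_distribution(symbols, vocabulary, alpha=1.0):
    """Laplace-smoothed empirical distribution over a fixed vocabulary."""
    counts = Counter(symbols)
    total = len(symbols) + alpha * len(vocabulary)
    return {s: (counts[s] + alpha) / total for s in vocabulary}


def shannon_entropy(p):
    """Shannon entropy in bits."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; asymmetric in p and q."""
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p if p[s] > 0)


vocabulary = ["I", "ii", "IV", "V", "vi"]            # made-up scale-degree labels
composer_a = ["I", "V", "I", "IV", "V", "I"]         # made-up sequence
composer_b = ["I", "vi", "ii", "V", "vi", "IV", "ii"]

p = smoothed_distribution(composer_a, vocabulary)
q = smoothed_distribution(composer_b, vocabulary)

print("H(a) =", shannon_entropy(p), "bits")
print("D(a || b) =", kl_divergence(p, q))
print("D(b || a) =", kl_divergence(q, p))
```

Smoothing keeps every probability strictly positive, so the divergence stays finite even when one sequence never uses a symbol that the other does.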

2026-04-25T20:26:38Z 25 pages, 4 figures, 25 references Fred Jalbert-Desforges http://arxiv.org/abs/2510.23976v4 Forecasting Arctic Temperatures With Quantile Machine Learning 2026-04-25T18:25:11Z

Using data from the Longyearbyen weather station, quantile gradient boosting ("small AI") is applied to forecast daily temperatures in Svalbard, Norway. Temperatures above 0 degrees Celsius are of special interest because of their impact on ice, snow, and tundra permafrost. To improve forecasting skill for warmer temperatures, the target quantile is 0.60; forecast underestimates are weighted 1.5 times more heavily than forecast overestimates when the quantile loss is computed. Predictors include eight routinely collected indicators of weather conditions, each lagged by 14 days, yielding temperature forecasts with a two-week lead time. Adaptive conformal prediction regions quantify forecasting uncertainty with provably valid coverage. Using a holdout sample, a forecast of 0 degrees Celsius is correct 14 days later at least 80% of the time. Implications for Arctic adaptation policy are discussed.
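
The 1.5-to-1 weighting mentioned above follows directly from the standard quantile (pinball) loss at a target quantile of 0.60, assuming that is the loss being used: underestimates are weighted by 0.60 and overestimates by 0.40. The function name `pinball_loss` and the one-degree errors are illustrative.

```python
def pinball_loss(y_true, y_pred, tau=0.60):
    """Quantile (pinball) loss for a single forecast at target quantile tau."""
    error = y_true - y_pred
    if error >= 0:
        # Forecast too low (underestimate): weighted by tau.
        return tau * error
    # Forecast too high (overestimate): weighted by 1 - tau.
    return (1.0 - tau) * (-error)


under = pinball_loss(y_true=1.0, y_pred=0.0)  # forecast one degree too low
over = pinball_loss(y_true=0.0, y_pred=1.0)   # forecast one degree too high
ratio = under / over                          # tau / (1 - tau)

print("underestimate loss:", under)
print("overestimate loss:", over)
print("ratio:", ratio)
```

Because underestimates cost more, a model minimizing this loss is pushed toward forecasts that sit above the median, which matches the stated goal of improving skill for warmer temperatures.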

2025-10-28T01:16:11Z 30 pages, 8 figures Richard Berk