https://arxiv.org/api/4uu3tBuLxQTOzt7Zs53cOv2SCRo 2026-06-14T00:12:12Z 78354 180 15 http://arxiv.org/abs/2510.12744v2 Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps 2026-06-08T01:14:11Z We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $ε$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. 
On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training. 2025-10-14T17:23:44Z Do Tien Hai, Trung Nguyen Mai, and TrungTin Nguyen are co-first authors. In Proceedings of The 29th International Conference on Artificial Intelligence and Statistics, AISTATS 2026 Spotlight, Acceptance rate 2.5% over 2102 submissions Do Tien Hai Trung Nguyen Mai TrungTin Nguyen Nhat Ho Binh T. Nguyen Christopher Drovandi http://arxiv.org/abs/2505.13410v2 Joint stochastic localization and applications 2026-06-07T23:23:04Z Stochastic localization is a pathwise analysis technique that has emerged as a powerful tool in high-dimensional probability and sampling. In this work, we extend stochastic localization to a joint framework for coupling probability measures and explore its applications in distributional data analysis. We first unify existing stochastic localization processes under Eldan's $α$-scheme and characterize their localization rates. Building on this, we introduce a joint scheme to couple probability measures via concurrent $α$-schemes driven by a shared Brownian motion. This construction is canonical and induces a family of metrics on the space of probability measures, which we call Eldan's $α$-distance. Alternative variants that extrapolate optimal Gaussian couplings to log-concave measures are also discussed. We study the theoretical properties of Eldan's $α$-distance, including its restriction to Gaussian measures and its behavior under affine transformations. For $α= 0$, we show it is topologically equivalent to the $2$-Wasserstein distance for measures supported on a common compact set; we also relate its weighted variants to linearized optimal transport in Wiener space and to score-matching objectives in training diffusion models. 
Computationally, we develop efficient estimators for Eldan's $α$-distance in the cases $α=0$ and $α=1/2$, with rigorous error guarantees for log-concave and finitely supported measures in the former setting and Gaussian measures in the latter. Finally, we apply Eldan's $α$-distance as a scalable surrogate for the $2$-Wasserstein distance to enable fast pairwise distance estimation and approximate computation of Wasserstein barycenters. 2025-05-19T17:47:05Z 68 pages; substantial revision including correcting an error in Theorem 3.1 (iii) in the previous version and adding a few new results Tom Alberts Yiming Xu Qiang Ye http://arxiv.org/abs/2512.00239v2 Self-Supervised Dynamical System Representations for Physiological Time-Series 2026-06-07T22:52:48Z The effectiveness of self-supervised learning (SSL) for physiological time series depends on the ability of a pretraining objective to preserve information about the underlying physiological state while filtering out unrelated noise. However, existing strategies are limited by their reliance on heuristic principles or poorly constrained generative tasks. To address this limitation, we propose a pretraining framework that exploits the information structure of a dynamical systems generative model across multiple time series. This framework reveals our key insight that class identity can be efficiently captured by extracting information about the generative variables related to the system parameters shared across similar time series samples, while noise unique to individual samples should be discarded. Building on this insight, we propose PULSE, a cross-reconstruction-based pretraining objective for physiological time series datasets that explicitly extracts system information while discarding non-transferable, sample-specific information. We establish theory that provides sufficient conditions for the system information to be recovered, and empirically validate it using a synthetic dynamical systems experiment. 
Furthermore, we apply our method to diverse real-world datasets, demonstrating that PULSE learns representations that can broadly distinguish semantic classes, increase label efficiency, and improve transfer learning. 2025-11-28T22:53:31Z Accepted to ICML 2026 Yenho Chen Maxwell A. Xu James M. Rehg Christopher J. Rozell http://arxiv.org/abs/2606.08854v1 sGPO: Trading Inference FLOPs for Training Efficiency in RLVR 2026-06-07T21:47:31Z Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included. 
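The group-size rule described in the sGPO abstract above (training rollout group size set to the inverse of the profiled success rate, with trivial queries removed and the rest ordered from easy to hard) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the clipping bounds `min_size` and `max_size`, and the choice to keep never-solved queries at the cap (rather than sub-sampling them, as the abstract describes) are all assumptions made for illustration.

```python
import math


def profile_success_rate(outcomes):
    """Empirical success rate from a list of 0/1 verifier outcomes."""
    return sum(outcomes) / len(outcomes)


def rollout_group_size(success_rate, min_size=2, max_size=64):
    """Group size of roughly 1 / success_rate, clipped to [min_size, max_size].

    The clipping bounds are hypothetical; the abstract only states the
    inverse-success-rate rule.
    """
    if success_rate <= 0.0:
        # Never solved during profiling: fall back to the cap (assumption).
        return max_size
    return max(min_size, min(max_size, math.ceil(1.0 / success_rate)))


def plan_queries(profiles, trivial_threshold=1.0):
    """Build (query_id, success_rate, group_size) tuples, easy to hard.

    `profiles` maps a query id to its list of 0/1 profiling outcomes.
    Queries solved at or above `trivial_threshold` are filtered out.
    Sub-sampling of unsolvable queries is omitted in this sketch.
    """
    plan = []
    for query_id, outcomes in profiles.items():
        rate = profile_success_rate(outcomes)
        if rate >= trivial_threshold:
            continue  # trivial query: no learning signal expected
        plan.append((query_id, rate, rollout_group_size(rate)))
    # Curriculum: highest success rate (easiest) first.
    plan.sort(key=lambda item: -item[1])
    return plan
```

For example, a query solved in 1 of 4 profiling samples would receive a group size of 4 under this sketch, while a query solved in all samples would be dropped.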
2026-06-07T21:47:31Z Shivchander Sudalairaj Kai Xu Akash Srivastava Giorgio Giannone http://arxiv.org/abs/2606.08850v1 Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability 2026-06-07T21:43:37Z Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification. 2026-06-07T21:43:37Z preprint Giorgio Giannone Mustafa Eyceoz Shabana Baig Shivchander Sudalairaj Anna C. 
Doris Faez Ahmed Akash Srivastava Kai Xu http://arxiv.org/abs/2407.01718v2 Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets 2026-06-07T21:28:05Z Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, resulting in misaligned embeddings using traditional techniques. In this work, we propose Entropic Optimal Transport (EOT) eigenmaps, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align them in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many favorable analogous properties. We analyze a generative model in which two observed high-dimensional datasets share latent variables supported on a common low-dimensional manifold, while each dataset is subject to translation, geometric distortion, orthogonal nuisance structure, and noise. In a large-sample, high-dimensional regime, we prove that the EOT plan concentrates around a population kernel on an effective manifold determined by the geometric mean of the distortions, with invariance to translations, orthogonal nuisance structure, and noise. Subsequently, we relate our embedding to eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. 
Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios. 2024-07-01T18:48:55Z Boris Landa Yuval Kluger Rong Ma http://arxiv.org/abs/2602.00797v3 Zero-Flow Encoders 2026-06-07T16:59:20Z Flow-based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine-grained structural details beyond generation tasks. This paper presents a flow-inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at $t=0.5$ if and only if the source and target distributions are identical. We term this property the \emph{zero-flow criterion}. Second, we show that this criterion can certify conditional independence, thereby extracting \emph{sufficient information} from the data. Third, we translate this criterion into a tractable, simulation-free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self-supervised learning tasks. Experiments on both simulated and real-world datasets demonstrate the effectiveness of our approach. The code reproducing our experiments can be found at: https://github.com/probabilityFLOW/zfe. 2026-01-31T16:11:01Z Yakun Wang and Leyang Wang contributed equally to this work; As published at ICML 2026 Yakun Wang Leyang Wang Song Liu Taiji Suzuki http://arxiv.org/abs/2506.01052v3 A Robust $\widetilde{\mathcal{O}}(1/\sqrt{T})$ Rate for Unprojected TD Learning with Linear Function Approximation 2026-06-07T16:29:36Z We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone of reinforcement learning. 
We are interested in the so-called ``robust'' setting, where the convergence guarantee does not depend on the potential function's minimal curvature. While prior work has established convergence guarantees in this setting, these results typically rely on the artificial assumption that each iterate is projected onto a bounded set. Removing this condition was left as an open problem by Bhandari et al. (COLT'18), who hypothesized the need for additional ``regularity conditions''. In this paper, we show that the simple unprojected TD(0) converges with a rate of $\widetilde{\mathcal{O}}\left(\frac{\|θ^*\|^2_2}{\sqrt{T}}\right)$ in expectation, even in the presence of Markovian noise. We do not require an additional regularity condition, but only a minor polylog correction to the learning rate. Our analysis reveals a novel self-bounding property of the TD updates and exploits it to guarantee bounded iterates. 2025-06-01T15:39:00Z Wei-Cheng Lee Francesco Orabona http://arxiv.org/abs/2502.15131v4 Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling 2026-06-07T15:55:48Z We study the fundamental problem of calibrating a linear binary classifier of the form $σ(\hat{w}^\top x)$, where the feature vector $x$ is Gaussian, $σ$ is a link function, and $\hat{w}$ is an estimator of the true linear weight $w_\star$. By interpolating with a noninformative $\textit{chance classifier}$, we construct a well-calibrated predictor whose interpolation weight depends on the angle $\angle(\hat{w}, w_\star)$ between the estimator $\hat{w}$ and the true linear weight $w_\star$. We establish that this angular calibration approach is provably well-calibrated in a high-dimensional regime where the number of samples and features both diverge at a comparable rate. The angle $\angle(\hat{w}, w_\star)$ can be consistently estimated. 
Furthermore, the resulting predictor is uniquely $\textit{Bregman-optimal}$, minimizing the Bregman divergence to the true label distribution within a suitable class of calibrated predictors. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions. Additionally, we identify conditions under which a classical Platt-scaling predictor converges to our Bregman-optimal calibrated solution. Thus, Platt-scaling also inherits these desirable properties provably in high dimensions. 2025-02-21T01:24:27Z Yufan Li Pragya Sur http://arxiv.org/abs/2606.08679v1 Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation 2026-06-07T15:31:29Z Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards. 
2026-06-07T15:31:29Z Bitya Neuhof Yuval Benjamini http://arxiv.org/abs/2412.16457v3 Robust Random Graph Matching in Dense Graphs via an Approximate Message Passing Type Algorithm 2026-06-07T12:25:36Z In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. We are particularly interested in a robust version of this problem in which the observation is a perturbed input $(A+E,B+F)$, where $(A,B)$ is a pair of correlated Gaussian Wigner matrices and $E,F$ are adversarially chosen matrices supported on an unknown $εn \times εn$ principal minor of $A$ and $B$, respectively. We propose an approximate message passing (AMP) type iterative algorithm that succeeds in polynomial time as long as the correlation $ρ$ between $(A,B)$ is a non-vanishing constant and $ε= o\big( \tfrac{1}{(\log n)^{20}} \big)$. A key distinction from standard AMP is the introduction of a time-dependent matrix multiplication step within the iteration, which simultaneously enlarges the feature dimension and cancels the correlation during the iteration. The main methodological inputs for our result are the iterative random graph matching algorithm proposed in \cite{DL22+, DL23+} and the spectral preprocessing procedure proposed in \cite{IS24+}. To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust to arbitrary adversarial perturbations of size $n^{1-o(1)}$. 2024-12-21T03:15:38Z 46 pages; accepted by IEEE Trans. Inf. Theory Zhangsong Li http://arxiv.org/abs/2606.08587v1 Improving the sharpness in neural network-based parametric post-processing of ensemble forecasts 2026-06-07T11:57:09Z Statistical post-processing has proven to be an effective tool in improving ensemble forecasts of different weather variables. 
Case studies show that post-processing can remedy the typically underdispersive and potentially biased behaviour of the ensemble while optimizing a proper scoring rule expressing the forecast skill. The price of these positive effects is generally a deterioration in sharpness; the width of the central prediction intervals and the uncertainty of the predictions increase, especially for shorter lead times. This work aims to reduce the extent of the latter phenomenon for neural network-based parametric post-processing methods by extending the network's loss function with a penalty term. We demonstrate the effect of the proposed technique for 2 m temperature ensemble forecasts of the European Centre for Medium-Range Weather Forecasts downloaded from the EUPPBench benchmark dataset and verified against synoptic observations. Here, the predictive distribution is Gaussian, and we use the continuous ranked probability score (CRPS) as the loss function. The case studies confirm a substantial relative decrease ($8.2\%$ to $12.5\%$) in the width of the nominal central prediction interval compared to that of the predictive distribution computed without the penalty term, while there is no deterioration in the mean CRPS of probabilistic forecasts or in the RMSE of the predictive mean. 2026-06-07T11:57:09Z 18 pages Ágnes Baran Máté Mihalina http://arxiv.org/abs/2606.05797v2 Causal Longitudinal Prior-Fitted Networks for Counterfactual Outcome Prediction 2026-06-07T11:10:09Z Longitudinal treatment decisions from multivariate time-series data require predicting potential outcomes under future treatment sequences in the presence of time-varying confounding, heterogeneous patient dynamics, and limited domain-specific data. Existing longitudinal causal estimators typically address this problem by training a new model for each cohort or simulator. 
We introduce Causal Longitudinal Prior-Fitted Networks (CausalLongPFN), a prior-fitted network for time-series causal inference in longitudinal treatment-response data and zero-shot in-context counterfactual outcome prediction. The model is pretrained entirely on synthetic episodes sampled from a broad prior over temporal structural causal models, exposing it to treatment-confounder feedback, latent heterogeneity, nonlinear state evolution, delayed effects, and cumulative treatment responses. At test time, CausalLongPFN remains frozen and is used zero-shot: it conditions on support trajectories, a query history, and a planned future treatment sequence, and returns a predictive distribution over future outcomes without gradient updates or propensity-model fitting. Multi-step predictions are obtained by recursively applying the one-step predictor under the specified treatment sequence. We evaluate the model on branchable cancer, HIV, and warfarin benchmarks with ground-truth counterfactual labels, and on factual-only rolling-origin prediction in MIMIC-III ICU trajectories. CausalLongPFN is competitive with domain-trained longitudinal baselines on counterfactual benchmarks and performs strongly on factual MIMIC-III prediction, suggesting that broad synthetic causal pretraining can provide a frozen, amortized alternative for zero-shot longitudinal treatment-response prediction when repeated domain-specific training is costly or impractical. 2026-06-04T07:26:40Z 31 pages, 10 tables Amirhossein Zare Amirhessam Zare Herlock Rahimi Reza Salarikia Mohammad Kashkooli http://arxiv.org/abs/2606.08560v1 CP-factorization for high dimensional tensor time series and double projection iterations 2026-06-07T10:34:03Z We adopt the canonical polyadic (CP) decomposition to model high-dimensional tensor time series. Our primary goal is to identify and estimate the factor loadings in the CP decomposition. 
We propose a one-pass estimation procedure through standard eigen-analysis for a matrix constructed based on the serial dependence structure of the data. The asymptotic properties of the proposed estimator are established under a general setting as long as the factor loading vectors are linearly independent, allowing the factors to be correlated and the factor loading vectors to be not nearly orthogonal. The procedure adapts to the sparsity of the factor loading vectors, accommodates weak factors, and demonstrates strong performance across a wide range of scenarios. To further reduce estimation errors, we also introduce an iterative algorithm based on a novel double projection approach. We theoretically justify the improved convergence rate of the iterative estimator, and derive the associated limiting distribution. A consistent estimator of the asymptotic variance is also provided, which plays a key role in the related inference problems. All results are validated through extensive simulations and two real data applications. 2026-06-07T10:34:03Z Jinyuan Chang Guanglin Huang Qiwei Yao Long Yu http://arxiv.org/abs/2507.12843v3 Are Two Datasets Close Enough With Statistical Significance? A Kernel Distributional Closeness Testing Approach 2026-06-07T06:41:08Z Are two distributions close to each other with statistical significance? Distribution closeness testing (DCT) formalizes this question by testing whether the distance between a distribution pair is at least epsilon-far. Existing DCT methods mainly measure discrepancies between distribution pairs defined on discrete spaces, for example using total variation, which limits their application to complex data such as images. To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measure of distributional discrepancy between complex distributions, into DCT scenarios. 
However, empirical results indicate that many distribution pairs can have the same MMD value despite having different norms in the same reproducing kernel Hilbert space (RKHS). These pairs may exhibit different finite-sample distinguishability and reflect different practical closeness levels, making MMD less informative for DCT. To mitigate this issue, we design a new measure of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales the MMD value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we propose NAMMD-based DCT to assess the closeness level of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power than MMD-based DCT while maintaining bounded type-I error. This is further validated by extensive experiments on multiple types of data, including synthetic noise and real images. Our code is available at https://github.com/zhijianzhouml/NAMMD. 2025-07-17T07:08:54Z Zhijian Zhou Liuhua Peng Xunye Tian Mingming Gong Feng Liu
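For readers unfamiliar with the maximum mean discrepancy that the last entry above builds on, the following is a minimal sketch of the standard unbiased estimator of squared MMD with a Gaussian kernel. It illustrates plain MMD only: the norm-adaptive scaling of NAMMD is not specified in the abstract and is deliberately not reproduced here. The function names and the default `bandwidth` are assumptions chosen for illustration; the authors' code is at the repository linked in the entry.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples xs and ys.

    Within-sample terms exclude the diagonal (i == j), which is what makes
    the estimator unbiased; as a result it can be slightly negative when
    the two samples come from the same distribution.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    k_xx = sum(
        gaussian_kernel(xs[i], xs[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    k_yy = sum(
        gaussian_kernel(ys[i], ys[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    k_xy = sum(
        gaussian_kernel(x, y, bandwidth) for x in xs for y in ys
    ) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy
```

Calling `mmd2_unbiased` on two well-separated samples gives a clearly larger value than calling it on two samples with identical points, which is the behaviour a closeness test based on a kernel discrepancy relies on.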