http://arxiv.org/abs/2601.21522v2 More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD) 2026-06-06T17:02:11Z

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior in pass@k leads to sublinear growth of coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a method for querying LLMs that increases coverage@cost for a given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs across coding (HumanEval), math (GSM8K), and reasoning (MMLU-Pro) benchmarks demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws. ReD retains its advantage with imperfect verifiers and outperforms the tested allocation baselines.
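
As an illustration of the pass@k metric mentioned in this abstract, the sketch below uses the standard combinatorial estimator that is widely used with code benchmarks such as HumanEval. It is not taken from this paper and does not implement ReD; the function names and the `(n, c)` input format are assumptions made for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question from n sampled attempts, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k of the n attempts contains no correct answer.
    Assumes n >= k.
    """
    if n - c < k:
        # Fewer than k incorrect attempts: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over questions; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(2, 1), (2, 2)], 1)` averages a question solved in one of two attempts with a question solved in both.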

2026-01-29T10:37:32Z Sagi Meir Tommer D. Keidar Noam Levi Shlomi Reuveni Barak Hirshberg http://arxiv.org/abs/2606.08218v1 How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs 2026-06-06T15:12:43Z

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = \Theta(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $\pi_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $\pi_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.

2026-06-06T15:12:43Z Mark Kozdoba Shie Mannor http://arxiv.org/abs/2606.08203v1 Stable and Scalable Probabilistic Numerical Solvers for Stiff and High-Dimensional ODEs 2026-06-06T14:42:03Z

Filtering-based probabilistic numerical solvers for ordinary differential equations (ODEs) have been established as a flexible and efficient simulation framework with built-in numerical uncertainty quantification. However, problems that are both stiff and high-dimensional remain a challenge, as current methods are either stable and have cubic cost in the ODE dimension, or scale linearly at the expense of stability. In this paper, we close this gap and develop probabilistic ODE solvers that are both stable and scalable. We propose two complementary strategies. First, we develop a matrix-free update step that uses Jacobian-vector products, iterative linear solvers, and stochastic covariance estimation to enable linear scaling, all while retaining stability. Second, we propose iterative re-linearization to further improve stability without sacrificing scalability, turning probabilistic ODE solvers into fully implicit methods. We evaluate the proposed approaches on a range of stiff and high-dimensional problems and demonstrate improved stability and scalability over established probabilistic solvers.

2026-06-06T14:42:03Z Nathanael Bosch http://arxiv.org/abs/2606.08202v1 Vector Space of Cycles 2026-06-06T14:40:32Z

Most statistical and machine learning methods for directed interactions focus on pairwise effects among variables. Even existing cyclic models represent feedback primarily through node-level dependencies, making large-scale recurrent organization difficult to estimate and compare. This limitation is particularly acute in biological and neural systems, where interactions are highly recurrent and involve many overlapping cycles. We introduce a variational framework for statistical inference on cyclic interactions. Directed interactions are represented as edge flows on a simplicial complex and evolved under an energy-minimizing dynamical system. The resulting dynamics separate transient interaction components from persistent harmonic flows, yielding a low-dimensional cycle space that captures stable recurrent organization. Rather than enumerating individual cycles, the proposed framework represents cyclic interactions as elements of a Hilbert space, enabling projection, averaging, comparison, and population-level statistical inference. We establish theoretical properties of the harmonic projection, including characterization of the cycle space, variance reduction, and population inference. Simulations demonstrate substantially improved recovery of cyclic structure in dense recurrent systems compared with existing directed-interaction methods. Applied to resting-state fMRI from 400 human subjects, the framework reveals reproducible large-scale cyclic organization that is not detectable through edgewise averaging. These results provide a scalable statistical framework for studying recurrent interactions in high-dimensional dynamical systems.

2026-06-06T14:40:32Z Moo K. Chung Anass B. El-Yaagoubi Hernando Ombao http://arxiv.org/abs/2602.05869v2 Wedge Sampling: Efficient Tensor Completion with Nearly-Linear Sample Complexity 2026-06-06T14:29:26Z

We introduce Wedge Sampling, a new non-adaptive sampling scheme for low-rank tensor completion. We study recovery of an order-$k$ low-rank tensor of dimension $n \times \cdots \times n$ from a subset of its entries. Unlike the standard uniform entry model (i.e., i.i.d. samples from $[n]^k$), wedge sampling allocates observations to structured length-two patterns (wedges) in an associated bipartite sampling graph. By directly promoting these length-two connections, the sampling design strengthens the spectral signal that underlies efficient initialization, in regimes where uniform sampling is too sparse to generate enough informative correlations. Our main result shows that this change in sampling paradigm enables polynomial-time algorithms to achieve both weak and exact recovery with nearly linear sample complexity in $n$. The approach is also plug-and-play: wedge-sampling-based spectral initialization can be combined with existing refinement procedures (e.g., spectral or gradient-based methods) using only an additional $\tilde{O}(n)$ uniformly sampled entries, substantially improving over the $\tilde{O}(n^{k/2})$ sample complexity typically required under uniform entry sampling for efficient methods. Overall, our results suggest that the statistical-to-computational gap highlighted in Barak and Moitra (2022) is, to a large extent, a consequence of the uniform entry sampling model for tensor completion, and that alternative non-adaptive measurement designs that guarantee a strong initialization can overcome this barrier.

2026-02-05T16:47:13Z COLT 2026 arXiv version. 65 pages, 3 figures Hengrui Luo Anna Ma Ludovic Stephan Yizhe Zhu http://arxiv.org/abs/2606.08196v1 Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables 2026-06-06T14:25:03Z

We study causal discovery from observational data when some variables are hidden and the data-generating process follows a location-scale noise model (LSNM). Existing methods that handle hidden confounders typically assume additive noise, but in practice, causes often modulate not just the mean but also the variance of their effects. We prove that acyclic directed mixed graphs (ADMGs) satisfying a bow-free condition are identifiable under LSNM with hidden variables, establishing the first identifiability result for causally insufficient models beyond noise additivity. We further provide sufficient conditions for identifying causal direction even when the bow-free assumption is violated. Our two-stage algorithm, LSNM-UV, is sound and complete, and experiments demonstrate improved performance over additive baselines on heteroscedastic data.
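
For orientation, a location-scale noise model for a single cause-effect pair is commonly written as below. The symbols $f$, $g$, and $N$ are illustrative notation rather than the paper's, and the hidden-variable setting studied in the abstract is not shown.

```latex
% Bivariate location-scale noise model (illustrative notation):
% the cause X shifts the mean through f and scales the noise through g.
\[
  Y = f(X) + g(X)\,N, \qquad g(x) > 0, \qquad N \perp X .
\]
% The additive noise model is the special case in which g is constant:
\[
  Y = f(X) + N .
\]
```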

2026-06-06T14:25:03Z 33 pages, 4 figures Mariyam Khan Shohei Shimizu Thong Pham http://arxiv.org/abs/2507.00260v3 Disentangled Feature Importance 2026-06-06T14:24:30Z

When predictors are statistically dependent, the appropriate definition of feature importance depends on the operational goal. Conditional-incremental measures are well-suited for feature selection, acquisition, and compression, where shared predictive information is treated as redundancy. For post-hoc interpretation, however, the goal is often to attribute predictive signals across correlated measurement channels. We introduce Disentangled Feature Importance (DFI), a population-level attribution framework for this setting. DFI maps covariates to an independent latent representation under a specified entropic optimal transport geometry, computes latent importance, and attributes it back to the original covariates through barycentric sensitivities. We show that broad conditional-incremental FI functionals target conditional incremental predictive value under squared-error loss, and therefore answer a different question from attribution of shared predictive signal under dependence. Under fixed transport cost, reference law, and regularization level, DFI defines a well-specified family of estimands. Latent scores admit a functional ANOVA interpretation, and in the Gaussian linear case, the attributed DFI recovers the classical $R^2$ decomposition for correlated regressors. We derive influence-function-based inference under nuisance-rate and smoothness conditions, and show in simulations and an HIV-1 neutralization-resistance analysis that DFI yields stable, interpretable, uncertainty-quantified attributions of shared predictive signal.

2025-06-30T20:54:48Z 29 main and 44 supplementary pages Jin-Hong Du Kathryn Roeder Larry Wasserman http://arxiv.org/abs/2604.25965v2 Adversarial Robustness of NTK Neural Networks 2026-06-06T14:19:58Z

Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression. We establish minimax optimal rates for adversarial regression in Sobolev spaces and then show that NTK neural networks, trained via gradient flow with early stopping, can achieve this optimal rate. However, in the overfitting regime, we prove that the minimum norm interpolant is vulnerable to adversarial perturbations.

2026-04-28T04:49:31Z Yuxuan Hou http://arxiv.org/abs/2505.24066v2 Adaptive Resolution for Finite-Rank Gaussian Processes 2026-06-06T13:00:34Z

Finite-rank approximations are widely used to scale Gaussian process (GP) regression, but their posterior behavior can differ from that of the corresponding parent GP prior. We study a class of finite-rank GP priors built from locally supported basis expansions with dependent Gaussian coefficients. Our framework covers finite-element approximations based on the stochastic partial differential equation (SPDE) representation of Matérn GPs and regular-grid GP interpolation schemes. We show that, with a suitable prior on the resolution parameter $N$, these finite-rank expansions inherit the same posterior contraction rate as the corresponding parent GP prior under the same bandwidth specification used for that parent prior. Consequently, the interpolation construction under a squared-exponential parent GP attains the minimax-optimal rate up to logarithmic factors under a hierarchical prior on the bandwidth parameter and on $N$, while the SPDE construction attains the same rate under a bandwidth scaling depending on the sample size and the smoothness of the true function, together with a prior on $N$. We also develop a posterior sampler for the hierarchical interpolation model that jointly updates the resolution and bandwidth parameters, and we provide numerical studies that support the theory.

2025-05-29T23:18:33Z 48 pages, 5 figures Jaehoan Kim Anirban Bhattacharya Debdeep Pati http://arxiv.org/abs/2601.07013v2 Conditional Normalizing Flows for Forward and Backward Joint State and Parameter Estimation 2026-06-06T10:35:05Z

Traditional filtering algorithms for state estimation -- such as classical Kalman filtering, unscented Kalman filtering, and particle filters -- degrade when applied to nonlinear systems whose uncertainty follows arbitrary non-Gaussian and potentially multi-modal distributions. This study reviews recent approaches to state estimation via nonlinear filtering based on conditional normalizing flows, where the conditional embedding is generated by standard MLP architectures, transformers, or selective state-space models (such as Mamba-SSM). In addition, we test the effectiveness of an optimal-transport-inspired kinetic loss term in mitigating overparameterization in flows consisting of a large collection of transformations. We investigate the performance of these approaches on applications relevant to autonomous driving and patient population dynamics, paying special attention to how they handle time inversion and chained predictions. Finally, we assess the performance of various conditioning strategies for an application to real-world COVID-19 joint SIR system forecasting and parameter estimation.

2026-01-11T18:01:42Z Luke S. Lagunowich Guoxiang Grayson Tong Daniele E. Schiavazzi http://arxiv.org/abs/2606.08084v1 Assessing model calibration with boosting trees 2026-06-06T10:14:36Z

The main goal of regression modelling is to approximate the conditional mean of a response given a set of features. A regression function is said to be calibrated if the resulting mean estimates match the true conditional means for almost every set of features. Calibration seems unachievable in practice, as one typically deals with finite samples of noisy observations. A weaker notion is auto-calibration, which means that the expected response among observations given the same mean estimate matches that estimate. This notion is important, e.g., in insurance pricing, as it ensures no cross-subsidization between different price cohorts. In this paper, we show that boosting trees can be used to test necessary conditions for calibration and auto-calibration, respectively. The practical relevance of our approach is supported by a numerical example, in which the proposed tests prove to be very powerful on a large insurance dataset.
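
The two notions described in this abstract can be stated compactly as follows. The symbols $\mu$, $X$, and $Y$ are notation chosen for this sketch (regression function, features, response) and may differ from the paper's.

```latex
% Calibration: the estimate equals the true conditional mean given the features.
\[
  \mu(X) = \mathbb{E}[\,Y \mid X\,] \quad \text{almost surely.}
\]
% Auto-calibration: the estimate equals the mean response among all
% observations that receive the same estimate.
\[
  \mathbb{E}[\,Y \mid \mu(X)\,] = \mu(X) \quad \text{almost surely.}
\]
% By the tower property, calibration implies auto-calibration,
% but not conversely.
```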

2026-06-06T10:14:36Z 36 pages Selim Gatti http://arxiv.org/abs/2605.10406v2 Multi-Fidelity Quantile Regression 2026-06-06T08:18:07Z

High-fidelity (HF) data are often expensive to collect and therefore scarce, making conditional quantiles difficult to estimate accurately. We propose a two-stage, model-agnostic method for multi-fidelity quantile regression. The central idea is a local quantile link: at each covariate value, the HF quantile is represented as a low-fidelity (LF) quantile evaluated at a covariate-dependent level. This reformulation reduces the problem to estimating the level function, which can be smoother than the HF quantile itself when the LF and HF conditional distributions have similar shapes. We also study the complementary regime in which this advantage weakens and introduce a correction step to improve robustness. Our theory characterizes when the proposed estimator converges faster than direct quantile regression using HF data alone and when the correction step provides further improvement. Experiments on synthetic and real data show that our method yields more accurate quantile estimates and tighter conformal prediction intervals.
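
One way to write the local quantile link described in this abstract is sketched below. The notation ($Q^{\mathrm{HF}}$, $Q^{\mathrm{LF}}$, $\ell_\tau$) is assumed for illustration and is not taken from the paper.

```latex
% Local quantile link (illustrative notation): the HF conditional quantile
% at level tau equals the LF conditional quantile at a covariate-dependent
% level ell_tau(x) in (0, 1).
\[
  Q^{\mathrm{HF}}_{\tau}(x) = Q^{\mathrm{LF}}_{\ell_\tau(x)}(x) .
\]
% If the LF and HF conditional distributions coincide at x, the level
% function reduces to the identity: ell_tau(x) = tau.
```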

2026-05-11T11:43:38Z 69 pages, 12 figures, 3 tables Yixiang Liu Yao Zhang http://arxiv.org/abs/2606.08032v1 Variational Proximal Policy Optimization 2026-06-06T07:50:50Z

Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization ($\textsc{VP}_2\textsc{O}$), a particle-based variational inference framework that maps policy optimization to Stein Variational Gradient Descent within a Mixture-of-Experts architecture. By leveraging functional kernels over localized expert prototypes alongside an expert orthogonalization loss, $\textsc{VP}_2\textsc{O}$ introduces a geometry-based proximal-control mechanism that can reduce reliance on fixed clipping or KL schedules. Our results on a 33B/4B sparse Mixture-of-Experts model show several improvements across complex reasoning benchmarks, establishing a $+\mathbf{179}$ ELO gain on Codeforces and a $\mathbf{32\%}$ reduction in token count on AIME mathematical reasoning tasks.

2026-06-06T07:50:50Z Ousmane Amadou Dia http://arxiv.org/abs/2205.01970v9 Non-Stationary Bandit Learning via Predictive Sampling 2026-06-06T06:49:00Z

Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We show that such failures arise because, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
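
For readers unfamiliar with the baseline discussed in this abstract, here is a minimal sketch of standard Thompson sampling on a stationary Bernoulli bandit with Beta(1, 1) priors. It is not the paper's predictive sampling algorithm; the function name, arguments, and fixed seed are assumptions made for this example.

```python
import random


def thompson_sampling(success_probs, steps, seed=0):
    """Run Beta-Bernoulli Thompson sampling; return the pull count per arm.

    success_probs holds the true (unknown to the agent) reward probability
    of each arm. The environment here is stationary.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    alpha = [1.0] * n_arms  # 1 + observed successes per arm
    beta = [1.0] * n_arms   # 1 + observed failures per arm
    pulls = [0] * n_arms
    for _ in range(steps):
        # Draw one plausible mean reward per arm from its Beta posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < success_probs[arm] else 0
        # Conjugate posterior update for the pulled arm only.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Because the posterior treats every past observation as equally informative, this sketch has no mechanism for discounting information that goes stale, which is the failure mode the abstract attributes to Thompson sampling in non-stationary settings.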

2022-05-04T09:37:16Z Yueyang Liu Xu Kuang Benjamin Van Roy http://arxiv.org/abs/2606.07986v1 Inference for High-Dimensional Sparse Spectral Precision Matrices 2026-06-06T05:27:35Z

Gaussian graphical models in the spectral domain offer a principled approach for recovering conditional dependence structures in stationary high-dimensional time series. Inference on the spectral precision matrix at a fixed frequency enables tests of frequency-specific conditional associations among time series components. The problem is challenging because finite-sample discrete Fourier transforms induce truncation and smoothing biases, while the complex-valued nature of the spectral precision matrix complicates high-dimensional variance estimation, rendering methods for i.i.d. samples not directly applicable. Existing approaches do not provide full likelihood-based inference for the discrete Fourier transforms. We propose a high-dimensional inference framework for sparse spectral precision matrices using the full likelihood of neighboring discrete Fourier transforms. We construct a debiased complex graphical lasso estimator at any fixed frequency. Using asymptotic theory for quadratic forms of multivariate time series, we establish its asymptotic normality and construct entry-wise consistent covariance estimators by aggregating information across neighboring frequencies. The key theoretical contribution is the simultaneous control of regularization, finite-sample truncation, and smoothing biases, enabling valid inference. Simulation studies show reliable coverage away from zero frequency and improved detection power over the benchmark, with false discovery rates near the desired level.

2026-06-06T05:27:35Z 47 pages, 5 figures, 5 tables Navonil Deb Younghoon Kim Sumanta Basu