https://arxiv.org/api/sNkmLJNHXtePXNLoyj8leGy/5fU
2026-06-11T08:02:39Z
3614648015

http://arxiv.org/abs/2605.29395v1
Low Rank for Rank: Uncertainty-Aware Task-Specific LLM Ranking under Sparse Pairwise Comparisons
2026-05-28T05:44:43Z
Pairwise human-preference platforms such as Chatbot Arena have become central to large language model (LLM) evaluation, yet reliable task-specific ranking remains challenging. Global leaderboards mask task heterogeneity, while ranking each fine-grained task independently is unstable under sparse, imbalanced comparisons. We propose a low-rank framework for task-specific LLM ranking from sparse pairwise comparisons, modeling the task-by-model ability matrix $\Theta^\star \in \mathbb{R}^{d_t \times d_m}$ as low rank so that information is shared across related tasks while task-specific differences are preserved. We first develop a max-norm ($\ell_\infty$) accurate estimator for the latent scores, combining a convex initializer with alternating-minimization refinement, and prove task-wise top-$K$ recovery guarantees under sparse sampling. Our main contribution is an uncertainty quantification framework for task-specific ranking. We construct cross-fitted one-step debiased estimators for fixed score contrasts -- such as the task-specific ability gap between two models -- yielding asymptotically valid confidence intervals that attain the semiparametric efficiency bound. We then extend the inference to the high-dimensional ranking regime, where per-task ranks and top-$K$ membership are determined by many dependent score-gap hypotheses. Using Gaussian and multiplier-bootstrap calibration, we obtain simultaneous confidence sets for per-task ranks and valid top-$K$ membership tests across many tasks and models.
Experiments on synthetic data and Chatbot Arena show that low-rank sharing improves sample efficiency over independent task-wise Bradley-Terry estimation and produces tighter, better-calibrated ranking certificates, with the largest gains in the sparse regime typical of real LLM benchmarks.
2026-05-28T05:44:43Z
Jiachun Li, David Simchi-Levi, Will Wei Sun

http://arxiv.org/abs/2605.26653v2
Nonparametric Regression via Tree-Guided Feature Aggregation
2026-05-28T04:35:57Z
In regression problems where covariates are naturally organized in a hierarchical tree structure, a central challenge is to select the resolution at which covariates enter the model. Determining this level of feature aggregation is of intrinsic scientific interest and can improve statistical efficiency by inducing sparsity. While a rich literature addresses this problem in the linear setting, extending feature aggregation to the nonlinear setting remains an open challenge. In this work, we propose to simultaneously perform model selection and feature aggregation through a penalized Nadaraya-Watson-type estimator. Our proposed estimator, Kernel Regression with Tree-EXploring AggregationS (KR-TEXAS), constructs adaptive penalty weights for the features based on pilot estimators of the regression function's partial derivatives. Under mild conditions, we establish model selection consistency for a well-defined target aggregation set, and our simulations show strong performance in both model selection and prediction. Finally, we demonstrate the utility of our procedure by applying it to a microbiome data set to predict short chain fatty acids. A user-friendly implementation of our procedure is available in the R package krtexas.
2026-05-26T07:37:45Z
Sithija Manage, Y. Samuel Wang, Martin T. Wells

http://arxiv.org/abs/2605.29315v1
Generalized Spectral Testing with Sample Splitting
2026-05-28T03:45:18Z
Residual-based goodness-of-fit tests for parametric time-series models are often complicated by parameter-estimation effects, which can alter the limiting behavior of diagnostic statistics. We propose a sample-splitting generalized spectral test (in the spirit of Escanciano (2006)) for assessing conditional mean specification in linear and nonlinear time-series models. The procedure estimates the model parameter on a fitting subsample and constructs a generalized spectral Cramér-von Mises statistic from residuals computed on a checking/testing subsample. The statistic aggregates pairwise conditional mean restrictions over all lags and is therefore bandwidth-free and free of truncation-lag selection. Under mild regularity conditions and a score-alignment condition, the residual-based process has the same limiting null distribution as the infeasible oracle process based on the true errors. Although the resulting limiting law is still non-pivotal, it can be consistently approximated by a simple multiplier bootstrap that does not require generating bootstrap time series or re-estimating parameters. Such an oracle-equivalence property is in sharp contrast to the original full-sample test, for which parameter estimation contributes an additional first-order term to the limiting process, and requires re-estimating parameters in each bootstrapped sample. We further establish consistency of the proposed test against fixed alternatives and nontrivial power against local alternatives.
Extensive simulations and real data analyses show that the proposed test controls size well, has comparable power, and delivers substantial computational savings in models where repeated estimation is costly.
2026-05-28T03:45:18Z
Yuxin Tao, Feiyu Jiang, Xiaofeng Shao

http://arxiv.org/abs/2605.29284v1
Rapid Approximation Prediction for Kriging
2026-05-28T03:11:05Z
Exact Kriging and conditional simulation (CS) for uncertainty quantification are computationally infeasible for modern spatial analyses with large numbers of observations and dense prediction grids. We present a rapid approximation to the Kriging prediction step for stationary Gaussian processes for a regular prediction grid by approximating each off-grid covariance vector by a sparse linear combination of on-grid covariances within a local $L$-order neighborhood of $M = (2L)^2$ neighboring grid points. This reformulation reduces complexity from $O(N n^3)$ to $O(N \log N + nM + M^3)$ while preserving accuracy. A factorial study shows that approximation error decreases systematically with increased Matérn smoothness, neighbor order $L$, and grid resolution, aligning with bounds from kernel approximation theory. In a North American summer-rainfall application ($n=1368$), our method produces predictions visually indistinguishable from exact Kriging with point-wise errors on the order of $10^{-5}$ inches and achieves more than $150$ times speedups at a $350\times350$ grid, also outperforming Vecchia and LatticeKrig predictions. Embedded in a fast CS scheme, the approach reproduces Kriging standard errors and scales favorably with both $n$ and $N$.
We recommend a practical workflow that uses a fast method for parameter estimation followed by our rapid predictor for fine-grid mapping and uncertainty quantification.
2026-05-28T03:11:05Z
11 figures, 38 pages
Ziyu Li, Gregory Fasshauer, Douglas Nychka

http://arxiv.org/abs/2605.28341v2
Identification and Inference for Structural Accelerated Failure Time Models via Instrument Interactions
2026-05-28T03:07:53Z
We study causal inference for time-to-event outcomes under right censoring in the presence of unmeasured confounding. Focusing on structural accelerated failure time models, we develop an identification and inference framework that exploits interactions among instrumental variables. The proposed approach does not rely on classical instrumental variable validity and yields valid causal inference under both valid and invalid instruments, provided that the interaction-based identification condition holds. To accommodate right censoring, we construct a censoring-adjusted observed data moment function using an augmented inverse probability censoring weighting approach. The resulting moment function is Neyman orthogonal with respect to nuisance functions and enjoys a double robustness property, enabling valid inference under flexible nuisance estimation. Estimation and inference are conducted using generalized empirical likelihood, which is well suited to settings with many potentially weak interaction-based moment conditions. We establish consistency, and asymptotic normality under many weak moment asymptotics, and develop diagnostic tools to assess interaction-based identification strength and overidentifying restrictions. Simulation studies demonstrate favorable finite sample performance across a range of censoring rates and instrument configurations.
An application to UK Biobank data illustrates the practical relevance of the proposed method for causal survival analysis in large-scale observational studies.
2026-05-27T11:47:09Z
Qiushi Bu, Wen Su, Xinyu Zhang, Xingqiu Zhao, Zhonghua Liu

http://arxiv.org/abs/2605.29255v1
Outcome-Calibrated Regression and Predicted Outcome-Based Inference
2026-05-28T02:16:05Z
Regression is a fundamental tool in scientific research. Ordinary least squares (OLS), one of the most widely used regression methods, enjoys several desirable properties, including the best linear unbiased estimator (BLUE) property. It is well known that, under the assumptions of the standard model, the OLS is conditionally unbiased given the covariates, i.e., $\mathbb{E}(\widehat Y-Y\mid X=x)=0$. However, an often-overlooked property of OLS is that the prediction error is generally not unbiased conditional on the outcome, i.e., $\mathbb{E}(\widehat Y-Y\mid Y=y)\neq 0$. As a consequence of minimizing mean squared error, OLS predictions are systematically shrunk toward the outcome mean, which explains the classical phenomenon of regression to the mean (RTM): large outcome values tend to be underpredicted, whereas small outcome values tend to be overpredicted. This conditional prediction bias creates a nonignorable problem for predicted outcome-based inference, where scientific inference is performed using the predicted outcome $\widehat Y$ and another variable $W$. In applications such as brain-age analysis and causal inference, we show that inference based on regression-predicted outcomes can be systematically biased. To address this issue, we propose outcome-calibrated regression (OCR), a new regression framework with a closed-form solution that directly enforces outcome calibration.
The proposed OCR estimator eliminates conditional prediction bias with respect to the outcome and enables valid inference using regression-predicted outcomes.
2026-05-28T02:16:05Z
Hwiyoung Lee, Shuo Chen

http://arxiv.org/abs/2402.01866v3
Parametric Bootstrap for Fixed Edge-Probability Network Models
2026-05-28T02:06:25Z
This paper studies parametric bootstrap methods for network data, with the goal of quantifying the uncertainty of network statistics of interest. While existing network resampling methods primarily focus on count statistics under node-exchangeable graphon models, we consider more general network statistics, including local statistics, under the Chung-Lu model without assuming node exchangeability. We show that the natural network parametric bootstrap, which first estimates the network-generating model and then draws bootstrap samples from the estimated model, generally suffers from bootstrap bias. As a general remedy, we show that a two-level bootstrap procedure provably reduces this bias. This extends the classical idea of the iterative bootstrap to the network setting, where the number of parameters grows with the network size. Moreover, for many network statistics, the second-level bootstrap provides a way to construct confidence intervals with higher accuracy. As a by-product of this analysis, we also obtain a central limit theorem for subgraph counts under the inhomogeneous Erdős-Rényi model, which may be of independent interest.
2024-02-02T19:44:22Z
Zhixuan Shao, Can M. Le

http://arxiv.org/abs/2605.29222v1
Valid and efficient possibilistic fusion
2026-05-28T01:20:06Z
Besides the classical motivation of fusing evidence from multiple sources, modern inferential procedures based on randomization, resampling, and data splitting often introduce analyst-generated multiplicity, where aggregating outputs across random realizations can improve robustness and stability.
This emphasizes the importance of developing principled strategies for fusing measures of evidence across different inferential settings, while preserving the key properties of the adopted inferential framework. The present paper addresses this problem in the context of inferential models (IMs), a possibilistic approach for provably valid statistical inference. Although the fusion of possibility measures has been extensively studied in the possibility-theory literature, existing methods do not, in general, preserve IM validity. We propose a general validity-preserving framework for possibilistic fusion, motivated by the ranking--validification construction underlying IMs. We study the implementation of this framework under independence, arbitrary dependence, and exchangeability of the available IMs, thereby providing a unified approach for IM fusion across a broad range of practically relevant scenarios. The proposed framework also reveals important efficiency considerations, showing that intuitive and commonly used fusion operators may become inefficient in the IM context, so that alternative choices can sometimes be advantageous, including ones that might not appear natural from a purely intuitive standpoint.
2026-05-28T01:20:06Z
28 pages, 7 figures
Leonardo Cella

http://arxiv.org/abs/2509.21734v2
Optimal Stopping for Sequential Bayesian Experimental Design
2026-05-28T01:10:56Z
Sequential Bayesian experimental design typically assumes that the number of experiments is fixed before data collection begins. In practical campaigns, however, experimentation may need to terminate early because additional measurements can provide diminishing information relative to their cost, raising the central decision question: when should one stop? Common threshold-based stopping rules are easy to implement but myopic, because they compare the current state with a fixed criterion without accounting for the expected value of future experiments.
This work develops a Bayesian optimal stopping framework for sequential experimental design by formulating stopping and design as coupled decisions in a Markov decision process. We prove that, for any design policy, the optimal stopping rule terminates exactly when the immediate terminal reward exceeds the expected continuation value. We then derive a policy gradient method for learning value-based stopping and design policies. Naïve joint training can create a circular dependency that traps learning in early-stopping local optima. We address this difficulty with a curriculum learning strategy that gradually transitions from forced continuation to adaptive stopping during training. Numerical studies on a linear-Gaussian benchmark, a one-dimensional nonlinear test problem, and a contaminant source detection problem show that the proposed approach learns stable design-stopping policies and improves resource-aware performance, with the largest gains in settings with strong sequential dependence.
2025-09-26T01:02:24Z
Chen Cheng, Xun Huan

http://arxiv.org/abs/2605.29189v1
Bayesian Multiplicity Correction in the Probabilistic Forward Stepwise Framework
2026-05-27T23:57:47Z
We develop a natural Bayesian multiplicity-correcting prior distribution within the probabilistic forward stepwise representation of model space priors for regression problems. The proposed prior, obtained from making an analogy to the Holm procedure, exhibits behavior closely aligned with that of the Matryoshka doll prior. We compare both priors to several other priors, including some recently put forward as objective choices for model space prior probabilities.
Our comparisons indicate that adequate multiplicity correction requires a degree of sparsity that many recommended priors do not provide, and we argue that multiplicity correction itself offers a principled and transparent criterion for specifying model space priors in regression.
2026-05-27T23:57:47Z
2 Figures
Andrew Womack, Daniel Taylor-Rodriguez

http://arxiv.org/abs/2605.29182v1
A Latent Variable Model for Response Times with Individual-Specific Change-Points
2026-05-27T23:37:57Z
Response times collected in computerised assessments provide information about the underlying response process and may exhibit within-person variation over the course of a test. We propose a latent variable model for log response times that incorporates individual-specific change-points. The model extends the log-normal response time model by allowing an item-specific shift in the mean structure after an unobserved change-point. The change-point is treated as a discrete latent variable, and its distribution is modeled as a function of latent speed. Estimation is carried out using marginal maximum likelihood. The framework yields posterior distributions for change-point locations, allowing uncertainty to be quantified at the individual level, and supports statistical inference for the change-point effect parameters. A simulation study examines parameter recovery and change-point estimation under varying boundary conditions, prevalence of changers, sample sizes, and test lengths. The results show accurate recovery of item and structural parameters.
The proposed model provides a unified approach to modeling response times with within-person changes in behaviour.
2026-05-27T23:37:57Z
Gabriel Wallin, Nivedita Bhaktha

http://arxiv.org/abs/2603.19573v2
Estimating within-cluster and between-cluster spillover effects in randomized saturation designs
2026-05-27T23:34:55Z
Randomized saturation designs are two-stage experiments: they first randomly assign treatment probabilities over the clusters and then randomly assign the treatment to the units within the clusters. The existing literature on randomized saturation designs focuses on estimating within-cluster spillover effects by assuming away between-cluster spillover effects. However, the units may interact across clusters in many practical randomized saturation designs. A leading example is that some units are geographically close to each other, so spillover effects arise across clusters. Based on the potential outcomes framework, we formulate the causal inference problem of estimating within-cluster and between-cluster spillover effects in randomized saturation designs. We clarify the causal estimands and establish the statistical theory for estimation and inference. We also apply our method to analyze a recent randomized saturation design of cash transfer on household expenditure in Kenya.
2026-03-20T02:35:02Z
To appear in Social Networks
Sizhu Lu, Lei Shi, Peng Ding

http://arxiv.org/abs/2605.13168v2
Variance-Aware Estimation and Inference for Michaelis--Menten Models with Heteroscedastic Errors and Clustered Measurements
2026-05-27T23:32:29Z
Michaelis--Menten analysis is often conducted by nonlinear least squares under a constant-variance assumption, even though enzyme-kinetic data frequently display concentration-dependent heteroscedasticity and often include repeated or clustered measurements.
We develop a variance-aware procedure for Michaelis--Menten estimation and inference that is motivated by conditional moment restrictions and implemented through simple conditionally Gaussian working models. For single curves, the method reduces to one-dimensional root finding for $K_m$ followed by closed-form plug-in updates for $V_{\max}$ and a variance scale parameter; the same score logic yields a cluster-level extension through a random-effect-induced working covariance. In simulation, modeling heteroscedasticity improved variance recovery and interval efficiency relative to homoscedastic nonlinear least squares, while cluster-aware semiparametric and NLME fits restored fixed-effect coverage far more effectively than pooled analyses that ignored clustering. In self-driving laboratory and soil exoenzyme data, heteroscedastic models achieved lower information criteria than homoscedastic nonlinear least squares, with the square-root variance function giving the most stable empirical fit among the prespecified working models. We implement the workflow in the companion \texttt{inferMM} package for single-curve, grouped, and clustered Michaelis--Menten analysis. These results show that simple variance-function and covariance modeling can stabilize original-scale Michaelis--Menten inference when variability changes with substrate concentration or measurements are clustered.
2026-05-13T08:30:44Z
Mijeong Kim, Minkyoung Cha, Ah Young Jeong

http://arxiv.org/abs/2510.16060v2
Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?
2026-05-27T23:16:53Z
The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications.
In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
2025-10-17T01:41:24Z
Published as a conference paper at ICLR 2026
Proceedings of ICLR 2026
Coen Adler, Yuxin Chang, Felix Draxler, Samar Abdi, Padhraic Smyth

http://arxiv.org/abs/2407.04142v2
Bayesian Structured Mediation Analysis With Unobserved Confounders
2026-05-27T21:47:45Z
We explore methods to reduce the impact of unobserved confounders on the causal mediation analysis of high-dimensional mediators with spatially smooth structures, such as brain imaging data. The key approach is to incorporate the latent individual effects, which influence the structured mediators, as unobserved confounders in the outcome model, thereby potentially debiasing the mediation effects. We develop the BAyesian Structured Mediation analysis with Unobserved confounders (BASMU) framework and establish its model identifiability conditions. Theoretical analysis is conducted on the asymptotic bias of the Natural Indirect Effect (NIE) and the Natural Direct Effect (NDE) when the unobserved confounders are omitted in mediation analysis. For BASMU, we propose a two-stage estimation algorithm to mitigate the impact of these unobserved confounders on estimating the mediation effect. Extensive simulations demonstrate that BASMU substantially reduces the bias in various scenarios.
We apply BASMU to the analysis of fMRI data in the Adolescent Brain Cognitive Development (ABCD) study, focusing on four brain regions previously reported to exhibit meaningful mediation effects. Compared with the existing image mediation analysis method, BASMU identifies two to four times more voxels that have significant mediation effects, with the NIE increased by 41%, and the NDE decreased by 26%.
2024-07-04T20:05:12Z
Yuliang Xu, Shu Yang, Jian Kang