http://arxiv.org/abs/2606.05919v2 Finding Most Influential Sets 2026-06-05T08:44:55Z

Identifying most influential sets (MIS), that is, size-$k$ subsets whose removal maximally changes a target estimand, is typically infeasible because it requires searching over $\binom{n}{k}$ subsets. For estimands with linear-fractional leave-set-out effects, we show that MIS selection reduces to a one-parameter sequence of top-$k$ problems. Dinkelbach's method yields an algorithm with $\mathcal{O}(n)$ cost per iteration and finite termination. For fixed residualized inputs, the algorithm returns a globally optimal set for the univariate ratio objective, including the oracle-residualized partial linear model. With estimated nuisance functions, uniform denominator and generated-score stability imply approximation to the first-order oracle orthogonal-score objective; exact set recovery follows under a separation condition. Simulations and applications show that the method recovers exact MIS that were previously computationally inaccessible.
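
The reduction to a sequence of top-$k$ problems can be illustrated with a minimal sketch of Dinkelbach's method on a generic ratio objective. This is not the paper's implementation: the objective (a ratio of two sums over the selected set, with strictly positive denominator terms), the function name `dinkelbach_top_k`, and the arrays `a` and `b` are assumptions made for illustration. The sketch uses `heapq.nlargest`, which costs $\mathcal{O}(n \log k)$ per iteration; a linear-time selection routine would be needed for the $\mathcal{O}(n)$ cost stated in the abstract.

```python
from heapq import nlargest


def dinkelbach_top_k(a, b, k, tol=1e-12, max_iter=100):
    """Maximise sum(a[i] for i in S) / sum(b[i] for i in S) over |S| == k.

    Illustrative assumption: every b[i] is strictly positive, so the
    denominator is positive for every candidate set.
    """
    n = len(a)
    subset = list(range(k))  # arbitrary starting set
    lam = sum(a[i] for i in subset) / sum(b[i] for i in subset)
    for _ in range(max_iter):
        # Parametric subproblem: for fixed lam, the best set is a top-k
        # selection on the per-item scores a[i] - lam * b[i].
        candidate = nlargest(k, range(n), key=lambda i: a[i] - lam * b[i])
        gap = sum(a[i] - lam * b[i] for i in candidate)
        if gap <= tol:
            # No set improves on the current ratio, so it is optimal.
            break
        subset = candidate
        lam = sum(a[i] for i in subset) / sum(b[i] for i in subset)
    return sorted(subset), lam
```

Each update strictly increases the ratio and there are finitely many subsets, which is why the loop stops after finitely many iterations in exact arithmetic.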

2026-06-04T09:24:26Z Published as a conference paper at ICML 2026, fixed ref Lucas D. Konrad Nikolas Kuschnig http://arxiv.org/abs/2606.07694v1 Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head 2026-06-05T07:31:07Z

Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging. Under such conditions, conventional spatio-temporal graph neural networks (ST-GNNs) can degrade toward conservative near-zero predictions and fail to capture non-zero activity. Although zero-inflated negative binomial (ZINB) models partially address excess zeros, their two-part formulation can still remain conservative around abrupt transitions. To address these issues, we propose a model-agnostic learnable Tweedie head that can be attached as a plug-and-play output module to arbitrary ST-GNN backbones. Instead of likelihood-based Tweedie training, which typically requires surrogate objectives, our approach optimizes the closed-form Tweedie unit deviance and predicts the mean for point forecasting while learning a node-level variance power to capture heterogeneous variability across port areas. Experiments on a maritime traffic graph constructed from real-world AIS data in the Port of Los Angeles and Long Beach show that the proposed head consistently improves RMSE across multiple ST-GNN backbones, especially on non-zero events, leading to more reliable forecasts for practical maritime traffic control.
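
For reference, the closed-form Tweedie unit deviance can be sketched as follows. This uses the standard textbook expression for a variance power $p$ strictly between 1 and 2 (the compound Poisson-gamma range, which allows exact zeros); the abstract does not state the paper's exact loss or parameterization, so the function names and the per-node list `ps` are assumptions for illustration only.

```python
def tweedie_unit_deviance(y, mu, p):
    """Standard Tweedie unit deviance d(y, mu) for variance power 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch only covers 1 < p < 2")
    if y < 0.0 or mu <= 0.0:
        raise ValueError("requires y >= 0 and mu > 0")
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )


def mean_tweedie_deviance(ys, mus, ps):
    """Average deviance over nodes, each with its own variance power
    (a hypothetical stand-in for a learned node-level power)."""
    terms = [tweedie_unit_deviance(y, mu, p) for y, mu, p in zip(ys, mus, ps)]
    return sum(terms) / len(terms)
```

The deviance is zero when the prediction equals the observation and stays finite at `y = 0`, so sparse zero counts need no special handling.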

2026-06-05T07:31:07Z Kyeongjun Lee Heeyoung Kim http://arxiv.org/abs/2606.07693v1 Transfer learning for causal forest 2026-06-05T07:16:14Z

Transfer learning addresses the challenge of transferring knowledge from one domain to another. Traditional transfer learning focuses on adapting models trained on a source domain (with many observations) to improve performance on a target domain (with few observations). In this work we consider the case of a model shift and focus on transfer learning applied to a causal forest, namely HTERF. This causal forest aims to estimate the Conditional Average Treatment Effect (CATE). The approach considered is the offset method presented by Wang (2016), adapted to a causal context. This method relies on intermediate models to estimate the offset between the source and target distributions. Our main result is a bound on the CATE error of HTERF on the target domain, depending on the error of the intermediate models. Experiments on simulated data and on a real-world dataset show that the approach performs well across different settings.

2026-06-05T07:16:14Z Bérénice-Alexia Jocteur ICJ, PSPM Véronique Maume-Deschamps ICJ, PSPM Pierre Ribereau PSPM, ICJ http://arxiv.org/abs/2606.05967v2 Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples 2026-06-05T07:03:28Z

In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. We establish a new convergence rate for the Mean-Square Error (MSE) on the approximated function that is (i) fast, in the sense that it admits an optimal dependence on the number of iterations k (i.e., of order 1/k), (ii) robust to ill-conditioning: it depends only on an initial error and model-independent constants, and (iii) sharp up to a multiplicative constant lower than 11. In particular, it does not depend on the smallest eigenvalue of the uncentered covariance matrix of the linear parametrization, unlike all pre-existing O(1/k) rates in the TD(0) literature. We also introduce PCTD(0), a variant of TD(0), which benefits from better convergence properties under an additional assumption of strong mixing on the Markov Chain.
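
The setting described above (i.i.d. samples, constant step, iterate averaging) can be sketched in a few lines. This is a generic illustration of averaged TD(0) with linear features, not the paper's analysis or its PCTD(0) variant; the two-state chain, the one-hot features, and all names below are hypothetical choices made so that the example is self-contained.

```python
import random


def td0_averaged(sample, phi, dim, alpha, gamma, num_steps):
    """TD(0) with linear function approximation, a constant step alpha,
    and a running average of the iterates."""
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for k in range(1, num_steps + 1):
        s, r, s_next = sample()  # one i.i.d. transition
        f, f_next = phi(s), phi(s_next)
        v = sum(t * x for t, x in zip(theta, f))
        v_next = sum(t * x for t, x in zip(theta, f_next))
        delta = r + gamma * v_next - v  # temporal-difference error
        theta = [t + alpha * delta * x for t, x in zip(theta, f)]
        # Running mean of theta_1, ..., theta_k.
        theta_bar = [tb + (t - tb) / k for tb, t in zip(theta_bar, theta)]
    return theta_bar


def make_toy_sampler(seed):
    """Hypothetical two-state chain: uniform transitions, reward 1 in state 0."""
    rng = random.Random(seed)

    def sample():
        s = rng.randrange(2)       # state drawn from the stationary law
        s_next = rng.randrange(2)  # next state, uniform over both states
        return s, (1.0 if s == 0 else 0.0), s_next

    return sample


def one_hot(s):
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]
```

With discount 0.5 the exact values of this toy chain are 1.5 and 0.5, so `td0_averaged(make_toy_sampler(0), one_hot, 2, 0.1, 0.5, 20000)` should return a vector close to those values.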

2026-06-04T10:10:29Z This is an extended version of a paper accepted at AISTATS 2026 AISTATS 2026, May 2026, Tanger, Morocco Ziad Kobeissi L2S Éloïse Berthier U2IS http://arxiv.org/abs/2606.06957v1 Deep Single-Index Fréchet Regression 2026-06-05T06:35:28Z

Predicting outputs that are located in non-Euclidean spaces, such as probability distributions, networks, and symmetric positive-definite matrices, is becoming increasingly important in modern data analysis, particularly when inputs are high-dimensional. We propose DeSI (Deep Single-Index Fréchet Regression), a semiparametric framework for regression with metric space-valued outputs and multivariate inputs that assumes a single-index structure for the conditional Fréchet mean. DeSI estimates an interpretable index direction, which quantifies the relative importance of inputs, using a deep neural network, and performs Fréchet regression along the resulting one-dimensional index in the target metric space. This structure mitigates the curse of dimensionality while retaining interpretability, which stands in contrast to standard deep neural networks. We establish theoretical guarantees for DeSI, including uniform approximation and convergence rates, and demonstrate its strong predictive performance through simulations on distributions, networks, and symmetric positive-definite matrices, as well as an application to compositional mood data from New Jersey.

2026-06-05T06:35:28Z Muqing Cui Yidong Zhou Su I Iao Hans-Georg Müller http://arxiv.org/abs/2606.06855v1 Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite $L_p$ Moments 2026-06-05T02:59:49Z

While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing high-probability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which can be overly restrictive for modern settings with heavy-tailed or unbounded losses. We develop a stability-based framework that requires only a finite $L_p$ moment condition. Our first contribution is sharp concentration inequalities for functions of independent random variables under $L_p$ constraints, extending McDiarmid's bounded-differences techniques beyond the classical regime. Leveraging these results, we derive sharp high-probability generalization bounds across a range of learning paradigms, including empirical risk minimization, transductive regression, and meta-learning. These guarantees show that $L_p$ stability suffices for robust generalization even when boundedness fails, substantially weakening the standard assumptions in the stability literature.

2026-06-05T02:59:49Z Qianqian Lei Soham Bonnerjee Yuefeng Han Wei Biao Wu http://arxiv.org/abs/2508.02039v2 Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning 2026-06-05T02:23:44Z

Increasing concerns for data privacy and other difficulties associated with retrieving source data for model training have created the need for source-free transfer learning, in which one only has access to pre-trained models instead of data from the original source domains. This setting is challenging because most existing transfer learning methods rely on access to source data, which limits their direct applicability when that data is unavailable. Practical concerns add further difficulty, for instance efficiently selecting models for transfer without information on source data, and transferring without full access to the source models. So motivated, we propose a model recycling framework for parameter-efficient training of models that identifies subsets of related source models to reuse in both white-box and black-box settings. Consequently, our framework makes it possible for Model as a Service (MaaS) providers to build libraries of efficient pre-trained models, thus creating an opportunity for multi-source data-free supervised transfer learning.

2025-08-04T04:11:40Z Sijia Wang Ricardo Henao http://arxiv.org/abs/2606.06814v1 The Effect of Training Task Diversity on In-Context Learning through the Lens of Low-Dimensional Subspaces 2026-06-05T01:35:42Z

The transformer's emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its underlying mechanisms. Existing works often study how training task diversity, defined either as the number of ICL training task vectors or as the number of function classes from which the task vectors are drawn, shapes both the learning dynamics and generalization capabilities of ICL. While both definitions have uncovered many interesting phenomena, many observations under the latter definition remain theoretically unexplained. This paper presents a minimal analytical model under which these phenomena provably emerge from the properties of the training data. By modeling the training task vectors as a mixture of low-rank Gaussians, we show how training task diversity, defined by the number of non-overlapping columns between subspaces that parameterize the covariance matrices, improves both the generalization and optimization trajectory of ICL with linear attention. In particular, we show that our model can explain (i) why training with task diversity shortens the ICL plateau and (ii) why ICL appears to achieve out-of-distribution generalization. We conclude by empirically demonstrating how our results extend to nonlinear transformers and nonlinear function classes. Overall, our work presents a tractable framework to unify existing observations.

2026-06-05T01:35:42Z Soo Min Kwon Alec S. Xu Can Yaras Dogyoon Song Laura Balzano Qing Qu http://arxiv.org/abs/2604.03146v2 Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization 2026-06-05T01:10:02Z

We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min-Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min-max characterization of key statistics, enabling approximation of the mean $\mu_{\hat{\theta}}$ and covariance $C_{\hat{\theta}}$ of the ERM estimator $\hat{\theta}$. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate $x$ independent of the training data, the projection $\hat{\theta}^\top x$ approximately follows the convolution of the generally non-Gaussian distribution of $\mu_{\hat{\theta}}^\top x$ with an independent centered Gaussian variable of variance $\mathrm{tr}(C_{\hat{\theta}} \mathbb{E}[xx^\top])$. This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any $\mathcal{C}^2$ regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and gradient at $\mu_{\hat{\theta}}$. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.

2026-04-03T16:07:02Z 28 pages, 5 figures, 1 table ICML 2026 Chiheb Yaakoubi Cosme Louart Malik Tiomoko Zhenyu Liao http://arxiv.org/abs/2109.02644v5 Resolvent convergence for sample covariance matrices with general covariance profiles and quadratic-form control 2026-06-05T01:01:08Z

We study the resolvent \[ G^z = \left(\frac{1}{n}XX^T - zI_p\right)^{-1}, \qquad z\in\mathbb C,\ \Im(z)>0, \] where $X=(x_1,\ldots,x_n)\in\mathcal M_{p,n}$ is a random matrix with independent, but not necessarily identically distributed, columns. Our bounds are expressed in terms of moments of the centered quadratic forms \[ q_i(A):=x_i^TAx_i-\mathbb E[x_i^TAx_i], \] for deterministic matrices $A$ with unit Hilbert--Schmidt norm. In particular, we do not assume independence between the entries of a given column $x_i$. In the quasi-asymptotic regime $p\le O(n)$, the matrix $G^z$ admits a natural deterministic equivalent $\tilde G^z$, depending only on the second moments of the column vectors $x_1,\ldots,x_n$. We show that, for any deterministic matrix $B\in\mathcal M_p$, the trace $\text{Tr}(BG^z)$ is close to $\text{Tr}(B\tilde G^z)$, with error controlled by $\|B\|_{\text{HS}}$ under first-moment bounds on the quadratic forms, and by $\|B\|_{\text{HS}}/\sqrt n$ under suitable second-moment bounds.

2021-09-06T14:21:43Z Main text 38p Cosme Louart http://arxiv.org/abs/2606.06785v1 Empirical Transfer Operators and Finite-Sample Change Detection for Noisy Expanding Interval Maps 2026-06-05T00:03:45Z

We study finite-sample change detection for one-dimensional noisy dynamical systems using partition-based empirical approximations of stationary behaviour. Given observations from an interval-valued process, we partition the state space, estimate a finite transition matrix from observed transitions between partition elements, and apply a small Doeblin-type regularisation to ensure a unique stationary distribution. From an initial reference segment, we compute a baseline empirical stationary distribution $\widehat{\pi}_{0,\rho}$. For each later sliding window, we compute $\widehat{\pi}_{t,\rho}$ and define the score \[ S_t=\|\widehat{\pi}_{t,\rho}-\widehat{\pi}_{0,\rho}\|_1. \] Large values of $S_t$ indicate a change in stationary behaviour relative to the baseline. The statistic detects changes in invariant density or stationary law, but not all possible changes in transition dynamics. Under explicit assumptions on empirical transition concentration, finite-state stationary distribution stability, partition approximation, regularisation bias, and noise stability, we derive a finite-sample bound for the empirical stationary density. The bound separates sampling error, regularisation bias, partition approximation error, and noise bias. We then obtain a single-window false-alarm guarantee and a sufficient detection condition when the invariant density changes by more than the estimation error. We illustrate the method on synthetic noisy beta-map change-point experiments.
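
The pipeline described above (partition, empirical transition matrix, regularisation, stationary distribution, $\ell_1$ score) can be sketched as follows. Several details are assumptions, since the abstract does not fix them: observations are taken to lie in $[0,1]$, the partition is a uniform grid, the Doeblin-type regularisation is modelled as mixing each row with the uniform distribution with weight `rho`, rows with no observed transitions are set to uniform, and the stationary distribution is approximated by power iteration. All function names are hypothetical.

```python
def empirical_stationary(xs, num_bins, rho, iters=500):
    """Stationary distribution of the regularised empirical transition matrix.

    Assumes every observation lies in [0, 1] and 0 < rho <= 1.
    """
    m = num_bins
    bins = [min(int(x * m), m - 1) for x in xs]
    counts = [[0.0] * m for _ in range(m)]
    for i, j in zip(bins, bins[1:]):
        counts[i][j] += 1.0  # observed transition from bin i to bin j
    P = []
    for row in counts:
        total = sum(row)
        base = [c / total for c in row] if total > 0 else [1.0 / m] * m
        # Mixing with the uniform law makes every entry at least rho / m,
        # which guarantees a unique stationary distribution.
        P.append([(1.0 - rho) * p + rho / m for p in base])
    pi = [1.0 / m] * m
    for _ in range(iters):  # power iteration: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
    return pi


def change_score(reference, window, num_bins=10, rho=0.05):
    """L1 distance S_t between window and baseline stationary distributions."""
    pi_0 = empirical_stationary(reference, num_bins, rho)
    pi_t = empirical_stationary(window, num_bins, rho)
    return sum(abs(p - q) for p, q in zip(pi_t, pi_0))
```

A window identical to the reference segment scores zero, while a window whose mass sits in a different part of the interval scores close to the maximum value of 2.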

2026-06-05T00:03:45Z 27 pages, 2 tables, 1 figure Aparna Rajput http://arxiv.org/abs/2606.06782v1 The Sharp Phase Transition of Tyler's M-Estimator for Robust Subspace Recovery 2026-06-04T23:55:23Z

Robust Subspace Recovery (RSR) aims to identify an underlying d-dimensional subspace from a dataset heavily corrupted by outliers. Complexity-theoretic results establish a threshold for the problem's computational hardness based on the dimension-scaled signal-to-noise ratio (DS-SNR): the problem is SSE-hard when the DS-SNR is strictly less than 1, and solvable via practical algorithms when it is greater than 1 under general position assumptions. However, the exact behavior of practical algorithms at the critical boundary DS-SNR = 1 has remained unknown. This work resolves the behavior of Tyler's M-estimator (TME) at this critical boundary, consequently establishing a sharp phase transition. Specifically, we prove that TME converges exactly to the true subspace for DS-SNR $\geq 1$ under a new stability condition, which is less restrictive than the general position assumptions used in prior literature. Our analysis utilizes a decomposition of the TME iterates within a majorization-minimization framework.

2026-06-04T23:55:23Z Gilad Lerman Teng Zhang http://arxiv.org/abs/2606.06772v1 Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods 2026-06-04T23:31:52Z

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

2026-06-04T23:31:52Z 37 pages Junyu Zhou Puyu Wang Yunwen Lei Marius Kloft Yiming Ying http://arxiv.org/abs/2606.06764v1 Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks 2026-06-04T23:04:49Z

Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

2026-06-04T23:04:49Z 39 pages, 1 table Junyu Zhou Puyu Wang Yunwen Lei Yiming Ying Ding-Xuan Zhou http://arxiv.org/abs/2606.07677v1 Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference 2026-06-04T19:56:09Z

Electronic health records (EHR) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around latent, risk-factor-modulated disease pathways. Risk factors act on hyperedges, latent disease subsets with shared risk patterns, allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

2026-06-04T19:56:09Z ICML 2026 Oral Shengxian Ding Haonan Gao Pangpang Liu Xinyuan Tian Yize Zhao