http://arxiv.org/abs/2606.02363v1
Minimax-Optimal Policy Regret in Partially Observable Markov Games
2026-06-01T15:11:51Z
We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.
2026-06-01T15:11:51Z
Raman Arora

http://arxiv.org/abs/2606.02345v1
Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization
2026-06-01T14:54:29Z
Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques.
A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.
2026-06-01T14:54:29Z
Louise Davy, Stephan Clémençon, Charlotte Laclau

http://arxiv.org/abs/2601.16884v3
Multigrade Neural Network Approximation
2026-06-01T14:41:19Z
We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly nonconvex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably certain one-hidden-layer ReLU models, training admits convex reformulations with global guarantees under appropriate settings, motivating learning paradigms that improve stability while scaling to depth. MGDL builds on this insight by training deep networks grade by grade: previously learned grades are frozen, and each newly added grade-wise subnetwork is composed on top of the previously learned grades and trained to fit the residual left by the current approximation, yielding a structured and interpretable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function defined on a hypercube, there exists a fixed-width multigrade ReLU scheme whose residuals are pointwise nonincreasing in magnitude and converge uniformly to zero, with strict $L^p$-norm decay at every nontrivial grade for $p\in [1,\infty)$.
To the best of our knowledge, this work provides the first rigorous constructive approximation guarantee showing that a grade-wise residual refinement scheme can achieve vanishing error in a fixed-width multigrade ReLU architecture.
2026-01-23T16:46:25Z
Shijun Zhang, Zuowei Shen, Yuesheng Xu

http://arxiv.org/abs/2602.06065v3
Deep networks learn to parse uniform-depth context-free languages from local statistics
2026-06-01T13:46:20Z
Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
2026-01-31T17:35:06Z
Accepted as regular paper at ICML 2026
Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

http://arxiv.org/abs/2606.02247v1
ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation
2026-06-01T13:40:38Z
Shapley values are a principled attribution measure widely used in interpretable machine learning, but their exact computation scales exponentially with the number of players, motivating a wide range of approximation methods based on value function evaluations of sampled coalitions. This raises the question of whether approximation accuracy can be improved by adaptively selecting coalitions for evaluation based on previous evaluations. This is particularly relevant in settings where the value function is costly and the number of evaluations is severely limited, such as retraining-based feature importance, data valuation, and hyperparameter importance. For this purpose, we propose ShaplEIG, a Bayesian experimental design approach that approximates the expensive value function using a Gaussian process surrogate and adaptively selects coalitions based on their expected information gain about the Shapley values. By the linearity of the Shapley values in the value function, we show that the expected information gain is available in closed form. Furthermore, we propose an efficient computation scheme that reduces the complexity from exponential to polynomial in the number of players via elementary symmetric polynomials.
In extensive experiments across diverse costly applications, our method consistently improves sample efficiency in the low-budget regime over state-of-the-art baselines.
2026-06-01T13:40:38Z
Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)
David Rundel, Fabian Fumagalli, Maximilian Muschalik, Bernd Bischl, Matthias Feurer

http://arxiv.org/abs/2606.02231v1
Identifiable Markov Switching Models with Instantaneous Effects and Exponential Families
2026-06-01T13:25:58Z
Temporal systems often exhibit non-stationary behaviour, such as seasonal climate variation or glucose fluctuations in patients with type-1 diabetes. One way to model non-stationarity is through discrete latent regimes, i.e., stationary segments of time. Such systems induce a Markov Switching Model (MSM), a class of Hidden Markov Models with autoregressive dependencies among latent regimes and observed variables. Identifying latent regimes is challenging in the presence of frequent regime switches and nonlinear and non-Gaussian dynamics, particularly when there are instantaneous effects between the variables, e.g., due to slow rates of measurements. In this work, we establish the identifiability of both latent regimes and regime-dependent causal structures under temporal regime dependencies, nonlinear lagged and instantaneous effects, and independent noise from the exponential family. Our identifiability theory subsumes non-temporal mixtures of causal models. Furthermore, we introduce FlowMSM, a regime detection framework that can be paired with any stationary causal discovery method to recover regime-dependent causal structures.
Experiments on synthetic benchmarks and a financial economics dataset demonstrate the effectiveness of our approach to detect latent regimes and discover causal structures from non-stationary time series.
2026-06-01T13:25:58Z
International Conference on Machine Learning (ICML) 2026
Roel Hulsman, Carles Balsells-Rodas, Sara Magliacane

http://arxiv.org/abs/2606.02228v1
Bayesian meta-learning for modeling Alzheimer's disease progression
2026-06-01T13:24:56Z
Predicting whether an individual with Alzheimer's disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual's current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual's historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart.
On real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.
2026-06-01T13:24:56Z
Clara Hoffmann, Nadja Klein

http://arxiv.org/abs/2603.23398v3
Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation
2026-06-01T12:59:54Z
Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery and materials design. In these domains, probabilistic inference is particularly valuable, as it enables composable generation and principled incorporation of desired constraints, such as structural or functional properties. Energy-based models naturally support this goal by capturing relative likelihoods and enabling composable inference by directly enforcing constraints during inference. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities, resulting in a fidelity gap compared to discrete diffusion models. To address this gap, we introduce Graph Energy Matching (GEM), a discrete generative framework inspired by the Jordan-Kinderlehrer-Otto (JKO) transport-map optimization perspective. GEM learns a permutation-invariant potential energy that simultaneously guides discrete transport from noise toward high-likelihood graph regions and refines samples within these regions. We further introduce a sampling protocol leveraging an energy-based switching strategy, seamlessly bridging rapid, gradient-guided transport and a local mixing regime for effective exploration. On molecular graph benchmarks, GEM matches or surpasses strong discrete diffusion baselines on most reported metrics.
Beyond improving generation quality, GEM's relative likelihood modeling enables targeted exploration, facilitating compositional generation, property-constrained sampling, and interpolation between graphs. Project page: https://michalbalcerak.ai/graph-energy-matching/.
2026-03-24T16:35:25Z
Michal Balcerak, Suprosanna Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze

http://arxiv.org/abs/2506.10677v3
Exploiting Similarities in A/B Testing with Off-Policy Estimation
2026-06-01T12:59:24Z
We study A/B testing, the standard protocol for measuring the performance gain of a new decision system relative to a baseline. Traditional A/B testing treats both systems as black boxes, ignoring potential similarities between them. In practice, however, new and baseline systems are rarely radically different and often share significant structure, which can be captured by their propensities to make similar decisions. We show that in such cases, the commonly used difference-in-means estimator, though unbiased, is statistically suboptimal. Leveraging off-policy estimation, we introduce a family of A/B testing estimators that exploit the propensities of the tested systems to achieve improved concentration properties. This family is flexible enough to be tailored to practical decision-making. The resulting estimators are simple, robust to propensities misspecification, substantially more accurate when the tested systems exhibit similarities, and gracefully fall back to the difference-in-means estimator when such similarities are absent. Our theoretical analysis and empirical studies confirm their efficiency and practicality.
2025-06-12T13:11:01Z
KDD '26
Otmane Sakhi, Alexandre Gilotte, David Rohde

http://arxiv.org/abs/2510.09288v2
A unifying Bayesian framework for adversarial robustness
2026-06-01T12:51:35Z
The vulnerability of machine learning models to adversarial attacks remains a critical societal security challenge.
Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. These deterministic approaches do not account for uncertainty in the adversary's attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several state-of-the-art defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.
2025-10-10T11:28:30Z
Pablo G. Arce, Roi Naveiro, David Ríos Insua

http://arxiv.org/abs/2602.06837v2
Sharpness-Aware Hybrid Model Learning for Architecture-Agnostic Parameter Estimation
2026-06-01T12:46:52Z
Hybrid modeling, the combination of machine learning models and scientific mathematical models, enables flexible and robust data-driven prediction with partial interpretability. However, the unknown parameters of the scientific model cannot necessarily be estimated properly, since the flexibility of the machine learning model might make the scientific model part effectively ignored in prediction. We may avoid it by applying some regularization, but the formulation of such regularizers typically depends on model architectures and domain knowledge. In this paper, we propose an architecture-agnostic method to learn hybrid models while properly estimating the scientific parameters. The idea is to use the flatness of loss minima to achieve model simplicity, based upon the Occam's razor principle.
We employ the idea of sharpness-aware minimization and adapt it to the hybrid modeling setting. Numerical experiments demonstrate the effectiveness of the SAM-based hybrid model learning for scientific parameter estimation.
2026-02-06T16:27:19Z
Naoya Takeishi

http://arxiv.org/abs/2509.03456v2
Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation
2026-06-01T12:41:06Z
Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as the action space grows. We show that estimator-aware policy parametrization can mitigate, but not fully resolve, optimization challenges. Building on this, we explore simpler weighted log-likelihood objectives and demonstrate that they enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.
2025-09-03T16:25:45Z
ICML '26
Imad Aouali, Otmane Sakhi

http://arxiv.org/abs/2604.08149v2
A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
2026-06-01T12:08:21Z
We consider a linear contextual bandit model where contexts and rewards are governed by a finite hidden Markov chain. We first revisit the simplified model by Nelson et al.
(2022), in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts (called beliefs), rather than functions of the hidden states themselves. This simplified model may be handled through a direct reduction to standard linear contextual bandits. We extend the theoretical analysis of this reduction to take into account the estimation of the parameters of the hidden Markov model [HMM] in the regret bound and to provide high-probability bounds not depending anymore on the reward functions and only depending on the model through the estimation of the HMM parameters. Second, and most importantly, we instead study the more natural and more complex model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits). Under a classic HMM forgetting condition, the main algorithmic tool introduced to cope with the various statistical dependencies that the reward structure introduces is to only periodically update reward-model parameters.
2026-04-09T12:09:45Z
ICML 2026 - Forty-Third International Conference on Machine Learning, Jul 2026, Seoul, South Korea
Zhen Li (LMO, CELESTE, HEC Paris), Gilles Stoltz (LMO, CELESTE, HEC Paris)

http://arxiv.org/abs/2509.18025v2
Deep Learning as the Disciplined Construction of Tame Objects
2026-06-01T12:03:50Z
One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overview of some topics at the interface of tame geometry (also known as o-minimality), optimization theory, and deep learning theory and practice. To do so, we gradually introduce the concepts and tools used to build convergence guarantees for stochastic gradient descent in a general nonsmooth nonconvex, but tame, setting.
This illustrates some ways in which tame geometry is a natural mathematical framework for the study of AI systems, especially within Deep Learning.
2025-09-22T17:00:40Z
39 pages, 10 figures
Gilles Bareilles, Allen Gehret, Johannes Aspman, Jana Lepšová, Jakub Mareček

http://arxiv.org/abs/2410.14483v3
Interventional Processes for Causal Uncertainty Quantification
2026-06-01T12:01:29Z
Reliable uncertainty quantification for causal effects is crucial in high-stakes applications, but remains challenging when the target is an entire function rather than a scalar estimand. In this work, we introduce a GP-based approach for uncertainty quantification of interventional functions. The central idea is to build on recent work representing interventional functions as an inner-product of observational functions in a reproducing kernel Hilbert space (RKHS), by constructing appropriate GP priors for such functions and inferring posteriors from observational data. Our approach yields closed-form posterior moments and tractable training and inference, while avoiding pathologies of previous GP prior constructions for RKHS functions. We further derive a practical procedure for posterior coverage calibration. Across synthetic benchmarks, causal Bayesian optimization tasks, and a large-scale real dataset, our method improves uncertainty quantification while remaining competitive in causal effect estimation.
2024-10-18T14:06:49Z
Hugh Dance, Peter Orbanz, Arthur Gretton