https://arxiv.org/api/je2PKokywMBt2eAJjILljilv8Dk2026-06-14T12:08:31Z7835434515http://arxiv.org/abs/2606.05258v1Harnessing Source Heterogeneity for Cluster-Structured Transfer Learning2026-06-03T16:09:15ZTransfer learning is a natural strategy when a target population has limited data but multiple related auxiliary sources are available. A central difficulty is source heterogeneity: auxiliary sources may not be equally useful, and their usefulness may vary in a structured, cluster-like fashion. Existing transfer-learning methods often reduce source selection to a binary informative/non-informative decision, overlooking subgroups of sources with differential transferability. Motivated by a suicide-risk study using data from the Connecticut Hospital Information Management Exchange (CHIME), comprising 636,758 patients across 27 hospitals, we propose Trans-GLMC, a cluster-structured transfer-learning procedure for generalized linear models. The CHIME setting illustrates the core challenge: hospital-specific risk models are unstable because suicide attempts are rare at any single facility, whereas indiscriminate pooling across hospitals can obscure facility-level differences in patient mix and risk profiles. Trans-GLMC first constructs a coefficient-based distance among the target and candidate sources to recover latent source clusters. It then combines global fusion, within-cluster refinement, and target debiasing to produce an estimator that adapts to the detected structure. We establish a non-asymptotic error bound that improves over its unclustered counterpart whenever a meaningful target cluster exists and matches the unclustered rate up to constants otherwise. In simulations and in the CHIME study, Trans-GLMC improves facility-specific prediction, identifies interpretable communities of hospitals with mutual transferability, and recovers clinically coherent suicide-risk factors.2026-06-03T16:09:15ZXiaohui YinJun JinShane J. SaccoRobert H. AseltineKun Chenhttp://arxiv.org/abs/2606.05046v1Graph Cascades: Contagion-Based Mesoscopic Rewiring for Structure-Aware Graph Machine Learning2026-06-03T16:07:24ZWe introduce Graph Cascades, a mesoscopic rewiring strategy for Graph Neural Networks (GNNs) and Graph Transformers (GTs) that captures intermediate-scale graph structure beyond purely local edges or fully global attention. Using contagion-based diffusion processes, Graph Cascades constructs, in O(|V|+|E|) time, an auxiliary graph where node pairs supported by repeated multi-hop reinforcement are promoted to direct neighbors. We theoretically characterize when reinforcement-based rewiring helps: sufficient conditions under which reinforcement-based edge selection is more label-aligned than direct adjacency, an SBM witness in which two-hop reinforcement is perfectly homophilic, and a formalization of mesoscopic connectivity via graph effective resistance. Empirically, across node-classification benchmarks, Graph Cascades improves multiple GNN and sparse-GT backbones, with the most reliable gains observed on heterophilic and moderate- to high-degree homophilic graphs. The theoretical conditions also identify regimes where mesoscopic rewiring is unlikely to be beneficial -- low-degree regular graphs and graphs with structural bottlenecks -- and these predictions match the observed failures. We additionally observe tight correlations between performance and structural properties in the rewired graphs.2026-06-03T16:07:24ZMeher ChaitanyaMy LeLuana Ruizhttp://arxiv.org/abs/2602.11406v2The Cost of Learning Under Multiple Change Points2026-06-03T16:06:48ZWe consider an online learning problem in environments with multiple change points. In contrast to the single change point problem that is widely studied using classical "high confidence" detection schemes, the multiple change point environment presents new learning-theoretic and algorithmic challenges. Specifically, we show that classical methods may exhibit catastrophic failure (high regret) due to a phenomenon we refer to as endogenous confounding. To overcome this, we propose a new class of learning algorithms dubbed Anytime Tracking CUSUM (ATC). These are horizon-free online algorithms that implement a selective detection principle, balancing the need to ignore "small" (hard-to-detect) shifts, while reacting "quickly" to significant ones. We prove that the performance of a properly tuned ATC algorithm is nearly minimax-optimal; its regret is guaranteed to closely match a novel information-theoretic lower bound on the achievable performance of any learning algorithm in the multiple change point problem. Experiments on synthetic as well as real-world data validate the aforementioned theoretical findings.2026-02-11T22:16:20ZA version of this work has been accepted for publication in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South KoreaTomer GafniGarud IyengarAssaf Zeevihttp://arxiv.org/abs/2602.06883v3Vision Transformer Finetuning Benefits from Non-Smooth Components2026-06-03T15:54:18ZThe smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their \emph{plasticity}. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies a low smoothness. Our theoretical analysis and extensive experiments -- over $1,000$ finetuning runs on large-scale vision transformers -- showcase that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on transformers' functional properties. The code is available at https://github.com/ambroiseodt/vit-plasticity.2026-02-06T17:12:22ZAccepted at ICML 2026Ambroise OdonnatLaetitia ChapelRomain TavenardIevgen Redkohttp://arxiv.org/abs/2605.20468v2CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support2026-06-03T15:45:32ZEffective medication management in Parkinson's Disease (PD) is challenging due to heterogeneous disease progression, variable patient response, and medication side effects. While AI models can forecast levodopa equivalent daily dose (LEDD) as a measure of medication needs, standard uncertainty quantification often fails to communicate the reliability of these predictions, treating high and low confidence clinical decisions identically. We introduce CASCADE (Calibrated Adaptive Scaling via Conformal And Distributional Estimation), a novel conformal prediction framework that propagates epistemic uncertainty from a screening classifier to adapt downstream predictions. Unlike standard conformal methods that rely on auxiliary residual regression, we leverage epistemic uncertainty from a primary classification task (identifying whether a medication change is needed) to dynamically scale the prediction intervals of a secondary regression task (predicting how much change). By mapping Venn-Abers multi-probabilistic uncertainty directly to non-conformity scores, our framework achieves continuous risk adaptation. We demonstrate that this ``cascade effect'' produces highly efficient intervals for confident patients (38.9% narrower than standard conformal baselines) while automatically expanding intervals to ensure robust coverage for uncertain cases, bridging the gap between discrete clinical decision-making and continuous dose forecasting in PD.2026-05-19T20:30:10ZAccepted to ICML 2026 AgenticUQ Workshop. 14 Pages, 3 FiguresRicardo Diaz-RinconMuxuan LiangAdolfo Ramirez-ZamoraBenjamin Shickelhttp://arxiv.org/abs/2605.18931v2Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models2026-06-03T15:30:42ZHeavy-tailed distributions are prevalent in performance evaluation, network traffic, and risk modeling. This behavior poses a fundamental challenge for modern deep generative models. Standard Variational Autoencoders (VAEs) employ Gaussian decoder likelihoods and Lipschitz-constrained neural networks, a combination that is structurally incapable of producing heavy-tailed outputs: the Gaussian tail decays exponentially, and Lipschitz continuity prevents the decoder from amplifying rare events from the latent space input to sufficiently overcome this decay. We provide both a theoretical characterization of this limitation and a controlled empirical demonstration using synthetic Pareto data across a grid of tail indices $α$ $\in$ {2, 3, 5, 30} and dimensions d $\in$ {1, 5, 10}. As a solution, we replace the Gaussian decoder with a Phase-Type (PH) distribution based on Markov chains, while keeping the encoder, latent space, and training procedure identical. PH distributions allow for arbitrarily precise approximations of any positive-valued distributions, including heavy-tailed families. Experiments showed that the PH-based model reduces tail Kolmogorov-Smirnov distance by up to x6 and extreme quantile error by up to x10 compared to the Gaussian baseline for heavy-tailed data. These results demonstrate that integrating Markov chain-based distributions into the decoder of a generative model institutes a principled and practically effective solution to the heavy-tail generation problem.2026-05-18T14:21:46Z22nd European Performance Engineering Workshop (EPEW 2026), Jun 2025, Grimstad, NorwayAbdelhakim ZianiAndras HorvathPaolo Ballarinihttp://arxiv.org/abs/2512.21917v3Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model2026-06-03T15:13:15ZPolicy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function. We formulate an $f$-divergence-constrained reward maximization problem and show that realizability in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-induced index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than impose identifiability of structural parameters of such a model and estimate them, as in econometrics, we develop methods that directly learn policies, with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable and nonparametric indices. We prove link-agnostic convergence guarantees in terms of generic function complexity measures and validate the methods and theory empirically. Code is available at https://github.com/causalml/spo/.2025-12-26T08:22:41ZNathan Kallushttp://arxiv.org/abs/2606.04946v1A General Framework for Dynamic Consistent Submodular Maximization2026-06-03T14:35:13ZConsistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions.
We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a $\frac 12 - O(\varepsilon)$ approximation that is $O\left(\frac{1}{\varepsilon^2}\right)$ consistent. For rank-$k$ matroid constraints, we construct a $\frac 14 - O(\varepsilon)$ approximation to the dynamic optimum that is $O\left(\frac{\log k}{\varepsilon^2}\right)$ consistent.2026-06-03T14:35:13ZAccepted at ICML 2026Paul DüttingFederico FuscoSilvio LattanziAshkan Norouzi-FardOla SvenssonMorteza Zadimoghaddamhttp://arxiv.org/abs/2604.00915v2Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects2026-06-03T14:27:23ZEstimation of heterogeneous long-term treatment effects (HLTEs) is relevant for personalized decision-making in marketing, economics, and medicine, where short-term observational datasets are often combined with long-term observational datasets. However, HLTE estimation is challenging due to limited overlap in treatment assignments or in long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large finite-sample variance. To address this challenge, we introduce the LT-O-learners (Long-Term Orthogonal Learners), a set of novel orthogonal learners for HLTE estimation in the canonical HLTE setting with surrogacy. The key idea of our LT-O-learners is to retarget the loss via custom overlap weights that downweight low-overlap samples. We show that the retargeted loss recovers the true HLTE pointwise and satisfies Neyman-orthogonality. We further prove two key theoretical results: (i) The nuisance error enters the error bound only through higher-order terms, which means our learners are robust to nuisance estimation error. (ii) Under a linear function class, the retargeting effectively controls the asymptotic variance of the HLTE estimator via the overlap weights in low-overlap regimes. We conduct experiments on synthetic and real-world datasets to confirm the theoretical properties of our LT-O-learners, particularly robustness in low-overlap regimes. To our knowledge, ours are the first orthogonal learners for HLTE estimation robust to low overlap in long-term settings.2026-04-01T13:56:19ZHaorui MaDennis FrauenValentyn MelnychukStefan Feuerriegelhttp://arxiv.org/abs/2606.04930v1AdaKoop: Efficient Modeling of Nonlinear Dynamics from Nonstationary Data Streams with Koopman Operator Regression2026-06-03T14:23:32ZReal-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while preserving computational efficiency. However, nonlinear dynamics are so complex that capturing dynamically changing nonlinear patterns and utilizing them for downstream tasks under strict time constraints is nontrivial. To bridge the gap between nonlinear complexity and computational tractability, this study applies Koopman operator theory, which states that nonlinear dynamics can be represented as linear transitions in an infinite-dimensional space. Building upon finite-dimensional approximations of this operator, we present AdaKoop, an efficient streaming algorithm for modeling nonlinear dynamics over nonstationary data streams. Our approach utilizes a probabilistic framework grounded in Koopman operator theory, treating both raw observations and reproducing kernel Hilbert space (RKHS) features as emissions from latent vectors. This dual-view formulation allows nonlinear dynamics to be expressed as a tractable linear system. Therefore, AdaKoop enables the efficient and stable modeling of nonlinear dynamics in a streaming fashion, avoiding the prohibitive computational costs of iterative nonlinear optimization. Furthermore, to address nonstationarity in data streams, AdaKoop adaptively detects the switching of patterns via statistical hypothesis testing for abrupt pattern shifts and incrementally updates model parameters to handle continuous changes. Extensive experiments on a total of 71 practical benchmark datasets across various domains demonstrate that AdaKoop outperforms state-of-the-art methods in terms of real-time forecasting accuracy and computational efficiency.2026-06-03T14:23:32ZAccepted by KDD'26The 32nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2026Naoki ChiharaRen FujiwaraYasuko MatsubaraYasushi Sakurai10.1145/3770855.3817851http://arxiv.org/abs/2606.04916v1Worker Utility as Hysteresis: A Preisach Model of Transaction Acceptance in Gig Labour Markets2026-06-03T14:12:03ZWorker utility is not observed -- only its consequence is. Each gig transaction produces a single bit: accepted or rejected. We argue this structure points directly to the Preisach hysteresis model as the natural representation of latent worker preferences. The Preisach operator models aggregate output as an integral over a population of binary threshold elements -- precisely the structure that emerges when heterogeneous workers each carry a private acceptance wage. We estimate two latent utility surfaces: acceptance utility U_1(X) and rejection utility U_0(X), via a dual-output neural network (shared layers 256->128, margin loss enforcing U_1 >= U_0). Classification reduces to the Preisach gap U_1(X) - U_0(X), passed into an XGBoost classifier alongside clip-stabilised price-to-threshold encodings. On 36,891 gig transactions, this pipeline achieves Jaccard = 0.827 and ROC AUC = 0.799. The price-to-threshold encoding accounts for +11.0 pp AUC over raw utility features. The model confirms the directional asymmetry hysteresis predicts: price decreases depress completion rates more than equivalent increases raise them. Applied to the full dataset, the model's recommendations simultaneously reduce the total wage bill by 21.3% and increase expected fill rate by 9.7 pp. For 74.2% of transactions, P(accept) already exceeds 0.80; reducing the wage keeps it above threshold (mean post-cut P = 0.972), releasing cost savings (median 31%). For the remaining 25.4%, a median 7% wage increase recovers +43 pp acceptance. A model without an explicit indifference zone cannot execute both moves simultaneously.2026-06-03T14:12:03Z18 pages, 5 figuresPiotr Frydrychhttp://arxiv.org/abs/2505.15354v3Post-Training Corrections for Improved Time-Series Forecasting2026-06-03T14:10:02ZTime-series forecasting is a critical task in various business domains, but it remains inherently challenging.
Typically, large forecasting models are trained in a single, resource-intensive run. Once training is completed, a natural question arises:~\emph{is there still potential for meaningful improvement in the model's performance?}
Motivated by techniques from boosting, we introduce the concept of~\emph{post-training corrections}. This approach enhances a trained forecaster by sequentially applying a carefully selected set of corrections to its predictions.
Our method offers a lightweight, model-agnostic, and scalable strategy to improve forecasting performance in practical settings.
We provide theoretical foundations for the approach, starting with the affine correction case, and analyze the expected performance gains and computational costs in more general settings.
Across a range of benchmark datasets, our method consistently delivers up to a $30\%$ improvement in forecasting accuracy over existing state-of-the-art models, with minimal computational overhead.2025-05-21T10:30:02ZHamza CherkaouiMalik TiomokoGiuseppe PaoloZhang YiliYu MengZhang KeliHafiz Tiomoko Alihttp://arxiv.org/abs/2606.04845v1Bayesian learning for the stochastic shortest path problem2026-06-03T13:13:41ZSequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function $Q^*$, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions and ad-hoc approximations. Our approach is to directly construct the posterior beliefs for $Q^*$ through Bellman's optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To facilitate simpler inference, we relax the likelihood so that a Lebesgue density exists. The flip side is to create unidentifiability issues. Specifically, the relaxed posterior can have significant mass on improper decision rules, while the exact posterior will not.
We also calculate the exact posterior probabilities for optimal action selections for the tabular parametrisation of $Q^*$, a Gaussian likelihood relaxation and a Gaussian prior, which is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark verify our findings. We demonstrate that our framework faithfully quantifies uncertainty and, compared to other temporal-difference-based Bayesian methodologies, is more data efficient. We conclude with recommendations for future work.2026-06-03T13:13:41Z50 pages, 19 figuresChon Wai HoSumeetpal S. SinghJiaqi Guohttp://arxiv.org/abs/2509.23385v5Flow Matching Calibration for Simulation-Based Inference under Model Misspecification2026-06-03T12:14:13ZSimulation-based inference (SBI) is transforming experimental sciences by enabling parameter estimation in complex non-linear models from simulated data. A persistent challenge, however, is model misspecification. In a Bayesian setting, targeting posterior distributions, errors may arise from the simulator, the noise or prior modelling. These model components are only approximations of reality, and severe mismatches can yield biased or overconfident posteriors. We address this issue by introducing Flow Matching Corrected Posterior Estimation (FMCPE), a framework that leverages the flow matching paradigm to refine simulation-trained posterior estimators using a small set of calibration samples. Our approach proceeds in two stages: first, a posterior approximator is trained on abundant simulated data; second, flow matching transports its predictions toward the true posterior supported by calibration observations. We rely on the later to guide the correction, without requiring explicit knowledge of the misspecification form or of which model components are affected. This design enables FMCPE to combine the scalability of SBI with robustness to distributional shift. Across synthetic benchmarks and real-world datasets, we show that our proposal consistently mitigates the effects of misspecification, delivering improved inference accuracy and uncertainty quantification compared to standard SBI baselines, while remaining computationally efficient.2025-09-27T16:10:53ZPierre-Louis RuhlmannMichael ArbelFlorence ForbesPedro L. C. Rodrigueshttp://arxiv.org/abs/2602.19799v2Path-conditioned training: a principled way to rescale ReLU neural networks2026-06-03T12:05:22ZDespite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled weights implement the same function, the training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion to rescale neural network parameters which minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.2026-02-23T12:55:48ZProceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306 (2026)Arthur LebeurrierTitouan VayerRémi Gribonval