http://arxiv.org/abs/2604.10712v2
Integrative learning of individualized treatment rules from multiple studies with partially overlapping treatments
2026-06-02T15:35:05Z
An individualized treatment rule (ITR) tailors treatments to a patient's specific characteristics. However, randomized controlled trials (RCTs) are often underpowered to detect the treatment effect heterogeneity needed for reliable ITR estimation. To address this limitation, there is growing interest in leveraging information from multiple studies to improve statistical power and support individualized decision-making. A key challenge in this context is that available RCTs may not evaluate the same set of treatments. In this paper, we propose an integrative learning framework that synthesizes evidence across multiple RCTs that share a common comparator but differ in their alternative treatment arms. Our method integrates information through a regularized weighted misclassification risk function and adaptively determines the contribution of each study to the ITRs of the others. We rigorously study the excess risk of the resulting estimator. Simulation studies demonstrate that the proposed approaches improve the estimation of both value and benefit functions. We illustrate the utility of our methodology using data from two landmark studies of major depressive disorder: the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care study and the International Study to Predict Optimized Treatment in Depression study, both of which include a selective serotonin reuptake inhibitor as a common treatment arm. We find that the separate learning method outperforms one-size-fits-all methods, and our integrative methods further improve performance.
2026-04-12T16:12:14Z
Biometrics, 82(2): Article ujag083, 2026
Yuan Bian, Donglin Zeng, Hyun-Joon Yang, Leanne M.
Williams, Yuanjia Wang
10.1093/biomtc/ujag083

http://arxiv.org/abs/2606.03750v1
Extending TCLUST to higher dimensions
2026-06-02T15:02:12Z
Outliers are known to significantly distort the results of many commonly used clustering methods, often leading to unreliable partitions. To address this issue, several robust clustering approaches have been developed that not only reduce their influence but also facilitate the detection of meaningful outliers. This presentation focuses on robust clustering methods based on trimming, especially TCLUST, which extends the type of trimming used by MCD in one-population problems to the more general case of multiple and unknown clusters. While TCLUST performs well on low-dimensional data, it struggles with high-dimensional datasets due to the complexity of estimating a large number of parameters. The Robust Linear Grouping (RLG) method offers an alternative by assuming clusters lie near lower-dimensional subspaces, thereby combining clustering with dimensionality reduction. However, RLG has limitations when subspaces intersect and assumes overly simplistic isotropic orthogonal errors. A robust clustering method extending TCLUST will be presented, building on the High Dimensional Data Clustering (HDDC) approach by incorporating trimming and eigenvalue constraints. This new approach, called tHHDC, combines TCLUST and RLG, requiring careful modification and integration of both methodologies within that HDDC framework. A study of the theoretical properties of this approach, together with a feasible algorithm for its implementation, will be presented.
The interest of the proposed methodology, along with the issue of selecting input parameters, will be illustrated through a simulation study and a real-data example.
2026-06-02T15:02:12Z
Lucía Trapote Reglero, Luis Ángel García Escudero, Agustín Mayo Íscar

http://arxiv.org/abs/2606.03702v1
Dynamic Mini Max Design and Sequential HB Inference for Repeated Surveys
2026-06-02T14:21:46Z
This paper develops a Dynamic Mini-Max (DMM) framework for repeated surveys comprising a Dynamic Mini-Max Design and a Sequential Hierarchical Bayes Update (SHBU). The DMM jointly optimizes sample size and wave overlap subject to simultaneous precision constraints for levels and movements, a respondent burden limit, and a fieldwork budget.
The methods are illustrated using 2021 Australian Census data (t = 1) and simulated waves t = 2, 3, 4. Both the DMM and the classical design start from the same 5% proportional allocation of n_A = 42,018 units. The DMM reduces this to n* = 40,251 while meeting all precision constraints, achieving a cost saving of approximately 6.3%.
Level coverage is comparable between the two designs (maximum absolute relative error (MARE) ratio 0.844--1.263). Movement coverage diverges markedly: the DMM achieves 100% across all 27 domain-variable cells, while the classical design achieves only 82%--96% (87.5%--95.0% nationally).
The classical confidence interval understates movement uncertainty because it addresses sampling variance only and does not account for the model variance component V_mod_hat. Additional benefits of the DMM framework -- including coherent joint inference for levels and movements, sequential updating without ad hoc composite-estimator chaining, and small area estimation -- are outlined in the paper.
2026-06-02T14:21:46Z
41 pages, 4 figures
Siu-Ming Tam

http://arxiv.org/abs/2606.03670v1
Projection Diagnostics for Directional Asymmetry and Tail-Ratio Departure in Multivariate Data
2026-06-02T13:54:51Z
We study projection-based diagnostics for distinguishing directional asymmetry from tail-ratio departure in multivariate data. The procedure reduces the problem to one-dimensional projections and computes two quantile-based summaries: a directional skewness measure evaluated over several quantile levels, and an interquantile tail-ratio evaluated relative to a chosen benchmark. The two summaries lead to a four-regime classification: symmetric benchmark-tail, symmetric tail-departed, skewed benchmark-tail, and skewed tail-departed. The quantile formulation avoids relying on third and fourth moments, which can be unstable in heavy-tailed settings. We establish population properties under central symmetry and ellipticity, uniform finite-sample bounds over the searched directions, and consistency of the threshold classifier under separated regimes. A sparse rank-one calculation is also used to show why coordinate directions can complement random directions in high dimensions.
The resulting diagnostic is meant to guide subsequent modelling choices, for example whether a symmetric, skewed, tail-departed, or combined multivariate model is appropriate.
2026-06-02T13:54:51Z
Sayantan Banerjee, Soudeep Deb

http://arxiv.org/abs/2606.03665v1
Sparse Tree-Based Aggregation for Time Series Regressions
2026-06-02T13:50:32Z
High-dimensional time series regressions are often regularized to produce sparse coefficients. We show that temporal aggregation provides a powerful alternative to reduce dimensionality in high-order autoregressions and mixed-frequency regressions. To this end, we propose StarTime (Sparse Tree-based Aggregation for Time Series), a convex penalization method that uses a temporal tree to arrange lags hierarchically from high to low frequency. StarTime then flexibly selects coefficients to be aggregated at possibly varying frequencies, sparse or a combination thereof. We provide new error bounds for StarTime, demonstrate improved estimation accuracy and recovery of aggregation and sparsity in simulations relative to benchmarks, and illustrate StarTime's relevance for financial and macroeconomic applications.
2026-06-02T13:50:32Z
Marie Corillon, Stephan Smeekes, Ines Wilms

http://arxiv.org/abs/2606.01184v2
Topological Ignorability for Structural Causal Effects Beyond Means
2026-06-02T13:19:17Z
Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regimes, create loops or holes, generate branches, or reorganize an outcome cloud while leaving the average response nearly unchanged. In such settings, mean-based causal estimands such as the average treatment effect may miss important structural effects.
We introduce topological-geometrical causal metrics based on summaries of interventional outcome laws, including density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries. These metrics quantify structural differences between treated and untreated outcome laws beyond averages. We also study the assumptions needed for causal interpretation. We introduce topological ignorability, a topological analogue of conditional ignorability that requires invariance of the chosen structural feature rather than the full counterfactual distribution. When the chosen summary is injective, this condition coincides with weak ignorability; for noninjective summaries, it can identify the structural feature of interest without identifying the full interventional law.
We define a covariate-standardized topological-geometrical causal effect and develop practical estimators. We validate the framework in two hidden-confounding benchmarks: a fully synthetic exact benchmark and a real-covariate semi-synthetic benchmark using Wisconsin breast-cancer covariates. In both, weak ignorability fails and balancing observed covariates nearly eliminates standardized mean differences, yet the coordinate-mean average treatment effect remains biased. By contrast, selected finite density-superlevel Betti and Euler contrasts remain stable across oracle, observational, and weighted analyses.
2026-05-31T11:56:53Z
This is a new version of our paper titled: Beyond Means: Topological Causal Effects under Persistent-Homology Ignorability. So we will resubmit this as version 2 of arXiv:2603.14169
Usef Faghihi

http://arxiv.org/abs/2606.03477v1
Surrogate-assisted optimal sampling for risk prediction under measurement constraints
2026-06-02T10:53:50Z
In many risk prediction problems, covariates and a response surrogate are routinely available for a large target population, whereas the true response is costly to ascertain and is observed only for a limited subset. This creates a design problem: one must decide which observations should receive response measurement in order to build a prediction model under a fixed measurement budget. We propose a surrogate-assisted optimal sampling framework for risk prediction under measurement constraints. In the target setting, the surrogate identifies confirmed positive cases, while responses for surrogate-negative observations remain unobserved and can be selectively measured, and thus the sampling design determines how the response measurement budget is allocated. Our framework constructs an optimal sampling design minimizing the leading term of the expected out-of-sample cross-entropy loss and incorporates the resulting design into an inverse-probability-weighted cross-entropy estimator.
The proposed design depends only on covariates, the surrogate, and a preliminary estimator, and therefore does not require responses from unlabeled observations at the design stage. We establish consistency, asymptotic normality, and leading-order prediction optimality of the resulting estimator. Extensive simulation studies and two real data applications demonstrate that the proposed design improves prediction performance and exhibits robustness under surrogate misspecification and rare outcome settings.
2026-06-02T10:53:50Z
Sunhyun Park, Seong-ho Lee

http://arxiv.org/abs/2511.05050v3
Estimating Bidirectional Causal Effects with Large Scale Online Kernel Learning
2026-06-02T10:50:28Z
In this study, a scalable online kernel learning framework is proposed for estimating bidirectional causal effects in systems characterized by mutual dependence and heteroskedasticity. Traditional causal inference often focuses on unidirectional effects, overlooking the common bidirectional relationships in real-world phenomena. Building on heteroskedasticity-based identification, the proposed method integrates a quasi-maximum likelihood estimator for simultaneous equation models with large scale online kernel learning. It employs random Fourier feature approximations to flexibly model nonlinear conditional means and variances, while an adaptive online gradient descent algorithm ensures computational efficiency for streaming and high-dimensional data. Results from extensive simulations demonstrate that the proposed method achieves greater accuracy and stability than single equation and polynomial approximation baselines, exhibiting lower bias and root mean squared error across various data-generating processes. These results confirm that the proposed approach effectively captures complex bidirectional causal effects with near-linear computational scaling.
By combining econometric identification with modern machine learning techniques, the proposed framework offers a practical, scalable, and theoretically grounded solution for large scale causal inference in natural/social science, policy making, business, and industrial applications.
2025-11-07T07:44:06Z
Proceedings of the 2025 International Conference on Data Science and Intelligent Systems (DSIS 2025), Article 65, pp. 449-455
Masahiro Tanaka
10.1109/DSIS67228.2025.11390623

http://arxiv.org/abs/2606.03429v1
Modeling Discrete Data with High-Order Vector Potts Models
2026-06-02T10:18:36Z
Modeling high-dimensional data is challenging, yet essential to understanding many complex systems. Maximum entropy models such as Ising and Potts models have been used extensively to capture pairwise interactions from correlation patterns in data, making it possible to infer graphical representations of complex systems from observations (e.g., from protein sequences or neural population activity). Recently, there has been growing interest in modeling higher-order correlation patterns involving three or more variables simultaneously. While progress has been made in binary data with high-order Ising models, we extend this framework to the more general case of discrete data.
We introduce q-state spin models, a complete family of maximum entropy models that generalize the vector Potts model to include long-range and arbitrary high-order interactions. In the pairwise case, our models allow for more diverse interaction types compared to the standard vector Potts model. We discuss their statistical interpretation with examples and relate them to discrete Fourier analysis. Using a loop expansion of the partition function, we show that the statistical properties of spin models are fully captured by the algebraic structure of their interactions. We define gauge transformations under which this structure, and thus the partition function, remains invariant. Models equivalent under gauge transformations can be seen as different representations of the same abstract statistical model, despite generally having interactions of different orders, extending results from the binary case. For practical application to data analysis, we focus on a subset of models known in the binary case as Minimally Complex Models, generalizing them to discrete data. We obtain a closed-form expression for the marginal likelihood of these models, enabling fast model selection. We illustrate their use with simple real-world examples.
2026-06-02T10:18:36Z
89 pages, 16 figures
Aaron De Clercq, Merijn Moody, Clélia de Mulatier

http://arxiv.org/abs/2510.20372v4
Testing Most Influential Sets
2026-06-02T09:35:36Z
Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets.
Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence - the heavy-tailed Fréchet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate the approach through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc heuristics with rigorous inference.
2025-10-23T09:12:29Z
Published as a conference paper at ICLR 2026
Lucas D. Konrad, Nikolas Kuschnig

http://arxiv.org/abs/2606.03230v1
Predictively-Oriented Kalman Filtering
2026-06-02T06:47:45Z
This paper presents a post-Bayesian approach to online filtering in nonlinear state-space models, capable of avoiding over-confident inferences in settings where either the dynamical model, the measurement model, or both, could be misspecified. This is addressed using predictively oriented (PrO) posteriors, an emerging paradigm in which learning (i.e., posterior concentration) occurs if and only if the overall model is well-specified, without strict adherence to Bayes' theorem. As the characterisation of PrO posteriors is challenging, our main technical contribution is a fast approximate linear-Gaussian update procedure, analogous to an (iterated) extended Kalman filter. The methodology, which we call EKF-PrO, has no tunable hyper-parameters and has a computational cost comparable to that of existing filtering methods. Performance is empirically assessed on a range of linear and non-linear applications, in which the state-space model is systematically misspecified.
2026-06-02T06:47:45Z
Zheyang Shen, Gerardo Duran-Martin, Chris. J.
Oates

http://arxiv.org/abs/2602.01483v2
Causal Preference Elicitation
2026-06-02T06:41:10Z
We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any black-box observational posterior, we model noisy expert judgments with a three-way likelihood over edge existence and direction. Posterior inference uses a flexible particle approximation, and queries are selected by an efficient expected information gain criterion on the expert's categorical response. Experiments on synthetic graphs, protein signaling data, and a human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under tight query budgets.
2026-02-01T23:34:34Z
Edwin V. Bonilla, He Zhao, Daniel M. Steinberg

http://arxiv.org/abs/2606.03211v1
Optimized Labeling Resource Allocation for Prediction-Assisted Inference via OPAL
2026-06-02T06:09:04Z
Active Statistical Inference is a new framework to make precise claims about population parameters with provable statistical guarantees. It uses a predictive "black-box" machine learning (ML) model to strategically decide which data points to label, roughly prioritizing samples for which the ML model is unsure about their label values. A major issue is that the framework can be brittle when uncertainty estimates are noisy. This paper introduces OPAL (Optimized Policy for Allocation of Labels), which learns a labeling strategy within a tractable class of smooth policies to yield estimators with the lowest variance. In effect, OPAL is an end-to-end pipeline that turns a black-box model's uncertainty scores into a data-adaptive labeling strategy and then performs inference on the collected samples. We evaluate OPAL on real datasets spanning medical imaging data, computational social science, and proteomics.
As a concrete example, we consider predicting breast cancer subtype from histopathology images and using OPAL to form valid confidence intervals for odds ratios for different demographic groups. We show that OPAL achieves nominal coverage in finite samples and has the accuracy one expects from methods which have far more labeled samples.
2026-06-02T06:09:04Z
Virginia L. Ma, Emmanuel J. Candès

http://arxiv.org/abs/2606.03154v1
Efficient Federated Estimation and Inference for High-Dimensional Tail Index Regression
2026-06-02T05:03:05Z
Tail index regression studies how covariates affect tail heaviness in heavy-tailed data. In many applications, data are distributed across heterogeneous sources, where direct pooling is infeasible due to privacy or regulatory constraints. Existing methods mainly focus on single-dataset analysis and do not address heterogeneous federated settings. We develop a personalized federated framework for high-dimensional tail index regression that accommodates client heterogeneity while exploiting latent similarities across clients. The proposed estimator combines sparsity regularization with nonconcave fusion penalties to perform coefficient estimation, variable selection, and group recovery. We establish non-asymptotic convergence rates and show that the estimator enjoys an oracle property by consistently recovering the underlying grouping structure. For computation, we develop an ADMM-based federated algorithm with adaptive gradient updates and establish its convergence guarantees. We further propose a debiased federated inference procedure based on adaptive weighted aggregation across related clients, yielding valid confidence intervals and hypothesis tests with improved efficiency over target-only inference.
Simulation studies and real-data analysis demonstrate the effectiveness of the proposed methods.
2026-06-02T05:03:05Z
35 pages, 5 figures
Haoyu Geng, Liuhua Peng, Changliang Zou, Xiaolong Cui

http://arxiv.org/abs/2605.11602v3
A Unified Theory of Conditional Coverage in Conformal Prediction with Applications
2026-06-02T03:52:51Z
Conformal prediction provides prediction sets with finite-sample marginal coverage, but many applications require coverage guarantees that adapt to individual test points, a subpopulation, or a structural component of the data. Existing methods targeting conditional coverage are largely analyzed case by case, leaving limited general theory for understanding where conditional miscoverage comes from, how different procedures should be compared, and how such guarantees can be extended beyond i.i.d. data. We address these gaps through a unified framework and theory for conformal methods targeting conditional coverage. Our central contribution is a non-asymptotic decomposition of conditional miscoverage into three interpretable components: score-estimation error, finite-sample calibration error, and intrinsic conditional-mismatch error. This decomposition clarifies the mechanisms behind asymptotic conditional validity and places existing methods within a common analytical lens. Building on this framework, we derive principled guidance for conditional-coverage-oriented model selection, and develop localized methods with asymptotic conditional guarantees under covariate shift. Finally, we extend the framework to structured data, with concrete applications to graph-structured and hierarchical settings. Numerical experiments corroborate the theory and demonstrate the effectiveness of the proposed procedures.
2026-05-12T06:31:43Z
Upload Supplementary Materials
Yinjie Min, Liuhua Peng, Changliang Zou
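
For readers unfamiliar with the baseline that the last abstract starts from, the following is a minimal sketch of standard split conformal prediction with absolute-residual scores, which yields the finite-sample marginal coverage mentioned there. It is not the method of the listed paper and does not target conditional coverage. The names `conformal_radius`, `model`, and `draw`, as well as the toy linear data-generating process, are assumptions made purely for illustration.

```python
import math
import random


def conformal_radius(cal_pred, cal_y, alpha=0.1):
    """Half-width of a split-conformal interval from calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual,
    and returns infinity when the calibration set is too small.
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_pred, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[k - 1] if k <= n else math.inf


def model(x):
    # Hypothetical pre-fitted predictor; any fixed function works here.
    return 2.0 * x


def draw(m):
    # Toy data (an assumption): y = 2x + standard normal noise.
    xs = [random.uniform(-2.0, 2.0) for _ in range(m)]
    ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]
    return xs, ys


random.seed(0)
x_cal, y_cal = draw(500)    # held-out calibration split
x_test, y_test = draw(2000)  # fresh test points

radius = conformal_radius([model(x) for x in x_cal], y_cal, alpha=0.1)

# Interval for each test point is model(x) +/- radius.
covered = sum(
    model(x) - radius <= y <= model(x) + radius
    for x, y in zip(x_test, y_test)
)
coverage = covered / len(y_test)
print(f"radius={radius:.3f}, empirical coverage={coverage:.3f}")
```

The interval has the same width at every test point, so the guarantee holds only on average over test points. That is the gap between marginal and conditional coverage that the abstract describes.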