https://arxiv.org/api/HUfCgUaUcim0yEtua0Heoyx9f5o 2026-03-20T12:38:34Z 34634 15 15 http://arxiv.org/abs/2603.18538v1 Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning 2026-03-19T06:46:55Z Decentralized Federated Learning (DFL) remains highly vulnerable to adaptive backdoor attacks designed to bypass traditional passive defense metrics. To address this limitation, we shift the defensive paradigm toward a novel active, interventional auditing framework. First, we establish a dynamical model to characterize the spatiotemporal diffusion of adversarial updates across complex graph topologies. Second, we introduce a suite of proactive auditing metrics: stochastic entropy anomaly, randomized smoothing Kullback-Leibler divergence, and activation kurtosis. These metrics utilize private probes to stress-test local models, effectively exposing latent backdoors that remain invisible to conventional static detection. Furthermore, we implement a topology-aware defense placement strategy to maximize global aggregation resilience. We provide a theoretical convergence result for the system under co-evolving attack and defense dynamics. Numerical evaluations across diverse architectures demonstrate that our active framework is highly competitive with state-of-the-art defenses in mitigating stealthy, adaptive backdoors while preserving primary task utility. 2026-03-19T06:46:55Z Sheng Pan Niansheng Tang http://arxiv.org/abs/2404.00256v5 Robust Bayesian modeling for Preprocessing Large-Scale Data 2026-03-19T02:46:24Z We develop a robust Bayesian analysis based on heavy-tailed modeling. It is common to impose a Student-$t$ distribution to eliminate the influence of outliers. We apply it to large-scale studies in Bayesian inference, and provide diagnoses for detecting outliers using the posterior predictive $p$-value ($ppp$). In addition, we propose an adaptive method to decide the level of the posterior FDR. 
We suggest determining it from an estimated ratio of true null genes obtained with Storey's $q$-value method. Our methods are demonstrated on gene expression data for colorectal cancer. 2024-03-30T06:03:40Z Yoshiko Hayashi http://arxiv.org/abs/2603.18404v1 Multi-Domain Causal Empirical Bayes Under Linear Mixing 2026-03-19T01:55:19Z Causal representation learning (CRL) aims to learn low-dimensional causal latent variables from high-dimensional observations. While identifiability has been extensively studied for CRL, estimation has been less explored. In this paper, we explore the use of empirical Bayes (EB) to estimate causal representations. In particular, we consider the problem of learning from data from multiple domains, where differences between domains are modeled by interventions in a shared underlying causal model. Multi-domain CRL naturally poses a simultaneous inference problem that EB is designed to tackle. Here, we propose an EB $f$-modeling algorithm that improves the quality of learned causal variables by exploiting invariant structure within and across domains. Specifically, we consider a linear measurement model and interventional priors arising from a shared acyclic SCM. When the graph and intervention targets are known, we develop an EM-style algorithm based on causally structured score matching. We further discuss EB $g$-modeling in the context of existing CRL approaches. In experiments on synthetic data, our proposed method achieves more accurate estimation than other methods for CRL. 2026-03-19T01:55:19Z Bohan Wu Julius von Kügelgen David M. Blei http://arxiv.org/abs/2507.23240v2 A-optimal Designs under Generalized Linear Models 2026-03-19T01:27:27Z Designing efficient experiments under practical constraints is critical in both scientific research and industrial practice. Focusing on minimizing the average variance of the parameter estimates, A-optimal designs show advantages in screening factors and reducing prediction errors. 
Compared with other criteria, however, algorithms and software for generating A-optimal designs are scarce. In this paper, we characterize A-optimal designs under generalized linear models theoretically and develop efficient algorithms for identifying them. When a predetermined finite set of experimental settings is given, we derive analytic solutions or establish necessary and sufficient conditions for obtaining A-optimal approximate allocations. We show that a lift-one algorithm based on our formulae outperforms commonly used algorithms for finding A-optimal allocations. When continuous factors or design regions are involved, we develop a ForLion algorithm that is guaranteed to find A-optimal designs with mixed factors. Numerical studies show that our algorithms can find highly efficient designs with reduced numbers of distinct experimental settings, which may save both experimental time and cost significantly. Along with a rounding-off algorithm that converts approximate allocations to exact ones, we demonstrate that stratified samplers based on A-optimal allocations may provide more accurate parameter estimates than commonly used samplers. 2025-07-31T04:40:22Z 34 pages, 2 figures, 9 tables Yingying Yang Xiaotian Chen Jie Yang http://arxiv.org/abs/2603.18378v1 BiSSLB: Binary Spike-and-Slab Lasso Biclustering 2026-03-19T00:46:41Z Biclustering is a powerful unsupervised learning technique for simultaneously identifying coherent subsets of rows and columns in a data matrix, thus revealing local patterns that may not be apparent in global analyses. However, most biclustering methods are developed for continuous data and are not applicable for binary datasets such as single-nucleotide polymorphism (SNP) or protein-protein interaction (PPI) data. Existing biclustering algorithms for binary data often struggle to recover biclustering patterns under noise, face scalability issues, and/or bias the final results towards biclusters of a particular size or characteristic. 
We propose a Bayesian method for biclustering binary datasets called Binary Spike-and-Slab Lasso Biclustering (BiSSLB). Our method is robust to noise and allows for overlapping biclusters of various sizes without prior knowledge of the noise level or bicluster characteristics. BiSSLB is based on a logistic matrix factorization model with spike-and-slab priors on the latent spaces. We further incorporate an Indian Buffet Process (IBP) prior to automatically determine the number of biclusters from the data. We develop a novel coordinate ascent algorithm with proximal steps which allows for scalable computation. The performance of our proposed approach is assessed through simulations and two real applications on HapMap SNP and Homo Sapiens PPI data, where BiSSLB is shown to outperform other state-of-the-art binary biclustering methods when the data is very noisy. 2026-03-19T00:46:41Z Sijian Fan Ray Bai http://arxiv.org/abs/2411.04380v2 Identification of Long-Term Treatment Effects via Temporal Links, Observational, and Experimental Data 2026-03-19T00:17:59Z Recent literature proposes combining short-term experimental and long-term observational data to provide alternatives to conventional observational studies for the identification of long-term average treatment effects (LTEs). This paper re-examines the identification problem and uncovers that assumptions restricting temporal link functions -- relationships between short-term and mean long-term potential outcomes -- are central in this context. The experimental data serve to amplify the identifying power of such assumptions; absent them, the combined data are no more informative than the observational data alone. Plausible inference thus hinges on justifiable restrictions in this class. Motivated by this, I introduce two treatment response assumptions that may be defensible based on economic theory or intuition. 
To utilize them and facilitate future developments, I develop a novel unifying identification framework that computationally produces sharp bounds on the LTE for a general class of temporal link function restrictions and accommodates imperfect experimental compliance -- thereby also extending existing approaches. I illustrate the method by estimating the long-term effects of Head Start participation. The findings indicate that the effects on educational attainment, employment, and criminal involvement are lasting but smaller in magnitude than those established by sibling comparisons. 2024-11-07T02:47:13Z Filip Obradović http://arxiv.org/abs/2603.14561v2 Refined Inference for Asymptotically Linear Estimators with Non-Negligible Second-Order Remainders 2026-03-18T23:42:18Z Asymptotically linear estimators in semiparametric models achieve their point-estimation guarantees via a von Mises expansion in which a second-order remainder is declared negligible. Confidence intervals then treat the first-order influence-function term as the sole source of sampling variability. This reasoning is asymptotically exact but can fail materially in finite samples whenever the second-order remainder contributes variation of the same order as the influence-function variance -- a regime we call the \emph{near-boundary regime}, characterized by nuisance estimation operating at or near the product-rate threshold. We develop a general theory of inference for this regime. 
Our contributions are: (i) a \emph{finite-sample variance decomposition} that separates influence-function variance from remainder-induced variance and the covariance between them; (ii) a \emph{sandwich consistency theorem} that gives a precise necessary and sufficient condition -- strong remainder negligibility -- for the standard sandwich to be consistent for the total sampling variance, and shows this is strictly stronger than the product-rate condition that guarantees asymptotic linearity; (iii) two \emph{refined variance estimators} -- leave-one-unit-out jackknife and pairs cluster bootstrap -- each with full asymptotic validity guarantees in the near-boundary regime, together with a heteroskedasticity-corrected sandwich interpretation that is numerically equivalent to the jackknife Wald interval; and (iv) a \emph{clustered-data extension} in which the remainder interacts with intra-cluster correlation to produce an analytic formula for sandwich gap amplification. 2026-03-15T19:23:26Z 32 pages, 3 tables, 1 supplement Lin Li Pengcheng Wu http://arxiv.org/abs/2603.18345v1 Synthetic Data, Information, and Prior Knowledge: Why Synthetic Data Augmentation to Boost Sample Doesn't Work for Statistical Inference 2026-03-18T23:10:05Z The use of synthetic data to deidentify data and to improve predictive models is well attested. The augmentation of datasets using synthetically generated data is an alluring proposition: in the best case, it generates realistic data \textit{in silico} at a fraction of the cost of authentic data which may be found \textit{in vivo} or \textit{in vitro}. This poses novel epistemic challenges. We contend that synthetic data augmentation is best understood as a novel way of accounting for prior knowledge. In this manuscript, we propose a definition of synthetic distributions and analyze how synthetic data augmentation interplays with standard accounts of maximum likelihood and Bayesian estimation. 
We observe that the marginal Fisher information contributed by synthetic data processes is subject to fundamental bounds, and enumerate obstacles to the use of synthetic data augmentation to aid in inferential tasks. We then articulate a Bayesian formulation of the way that synthetic data augmentation can be coherently understood, but argue that naive approaches to the specification of the prior are epistemically unjustifiable. This suggests that enhanced scrutiny must be placed on identifying justifiable priors to warrant the use and inclusion of data drawn from specific synthetic distributions. While our analysis shows the challenges and limitations of using synthetic data augmentation to improve upon traditional statistical model reasoning, it does suggest that augmentation is the principal approach by which analysts using outcome reasoning (i.e., using train/test splits to justify the analysis) can constrain an otherwise high-dimensional model space, providing an alternative to trying to encode the constraints into the potentially complex architecture of the algorithm. 2026-03-18T23:10:05Z Draft; feedback welcome Reid Dale Jordan Rodu Mike Baiocchi http://arxiv.org/abs/2307.12544v2 Adaptive debiased machine learning using data-driven model selection techniques 2026-03-18T20:56:47Z Debiased machine learning estimators for smooth functionals in nonparametric models can exhibit substantial variability and instability, often leading practitioners to instead rely on parametric or semiparametric working models. Such models, however, may be misspecified and can therefore introduce bias. We study how data-driven model selection can be combined with debiased machine learning to construct estimators that adapt to structure in the data-generating distribution. To this end, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework for constructing superefficient estimators of pathwise differentiable parameters. 
The framework unifies a broad class of previously proposed adaptive estimators, including methods based on variable selection, learned feature representations, and collaborative targeted learning. It requires only high-level conditions and approximate validity of the selection procedure, which are implied by lower-level conditions already assumed in important settings, including sieve-based selection, sparsity-based methods such as the Lasso, and data-adaptive feature representations. We show that ADML estimators yield regular and efficient root-\(n\) inference for an oracle projection parameter induced by a data-adaptive oracle submodel. This oracle parameter coincides with the target parameter at the true distribution but typically has a smaller efficiency bound, thereby yielding superefficiency for the target parameter. As a practical illustration, we introduce a broad class of automatic ADML estimators for continuous linear functionals of the outcome regression, in which model selection is performed directly on the regression itself. Motivated by overlap challenges in causal inference, we develop new superefficient plug-in estimators for the average treatment effect based on calibration in semiparametric regression models. 2023-07-24T06:16:17Z Lars van der Laan Marco Carone Alex Luedtke Mark van der Laan http://arxiv.org/abs/2603.18279v1 Covariate-Dependent Functional Principal Component Analysis for SHM 2026-03-18T20:54:13Z In Structural Health Monitoring (SHM), sensor measurements and derived features such as eigenfrequencies often exhibit systematic daily patterns and can therefore be naturally represented as functional data. Furthermore, these patterns are typically influenced by environmental factors, particularly temperature, which can substantially affect the observed system response. 
While most existing methods for removing environmental effects assume that confounding influences affect only the mean response, it has been shown that environmental and operational factors may also alter the covariance structure of the residual process. To address this limitation in a functional data monitoring framework, we incorporate so-called covariate-dependent functional principal component analysis (CD-FPCA), which allows eigenfunctions and eigenvalues of the residual process to vary smoothly with covariates such as temperature. The proposed methodology is illustrated using an extended version of the KW51 railway bridge eigenfrequency dataset. This case study suggests that accounting for covariate effects beyond the functional mean can improve the robustness of the monitoring procedure, in particular by reducing environmentally induced (false) alarms under challenging low-temperature conditions. 2026-03-18T20:54:13Z 10 pages, 3 figures, conference Philipp Wittenberg Lizzie Neumann Kristof Maes Jan Gertheiss http://arxiv.org/abs/2603.18149v1 Analysing Extreme Rainfall via a Geometric Framework 2026-03-18T18:00:06Z Motivated by the EVA 2025 Data Challenge, we address the problem of predicting extreme rainfall in the eastern United States using data from a large ensemble of climate model runs. The challenge focuses on three quantities of interest related to the spatial extent and/or temporal duration of extreme rainfall, each requiring extrapolation. To tackle these questions, we adopt the recently developed geometric framework for extreme-value analysis, offering substantial flexibility for capturing complex extremal dependence structures and enabling extrapolation across the entire multivariate tail. In this work, we focus on the spatial geometric framework for analysing the spatial extent and consider a sampling procedure that retains the temporal information in the data, thereby enabling estimation of the duration of extreme rainfall events. 
We also account for the non-stationary behaviour, arising from topographical and seasonal effects, that commonly characterises extreme weather events in both space and time. Using diagnostic metrics, we demonstrate that the proposed model is appropriate for inferring extreme events on this dataset and apply it to estimate target quantities of interest. 2026-03-18T18:00:06Z Ryan Campbell Kristina Grolmusova Lydia Kakampakou Jeongjin Lee http://arxiv.org/abs/2603.17984v1 On min-Storey estimators for multiple testing and conformal novelty detection 2026-03-18T17:46:33Z In a multiple testing task, finding an appropriate estimator of the proportion $π_0$ of non-signal in the data to boost power of false discovery rate (FDR) controlling procedures is a long-standing research theme, sometimes referred to as 'adaptive FDR control'. The interest in this theme has been reinforced in recent years with conformal novelty detection, for which it turns out that similar tools can be used in combination with any 'blackbox' machine learning algorithm. Nevertheless, perhaps surprisingly, finding a solution for 'adaptive FDR control' that is optimal in a broad sense is still an open problem. This paper fills this gap by introducing new $π_0$-estimators, referred to as min-Storey (MS) and interval-min-Storey (IMS), which are built upon the so-called 'Storey estimator'. Plugging these estimators into the adaptive Benjamini-Hochberg (BH) procedure is shown to deliver FDR control both in the independent and conformal settings. In addition, these methods satisfy an optimal power property over any (regular) alternative distribution. The excellent behavior of the new adaptive procedures is illustrated with numerical experiments both in the independent and conformal models for various distribution structures. 
2026-03-18T17:46:33Z 52 pages, 9 figures, 2 tables Gao Zijun Roquain Etienne http://arxiv.org/abs/2503.19068v2 Minimum Volume Conformal Sets for Multivariate Regression 2026-03-18T17:44:07Z Conformal prediction provides a principled framework for constructing predictive sets with finite-sample validity. While much of the focus has been on univariate response variables, existing multivariate methods either impose rigid geometric assumptions or rely on flexible but computationally expensive approaches that do not explicitly optimize prediction set volume. We propose an optimization-driven framework based on a novel loss function that directly learns minimum-volume covering sets while ensuring valid coverage. This formulation naturally induces a new nonconformity score for conformal prediction, which adapts to the residual distribution and covariates. Our approach optimizes over prediction sets defined by arbitrary norm balls, including single and multi-norm formulations. Additionally, by jointly optimizing both the predictive model and predictive uncertainty, we obtain prediction sets that are tight, informative, and computationally efficient, as demonstrated in our experiments on real-world datasets. 2025-03-24T18:54:22Z Sacha Braun Liviu Aolaritei Michael I. Jordan Francis Bach http://arxiv.org/abs/2408.08177v3 Localized Sparse Principal Component Analysis of Multivariate Time Series in Frequency Domain 2026-03-18T17:18:40Z Principal component analysis has been a main tool in multivariate analysis for estimating a low dimensional linear subspace that explains most of the variability in the data. However, in high-dimensional regimes, naive estimates of the principal loadings are inconsistent and difficult to interpret. 
In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process, particularly if the principal components are interpretable in that they are sparse in coordinates and localized in frequency bands. In this paper, we introduce a formulation and consistent estimation procedure for interpretable principal component analysis for high-dimensional time series in the frequency domain. An efficient frequency-sequential algorithm is developed to compute sparse-localized estimates of the low-dimensional principal subspaces of the signal process. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a study of first episode psychosis. 2024-08-15T14:30:34Z 63 pages, 6 figures Jamshid Namdari Amita Manatunga Fabio Ferrarelli Robert Krafty http://arxiv.org/abs/2603.17925v1 Multi-Armed Sequential Hypothesis Testing by Betting 2026-03-18T17:01:34Z We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective) and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of the ones that have oracle knowledge about which arm generates the most evidence against $\mathscr{P}$. 
Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently "estimable" rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest. 2026-03-18T17:01:34Z Ricardo J. Sandoval Ian Waudby-Smith Michael I. Jordan