https://arxiv.org/api/RHXfXBZGxXIXqQTFhgMkbnvBduM
Feed updated: 2026-03-20T16:16:01Z

http://arxiv.org/abs/2603.17327v1
Empirical Likelihood Inference for Sen and Sen--Shorrocks--Thon Indices
Updated: 2026-03-18T03:44:37Z
The Sen index and the Sen-Shorrocks-Thon (SST) index are widely used poverty measures. Developing reliable inference for these measures enables effective comparison across different populations of interest. It is important to construct confidence intervals for the Sen index and SST index that provide better coverage probability and shorter interval length. Motivated by this, we discuss empirical likelihood (EL) and jackknife empirical likelihood (JEL) based inference for the Sen index. To derive a JEL-based confidence interval for the Sen and SST indices, we propose a new estimator for the Sen index using the theory of U-statistics and examine its properties. The large sample properties of the EL and JEL ratio statistics are studied. We also discuss EL and JEL-based inference for the SST index. The finite sample performance of the EL and JEL-based confidence intervals of both Sen and SST indices is evaluated through a Monte Carlo simulation study. Finally, we illustrate our methods using individual-level data from the Panel Study of Income Dynamics (PSID) survey from the US as well as Indian household-level income data for different states sourced from the Consumer Pyramids Household Survey (CPHS).
Published: 2026-03-18T03:44:37Z
Authors: Sreelakshmi N, Saparya Suresh, Sudheesh K. Kattumannil

http://arxiv.org/abs/2602.14286v3
Online LLM watermark detection via e-processes
Updated: 2026-03-18T03:26:04Z
Watermarking for large language models (LLMs) has emerged as an effective tool for distinguishing AI-generated text from human-written content. Statistically, watermark schemes induce dependence between generated tokens and a pseudo-random sequence, reducing watermark detection to a hypothesis testing problem on independence.
We develop a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online testing. We propose various methods to construct empirically adaptive e-processes that can enhance the detection power. The proposed methods are applicable to any sequential testing problem where independent pivotal statistics are available. In addition, theoretical results are established to characterize the power properties of the proposed procedures. Experiments demonstrate that the proposed framework achieves competitive performance compared to existing watermark detection methods.
Published: 2026-02-15T19:37:06Z
Authors: Weijie Su, Ruodu Wang, Zinan Zhao

http://arxiv.org/abs/2603.00269v3
Robust Regression with Student's T: The Role of Degrees of Freedom
Updated: 2026-03-18T02:57:45Z
Linear regression estimators are known to be sensitive to outliers, and one alternative to obtain a robust and efficient estimator of the regression parameter is to model the error with Student's $t$ distribution. In this article, we compare estimators of the degrees of freedom parameter in the $t$ distribution using frequentist and Bayesian methods, and then study properties of the corresponding estimated regression coefficient. We also include a comparison with some recommended approaches in the literature, including fixing the degrees of freedom and robust regression using the Huber loss. Our extensive simulations on both synthetic and real data demonstrate that estimating the degrees of freedom via the adjusted profile log-likelihood approach yields regression coefficient estimators with high accuracy, performing comparably to the maximum likelihood estimators where the degrees of freedom are fixed at their true values. These findings provide a detailed synthesis of $t$-based robust regression and underscore a key insight: the proper calibration of the degrees of freedom is as crucial as the choice of the robust distribution itself for achieving optimal performance.
The R package that implements our method is available at https://github.com/amanda-ng518/RobustTRegression.
Published: 2026-02-27T19:37:38Z
Authors: Amanda Ng, Shangkai Zhu, Archer Gong Zhang, Nancy Reid

http://arxiv.org/abs/2603.17294v1
Bayesian Scalar-on-Tensor Quantile Regression for Longitudinal Data on Alzheimer's Disease
Updated: 2026-03-18T02:43:01Z
As a general and robust alternative to traditional mean regression models, quantile regression avoids the assumption of normally distributed errors, making it a versatile choice when modeling outcomes such as cognitive scores that typically have skewed distributions. Motivated by an application to Alzheimer's disease data where the aim is to explore how brain-behavior associations change over time, we propose a novel Bayesian tensor quantile regression for high-dimensional longitudinal imaging data. The proposed approach distinguishes between effects that are consistent across visits and patterns unique to each visit, contributing to the overall longitudinal trajectory. A low-rank decomposition is employed on the tensor coefficients which reduces dimensionality and preserves spatial configurations of the imaging voxels. We incorporate multiway shrinkage priors to model the visit-invariant tensor coefficients and variable selection priors on the tensor margins of the visit-specific effects. For posterior inference, we develop a computationally efficient Markov chain Monte Carlo sampling algorithm. Simulation studies reveal significant improvements in parameter estimation, feature selection, and prediction performance when compared with existing approaches.
In the analysis of the Alzheimer's disease data, the flexibility of our modeling approach brings new insights as it provides a fuller picture of the relationship between the imaging voxels and the quantile distributions of the cognitive scores.
Published: 2026-03-18T02:43:01Z
Authors: Rongke Lyu, Marina Vannucci, Suprateek Kundu

http://arxiv.org/abs/2603.17278v1
Classifier Pooling for Modern Ordinal Classification
Updated: 2026-03-18T02:11:54Z
Ordinal data are prevalent in clinical and other domains, yet there is a lack of both modern, machine-learning-based methods and publicly available software to address them. In this paper, we present a model-agnostic method of ordinal classification, which can apply any non-ordinal classification method in an ordinal fashion. We also provide an open-source implementation of these algorithms, in the form of a Python package. We apply these models to multiple real-world datasets to show their performance across domains. We show that they often outperform non-ordinal classification methods, especially when the number of datapoints is relatively small or when there are many classes of outcomes. This work, including the developed software, facilitates the use of modern, more powerful machine learning algorithms to handle ordinal data.
Published: 2026-03-18T02:11:54Z
Authors: Noam H. Rotenberg, Andreia V. Faria, Brian Caffo

http://arxiv.org/abs/2603.17271v1
Wasserstein-type Gaussian Process Regressions for Input Measurement Uncertainty
Updated: 2026-03-18T01:51:21Z
Gaussian process (GP) regression is widely used for uncertainty quantification, yet the standard formulation assumes noise-free covariates. When inputs are measured with error, this errors-in-variables (EIV) setting can lead to optimistically narrow posterior intervals and biased decisions. We study GP regression under input measurement uncertainty by representing each noisy input as a probability measure and defining covariance through Wasserstein distances between these measures.
Building on this perspective, we instantiate a deterministic projected Wasserstein ARD (PWA) kernel whose one-dimensional components admit closed-form expressions and whose product structure yields a scalable, positive-definite kernel on distributions. Unlike latent-input GP models, PWA-based GPs (PWA-GPs) handle input noise without introducing unobserved covariates or Monte Carlo projections, making uncertainty quantification more transparent and robust.
Published: 2026-03-18T01:51:21Z
Comments: 22 pages
Authors: Hengrui Luo, Xiaoye S. Li, Yang Liu, Marcus Noack, Ji Qiang, Mark D. Risser

http://arxiv.org/abs/2603.17243v1
Transmuted logistic-exponential distribution - some new properties, estimation methods and application with infectious disease mortality data
Updated: 2026-03-18T00:57:21Z
Recently, a New Transmuted Logistic-exponential (NTLE) distribution was introduced and studied as an extension of the Logistic-Exponential Distribution (LED) with wider applicability in lifetime modelling. However, the maximum likelihood estimates (MLE) of NTLE are not in closed form, and the consistency of the estimates was not examined. Furthermore, some other important properties of NTLE, namely the Shannon entropy, Rényi entropy, stochastic ordering, mode, stress-strength reliability measure, residual life functions (mean and reverse), incomplete moments, and Bonferroni and Lorenz curves, are yet to be derived. Motivated by this, we derived and studied these important properties and evaluated the performance of ten estimation methods (Maximum Likelihood, Moments, Least Squares, Weighted Least Squares, Maximum product of Spacings, Anderson-Darling, Cramér-von Mises, percentile estimation, and Maximum Goodness-of-Fit methods) for NTLE parameters via Monte Carlo simulation using bias, mean square error, and root mean square error as evaluation criteria. Real-life infectious disease mortality data fitted to the distributions showed that NTLE has a better fit compared to its base distributions (Exponential and Logistic-Exponential).
This finding contributes valuable insights for researchers and practitioners when selecting the appropriate estimation methods, especially for NTLE and some similar distributions in non-closed form.
Published: 2026-03-18T00:57:21Z
Authors: Isqeel Ogunsola, Abosede Akintunde, Kehinde Yusuff, Basirat Adetona, Faheez Abdulrasaq

http://arxiv.org/abs/2603.14942v2
A System-Theoretic Approach to Hawkes Process Identification with Guaranteed Positivity and Stability
Updated: 2026-03-18T00:48:10Z
The Hawkes process models self-exciting event streams, requiring a strictly non-negative and stable stochastic intensity. Standard identification methods enforce these properties using non-negative causal bases, yielding conservative parameter constraints and severely ill-conditioned least-squares Gram matrices at higher model orders. To overcome this, we introduce a system-theoretic identification framework utilizing the sign-indefinite orthonormal Laguerre basis, which guarantees a well-conditioned asymptotic Gram matrix independent of model order. We formulate a constrained least-squares problem enforcing the necessary and sufficient conditions for positivity and stability. By constructing the empirical Gram matrix via a Lyapunov equation and representing the constraints through a sum-of-squares trace equivalence, the proposed estimator is efficiently computed via semidefinite programming.
Published: 2026-03-16T07:47:56Z
Comments: 7 pages, 2 figures
Authors: Xinhui Rong, Girish N. Nair

http://arxiv.org/abs/2603.17226v1
Difference-Based High-Dimensional Long-Run Covariance Matrix Estimation for Mean-shift Time Series
Updated: 2026-03-18T00:12:58Z
We consider estimation of high-dimensional long-run covariance matrices for time series with nonconstant means, a setting in which conventional estimators can be severely biased.
To address this difficulty, we propose a difference-based initial estimator that is robust to a broad class of mean variations, and combine it with hard thresholding, soft thresholding, and tapering to obtain sparse long-run covariance estimators for high-dimensional data. We derive convergence rates for the resulting estimators under general temporal dependence and time-varying mean structures, showing explicitly how the rates depend on covariance sparsity, mean variation, dimension, and sample size. Numerical experiments show that the proposed methods perform favorably in high dimensions, especially when the mean evolves over time.
Published: 2026-03-18T00:12:58Z
Authors: Yanhong Liu, Fengyi Song, Long Feng

http://arxiv.org/abs/2506.14531v2
A statistical framework for dynamic cognitive diagnosis in digital learning environments
Updated: 2026-03-17T22:48:26Z
Reading is foundational for educational, employment, and economic outcomes, but a persistent proportion of students globally struggle to develop adequate reading skills. Some countries promote digital tools to support reading development, alongside regular classroom instruction. Such tools generate rich log data capturing students' behaviour and performance. This study proposes a dynamic cognitive diagnostic modeling (CDM) framework based on restricted latent class models to trace students' time-varying skill mastery using log files from digital tools. Unlike traditional CDMs that require expert-defined skill-item mappings (Q-matrix), our approach jointly estimates the Q-matrix and latent skill profiles, integrates log-derived covariates (e.g., reattempts, response times, counts of mastered items) and individual characteristics, and models transitions in mastery using a Bayesian estimation approach. Applied to real-world data, the model demonstrates practical value in educational settings by effectively uncovering individual skill profiles and the skill-item mappings.
Simulation studies confirm robust recovery of Q-matrix structures and latent profiles with high accuracy under varied sample sizes, item counts, and Q-matrix sparsity levels. The framework offers a data-driven, time-dependent restricted latent class modeling approach to understanding early reading development.
Published: 2025-06-17T13:57:43Z
Authors: Yawen Ma, Anastasia Ushakova, Kate Cain, Gabriel Wallin

http://arxiv.org/abs/2603.13681v2
Generalized projection tests for function-valued parameters with applications to testing structural causal assumptions
Updated: 2026-03-17T21:47:01Z
Structural assumptions are central to the causal inference literature. In practice, it is often crucial to assess their validity or to test implications that follow from them. In many settings, such tests can be framed as evaluating whether a function-valued parameter equals zero. In this paper, we propose a class of generalized projection tests based on series estimators for function-valued parameters. We establish conditions under which the proposed tests are valid and illustrate their applicability through examples from the data fusion and instrumental variables literature. Our approach accommodates flexible machine learning methods for estimating nuisance parameters. In contrast to many existing approaches, the limiting distribution of the proposed test statistics is straightforward to compute under the null hypothesis. We apply our method to test the equality of conditional COVID-19 risk across vaccine arms in the COVID-19 Variant Immunologic Landscape (COVAIL) trial.
Published: 2026-03-14T01:35:52Z
Authors: Rui Wang, Albert Osom, Bo Zhang

http://arxiv.org/abs/2502.17292v3
Joint Value Estimation and Bidding in Repeated First-Price Auctions
Updated: 2026-03-17T21:24:05Z
We study regret minimization in repeated first-price auctions (FPAs), where a bidder observes only the realized outcome after each auction, win or loss.
This setup reflects practical scenarios in online display advertising where the actual value of an impression depends on the difference between two potential outcomes, such as clicks or conversion rates, when the auction is won versus lost. We incorporate causal inference into this framework and analyze the challenging case where only the treatment effect admits a simple dependence on observable features. Under this framework, we propose algorithms that jointly estimate private values and optimize bidding strategies under two different feedback types on the highest other bid (HOB): the full-information feedback where the HOB is always revealed, and the binary feedback where the bidder only observes the win-loss indicator. In both cases, our algorithms are shown to achieve near-optimal regret bounds. Notably, our framework has the distinctive feature that the treatments are actively chosen, which eliminates the need for the overlap condition commonly required in causal inference.
Published: 2025-02-24T16:21:50Z
Comments: POMS-HK 2026 Best Student Paper Finalist
Authors: Yuxiao Wen, Yanjun Han, Zhengyuan Zhou

http://arxiv.org/abs/2205.06868v2
Regression and Dimension Reduction for Multivariate Mixed-Type Data via Semiparametric Gaussian Copula
Updated: 2026-03-17T20:28:09Z
Clinical and epidemiological studies encode participant information in multivariate vectors with mixed-type variables on continuous, truncated, ordinal, and binary scales. Semiparametric Gaussian Copula (SGC) assumes that observed data are generated by latent multivariate normal random variables whose marginals are monotonically transformed and then truncated/ordinalized/binarized. In SGC, the latent correlation matrix fully determines the dependence structure and it is estimated through an inversion of "bridges" between Kendall's Tau rank correlations of observed variables and latent correlations.
By employing SGC, we develop regression (SGC-Reg), principal component analysis (SGC-PCA), and principal component regression (SGC-PCR) for latent representations of observed data. To build our framework, we make several key contributions: i) establishing novel bridging results for general ordinal type variables, ii) developing regression estimation on the latent space and deriving asymptotic normality of estimators, iii) developing a computationally efficient algorithm that reduces calculation complexity of all steps including calculation of asymptotic covariance matrix from $O(n^4)$ to $O(n\log n)$, iv) developing methods to predict latent representations of observed data and perform imputation of missing data, and v) developing principal component analysis and principal component regression on the latent space. We apply our framework to study the association between 5-year mortality and 61 frailty-related measures composed of 29 continuous, 17 ordinal, and 15 binary variables in 9478 participants of the 1999-2010 waves of the National Health and Nutrition Examination Survey (NHANES).
Published: 2022-05-13T19:55:13Z
Comments: 43 pages, 8 figures, 3 tables
Authors: Debangan Dey, Vadim Zipunnikov

http://arxiv.org/abs/2509.11381v2
The Honest Truth About Causal Trees: Accuracy Limits for Heterogeneous Treatment Effect Estimation
Updated: 2026-03-17T19:59:56Z
Recursive decision trees are widely used to estimate heterogeneous causal treatment effects in experimental and observational studies. These methods are typically implemented using CART-type recursive partitioning and are often viewed as adaptive procedures capable of discovering treatment effect heterogeneity in high-dimensional settings. We study causal tree estimators based on adaptive recursive partitioning and establish lower bounds on their estimation accuracy.
Under basic conditions, we show that causal trees constructed via standard CART-type splitting rules cannot achieve polynomial-in-$n$ convergence rates in the uniform norm (where $n$ denotes the sample size). The underlying mechanism is that greedy recursive partitioning selects highly imbalanced splits with non-vanishing probability, producing terminal nodes containing very few observations and leading to large estimation variance. We further show that sample splitting ("honesty") yields at most negligible improvements in convergence rates. As a consequence, causal tree estimators may converge arbitrarily slowly and can even be inconsistent in some settings. Our results also clarify the role of balanced partition assumptions in existing theoretical guarantees for causal forests and related ensemble methods. The analysis develops new probabilistic tools for studying adaptive recursive partitioning procedures, including non-asymptotic approximations for suprema of partial sums and Gaussian processes. As a technical by-product, we also identify and correct an error in Eicker (1979).
Published: 2025-09-14T18:29:45Z
Authors: Matias D. Cattaneo, Jason M. Klusowski, Ruiqi Rae Yu

http://arxiv.org/abs/2411.16902v4
Bounding causal effects with an unknown mixture of informative and non-informative missingness
Updated: 2026-03-17T19:27:58Z
In experimental and observational data settings, researchers often have limited knowledge of the reasons for missing outcomes. To address this uncertainty, we propose bounds on causal effects for missing outcomes, accommodating the scenario where missingness is an unobserved mixture of informative and non-informative components. Within this mixed missingness framework, we explore several assumptions to derive bounds on causal effects, including bounds expressed as a function of user-specified sensitivity parameters.
We develop influence-function based estimators of these bounds to enable flexible, non-parametric, and machine learning based estimation, achieving root-n convergence rates and asymptotic normality under relatively mild conditions. We further consider the identification and estimation of bounds for other causal quantities that remain meaningful when informative missingness reflects a competing outcome, such as death. We conduct simulation studies and illustrate our methodology with a study on the causal effect of antipsychotic drugs on diabetes risk using a health insurance dataset.
Published: 2024-11-25T20:13:32Z
Authors: Max Rubinstein, Denis Agniel, Larry Han, Marcela Horvitz-Lennon, Sharon-Lise Normand