https://arxiv.org/api/yU4cr/zYWWDPShUBCacrIYF6DMk 2026-03-28T09:14:34Z 22843 60 15 http://arxiv.org/abs/2305.10413v5 On Consistency of Signature Using Lasso 2026-03-22T07:58:48Z Signatures are iterated path integrals of continuous and discrete-time processes, and their universal nonlinearity linearizes the problem of feature selection in time series data analysis. This paper studies the consistency of signature using Lasso regression, both theoretically and numerically. We establish conditions under which the Lasso regression is consistent both asymptotically and in finite sample. Furthermore, we show that the Lasso regression is more consistent with the Itô signature for time series and processes that are closer to the Brownian motion and with weaker inter-dimensional correlations, while it is more consistent with the Stratonovich signature for mean-reverting time series and processes. We demonstrate that signature can be applied to learn nonlinear functions and option prices with high accuracy, and the performance depends on properties of the underlying process and the choice of the signature. 2023-05-17T17:48:52Z Xin Guo Binnan Wang Ruixun Zhang Chaoyi Zhao http://arxiv.org/abs/2404.04709v3 Two-Sided Flexibility in Platforms 2026-03-22T06:52:22Z Flexibility is a cornerstone of operations management, crucial to hedge stochasticity in product demands, service requirements, and resource allocation. In two-sided platforms, flexibility is also two-sided and can be viewed as the compatibility of agents on one side with agents on the other side. Platform actions often influence the flexibility on either the demand or the supply side. But how should flexibility be jointly allocated across different sides? Whereas the literature has traditionally focused on only one side at a time, our work initiates the study of two-sided flexibility in matching platforms. 
We propose an abstract matching model in random graphs and identify the flexibility allocation that optimizes the expected size of a maximum matching. Our findings reveal that flexibility allocation is a first-order issue: for a given flexibility budget, the resulting matching size can vary greatly depending on how the budget is allocated. Moreover, even in the simple and symmetric settings we study, the quest for the optimal allocation is complicated. In particular, easy and costly mistakes can be made if the flexibility decisions on the demand and supply sides are optimized independently (e.g., by two different teams in the company), rather than jointly. To guide the search for optimal flexibility allocation, we uncover two effects - flexibility cannibalization and flexibility asymmetry - that govern when the optimal design places the flexibility budget only on one side or equally on both sides. In doing so we identify the study of two-sided flexibility as a significant aspect of platform efficiency. 2024-04-06T19:04:44Z Daniel Freund Sébastien Martin Jiayu Kamessi Zhao http://arxiv.org/abs/2603.21032v1 Integrative Predictor-Dependent Learning of Network Data and Spatially Correlated Nodal Attributes for Multimodal Brain Imaging in Aging 2026-03-22T03:05:53Z This article introduces a predictor-dependent joint modeling framework for network data obtained from multiple subjects over a shared set of nodes with spatial co-ordinates and spatially correlated nodal attributes. The framework is highly flexible, allowing concurrent inference on nodes significantly associated with a predictor, spatial associations of nodal attributes and the regression relationship between a predictor and edge connecting a pair of nodes or a specific nodal attribute. Empirical results indicate a superior performance of the proposed approach due to accounting for network structure and spatial correlation in the data simultaneously. 
The methodology analyzes multimodal brain imaging data collected first-hand in the coauthor's Lifespan Cognitive and Motor Neuroimaging Laboratory, with a focus on integrating structural and functional information. It examines brain connectivity, represented as a connectome network across regions of interest (ROIs) derived from functional magnetic resonance imaging (fMRI), while also incorporating ROI-specific attributes obtained from structural MRI data, for each subject. Subject-specific aging-related features and spatial locations of ROIs are incorporated in the analysis. This framework facilitates robust inference on the associations between predictors and brain connectivity patterns, the spatial relationships among ROI-specific attributes, and the regression relationships involving edges or ROI-specific attributes with aging-related predictors. By integrating these diverse data sources, the approach provides a deeper understanding of the complex interplay between brain structure, function, aging-related changes, and external predictors. As a model-based Bayesian approach, it provides uncertainty quantification for all inferences, offering robust and reliable results, particularly in scenarios with limited sample size. 2026-03-22T03:05:53Z 38 pages Jose Rodriguez-Acosta Sharmistha Guha Jessica Bernard Thamires Magalhaes Kaitlin McOwen http://arxiv.org/abs/2603.20980v1 From Causal Discovery to Dynamic Causal Inference in Neural Time Series 2026-03-21T23:53:53Z Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that the underlying causal network is known a priori - an assumption rarely satisfied in real-world domains where causal structure is uncertain, evolving, or only indirectly observable. This limits the applicability of dynamic causal inference in many scientific settings. 
We propose Dynamic Causal Network Autoregression (DCNAR), a two-stage neural causal modeling framework that integrates data-driven causal discovery with time-varying causal inference. In the first stage, a neural autoregressive causal discovery model learns a sparse directed causal network from multivariate time series. In the second stage, this learned structure is used as a structural prior for a time-varying neural network autoregression, enabling dynamic estimation of causal influence without requiring pre-specified network structure. We evaluate the scientific validity of DCNAR using behavioral diagnostics that assess causal necessity, temporal stability, and sensitivity to structural change, rather than predictive accuracy alone. Experiments on multi-country panel time-series data demonstrate that learned causal networks yield more stable and behaviorally meaningful dynamic causal inferences than coefficient-based or structure-free alternatives, even when forecasting performance is comparable. These results position DCNAR as a general framework for using AI as a scientific instrument for dynamic causal reasoning under structural uncertainty. 2026-03-21T23:53:53Z 14 pages, 4 figures Valentina Kuskova Dmitry Zaytsev Michael Coppedge http://arxiv.org/abs/2603.20962v1 Integrative Learning of Dynamically Evolving Multiplex Graphs and Nodal Attributes Using Neural Network Gaussian Processes with an Application to Dynamic Terrorism Graphs 2026-03-21T22:01:29Z Exploring the dynamic co-evolution of multiplex graphs and nodal attributes is a compelling question in criminal and terrorism networks. This article is motivated by the study of dynamically evolving interactions among prominent terrorist organizations, considering various organizational attributes like size, ideology, leadership, and operational capacity. 
Statistically principled integration of multiplex graphs with nodal attributes is significantly challenging due to the need to leverage shared information within and across layers, account for uncertainty in predicting unobserved links, and capture temporal evolution of node attributes. These difficulties increase when layers are partially observed, as in terrorism networks where connections are deliberately hidden to obscure key relationships. To address these challenges, we present a principled methodological framework to integrate the multiplex graph layers and nodal attributes. The approach employs time-varying stochastic latent factor models, leveraging shared latent factors to capture graph structure and its co-evolution with node attributes. Latent factors are modeled using Gaussian processes with an infinitely wide deep neural network-based covariance function, termed neural network Gaussian processes (NN-GP). The NN-GP framework on latent factors exploits the predictive power of Bayesian deep neural network architecture while propagating uncertainty for reliability. Simulation studies highlight superior performance of the proposed approach in achieving inferential objectives. The approach, termed the dynamic joint learner, enables predictive inference (with uncertainty) of diverse unobserved dynamic relationships among prominent terrorist organizations and their organization-specific attributes, as well as clustering behavior in terms of friend-and-foe relationships, which could be informative in counter-terrorism research. 2026-03-21T22:01:29Z 59 pages Jose Rodriguez-Acosta Sharmistha Guha Lekha Patel Kurtis Shuler http://arxiv.org/abs/2603.22344v1 Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study 2026-03-21T21:39:55Z Large language model (LLM)-assisted literature retrieval may lead to erroneous references, but these errors have not been rigorously quantified. 
Therefore, we quantitatively assess errors in reference retrieval of widely used free-version LLM platforms and identify the factors associated with retrieval errors. We evaluated 2,000 references retrieved by 5 LLMs (Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4) for 40 randomly selected original articles (10 per journal) published Jan. 2024 to July 2025 from British Medical Journal (BMJ), Journal of the American Medical Association, and The New England Journal of Medicine (NEJM). Primary outcomes were a multimetric score ratio combining validity of digital object identifier, PubMed ID, Google-Scholar link, and relevance; and complete miss rate (proportion of references failing all applicable metrics). Multivariable regression was used to examine independent associations. LLM platforms completely failed to retrieve correct reference data 47.8% of the time. The average score ratio of the 5 LLM platforms was 0.29 (standard deviation, 0.35; range, 0-1.25), with a higher score ratio indicating higher accuracy in retrieving relevant references and correct bibliographic data. The highest and lowest accuracies were achieved by Grok (0.57) and Gemini (0.11), respectively. Compared with BMJ, NEJM articles had lower score ratios and higher complete miss rates. Multivariable analysis shows LLM platforms and journals were independently associated with score ratios and complete miss rate, respectively. We show modest overall performance of LLMs and significant variability in retrieval accuracy across platforms and journals. LLM platforms and journals are associated with LLM performance in retrieving medical literature. Bibliographic data should be carefully reviewed when using LLM-assisted literature retrieval. 
2026-03-21T21:39:55Z Jenny Gao College of Arts and Science, New York University, New York, NY Yongfeng Zhang Department of Computer Sciences, School of Arts & Sciences, Rutgers University, Piscataway, NJ Mary L Disis UW Medicine Cancer Vaccine Institute University of Washington, Seattle, WA Lanjing Zhang Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ Department of Pathology, Princeton Medical Center, Plainsboro, NJ Rutgers Cancer Institute, New Brunswick, NJ http://arxiv.org/abs/2601.22481v2 Changepoint Detection As Model Selection: A General Framework 2026-03-21T17:29:45Z This dissertation presents a general framework for changepoint detection based on L0 model selection. The core method, Iteratively Reweighted Fused Lasso (IRFL), improves upon the generalized lasso by adaptively reweighting penalties to enhance support recovery and minimize criteria such as the Bayesian Information Criterion (BIC). The approach allows for flexible modeling of seasonal patterns, linear and quadratic trends, and autoregressive dependence in the presence of changepoints. Simulation studies demonstrate that IRFL achieves accurate changepoint detection across a wide range of challenging scenarios, including those involving nuisance factors such as trends, seasonal patterns, and serially correlated errors. The framework is further extended to image data, where it enables edge-preserving denoising and segmentation, with applications spanning medical imaging and high-throughput plant phenotyping. Applications to real-world data demonstrate IRFL's utility. In particular, analysis of the Mauna Loa CO2 time series reveals changepoints that align with volcanic eruptions and ENSO events, yielding a more accurate trend decomposition than ordinary least squares. Overall, IRFL provides a robust, extensible tool for detecting structural change in complex data. 
2026-01-30T02:44:34Z Michael Grantham Xueheng Shi Bertrand Clarke http://arxiv.org/abs/2603.20853v1 Correcting for Missing Data When Evaluating Surrogate Markers in a Clinical Trial 2026-03-21T15:15:00Z Evaluating treatment effects is critical in clinical trials but sometimes involves lengthy, invasive, or costly follow-up procedures. In these cases, surrogate markers, which provide intermediate measures of the long-term treatment effect, allow clinicians to obtain results faster and more efficiently than would have otherwise been possible. Prior to adoption, it is vital that the utility of surrogate markers (i.e., their ability to capture the treatment effect on the primary outcome) is statistically validated. Many frameworks for evaluating surrogate markers have been proposed, but they do not account for missing data. Instead, they rely on complete cases (the subset of patients without missing data), which can be inefficient and biased. To improve on this, we propose methods to accommodate missing data in nonparametric and parametric surrogate evaluation via inverse probability weighting (IPW) and semiparametric maximum likelihood estimation (SMLE). Through simulation studies, we demonstrate that the proposed methods remain unbiased under a broader range of missing data mechanisms than complete case analysis and can help retain the statistical precision of the full trial. We illustrate their practical utility through an application to a diabetes clinical trial. Moreover, our missing data corrections have complementary strengths with respect to computational ease, robustness, and statistical efficiency. All methods are implemented in the MissSurrogate R package. 2026-03-21T15:15:00Z 19 pages, 4 tables, 3 figures, R package and GitHub repository with simulation code Sarah C. Lotspeich P. D. Anh. 
Nguyen Layla Parast http://arxiv.org/abs/2603.20727v1 Compositional regression using principal nested spheres 2026-03-21T09:22:49Z Regression with compositional responses is challenging due to the nonlinear geometry of the simplex and the limitations of Euclidean methods. We propose a regression framework for manifold-valued data based on mappings to statistically tractable intermediate spaces. For compositional data, responses are embedded in the positive orthant of the sphere and analysed using Principal Nested Spheres (PNS), yielding a cylindrical intermediate space with a circular leading score and Euclidean higher-order scores. Regression is performed in this intermediate space and fitted values are mapped back to the simplex. A simulation study demonstrates good performance of PNS-based regression. An application to environmental chemical exposure data illustrates the interpretability and practical utility of the method. 2026-03-21T09:22:49Z 19 pages, 8 figures, 1 table Mymuna Monem Ian L. Dryden Florence George Natalia Soares Quinete http://arxiv.org/abs/2410.09027v2 Variance reduction combining pre-experiment and in-experiment data 2026-03-21T07:50:39Z Online controlled experiments (A/B testing) are fundamental to data-driven decision-making in many companies. Improving the sensitivity of these experiments under fixed sample size constraints requires reducing the variance of the average treatment effect (ATE) estimator. Existing variance reduction techniques such as CUPED and CUPAC use pre-experiment data, but their effectiveness depends on how predictive those data are for outcomes measured during the experiment. In-experiment data are often more strongly correlated with the outcome, but using arbitrary post-treatment variables can introduce bias. In this paper, we propose a general, robust, and scalable framework that combines both pre-experiment and in-experiment data to achieve variance reduction. 
Our framework is simple, interpretable, and computationally efficient, making it practical for real-world deployment. We develop the asymptotic theory of the proposed estimator and provide consistent variance estimators. Empirical results from multiple online experiments conducted at Etsy demonstrate substantial additional variance reduction over the current pipeline, even when incorporating only a few post-treatment covariates. These findings underscore the effectiveness of our framework in improving experimental sensitivity and accelerating data-driven decision-making. 2024-10-11T17:45:29Z Accepted to the 5th Conference on Causal Learning and Reasoning (CLeaR), 2026 Zhexiao Lin Pablo Crespo http://arxiv.org/abs/2601.10878v2 Optimal and Unbiased Fluxes from Up-the-Ramp Detectors under Variable Illumination 2026-03-21T00:22:31Z Near-infrared (NIR) detectors -- which use non-destructive readouts to measure time-series counts-per-pixel -- play a crucial role in modern astrophysics. Standard NIR flux extraction techniques were developed for space-based observations and assume that source fluxes are constant over an observation. However, ground-based telescopes often see short-timescale atmospheric variations that can dramatically change the number of photons arriving at a pixel. This work presents a new statistical model that shares information between neighboring spectral pixels to characterize time-variable observations and extract unbiased fluxes with optimal uncertainties. We generate realistic synthetic data using a variety of flux and amplitude-of-time-variability conditions to confirm that our model recovers unbiased and optimal estimates of both the true flux and the time-variable signal. We find that the time-variable model should be favored over a constant-flux model when the observed count rates change by more than 3.5%. Ignoring time variability in the data can result in flux-dependent, unknown-sign biases that are as large as ~120% of the flux uncertainty. 
Using real APOGEE spectra, we find empirical evidence for approximately wavelength-independent, time-dependent variations in count rates with amplitudes much greater than the 3.5% threshold. Our model can robustly measure and remove the time-dependence in real data, improving the quality of data-model comparison. We show several examples where the observed time-dependence quantitatively agrees with independent measurements of observing conditions, such as variable cloud cover and seeing. 2026-01-15T22:15:13Z 22 pages, 20 figures Bowen Li Kevin A. McKinnon Andrew K. Saydjari Conor Sayres Gwendolyn M. Eadie Andrew R. Casey Jon A. Holtzman Timothy D. Brandt Jose G. Fernandez-Trincado http://arxiv.org/abs/2603.20546v1 On the Limits of Prediction: Forecastability Profiles and Information Decay in Time Series 2026-03-20T22:28:16Z Forecasting accuracy is bounded by the information available about the future. This paper makes that statement precise using information-theoretic tools. Under logarithmic loss, the expected performance of any probabilistic forecast decomposes into two parts: an irreducible component and an approximation component. The irreducible term is the conditional entropy of the future given the available information, while the approximation term is the divergence between the true conditional distribution and the forecasting method. The gap between this conditional-entropy limit and an unconditional baseline is exactly the mutual information between the future observation and the declared information set. This leads to a definition of forecastability as the maximum achievable reduction in expected log loss. Evaluated across horizons, forecastability forms a profile that describes how predictive information varies with lead time. 
This profile reflects the dependence structure of the process and need not be monotone: predictive information may be concentrated at particular lags, including seasonal horizons, even when intermediate horizons contain little useful signal. From this profile, the paper defines the informative horizon set: the horizons at which forecastability exceeds a practical threshold. At horizons not in this set, the achievable gain over the unconditional baseline is necessarily small, regardless of the forecasting method used. The framework therefore separates what is learnable from what is not, and distinguishes limits imposed by the data from errors introduced by modelling. The result is a pre-modelling diagnostic that identifies where meaningful prediction is feasible before any model is chosen, providing a principled basis for allocating modelling effort across forecast horizons. 2026-03-20T22:28:16Z Peter Maurice Catt http://arxiv.org/abs/2603.03004v2 eTFCE: Exact Threshold-Free Cluster Enhancement via Fast Cluster Retrieval 2026-03-20T19:03:41Z Threshold-free cluster enhancement (TFCE) is a popular method for cluster extent inference but is computationally intensive. Existing TFCE implementations often rely on a discretized approximation that introduces numerical errors. Also, we identified a long-standing scaling error in the FSL implementation of TFCE (version 6.0.7.19 and earlier). As an alternative implementation, we present eTFCE, an efficient framework that computes exact TFCE scores using an optimized cluster retrieval algorithm, which, though exact, reduces computation time by approximately 50% compared to standard approximated implementations. In addition, the proposed framework enables simultaneous computation of TFCE and generalized cluster statistics, formulated similarly to TFCE, within a single nonparametric run, with negligible additional computational cost. 
This, in turn, facilitates systematic method comparisons, and enables a more complete characterization of spatial activation patterns. As a result, eTFCE establishes a mathematically exact and computationally efficient framework for comprehensive and informative nonparametric inference in neuroimaging. 2026-03-03T13:56:57Z Withdrawn by the authors after identifying aspects of the analysis and interpretation that require further validation. To avoid potentially misleading readers, we chose to withdraw the manuscript while conducting additional analyses Xu Chen Wouter Weeda Thomas E. Nichols Jelle J. Goeman http://arxiv.org/abs/2512.04366v7 Sequential Randomization Tests Using e-values: Applications for trial monitoring 2026-03-20T18:09:11Z Sequential monitoring of randomized trials traditionally relies on parametric assumptions or asymptotic approximations. We discuss a family of nonparametric sequential tests - collectively called e-RT - for binary, deaths-only, continuous, time-to-event, and multi-state endpoints. All variants derive validity solely from the randomization mechanism. Using a betting framework, each test constructs a test martingale by sequentially wagering on treatment assignments given observed outcomes. Under the null hypothesis of no treatment effect, the expected wealth cannot grow, guaranteeing anytime-valid Type I error control regardless of stopping rule. We prove validity for each variant, present simulation studies demonstrating calibration and power, and discuss the principled asymmetry in betting strategies across outcome types. These methods provide a conservative, assumption-free complement to model-based sequential analyses. 2025-12-04T01:24:17Z Fernando G Zampieri http://arxiv.org/abs/2504.04143v4 The Rhythm of Aging: Stability and Drift in the Individual Rate of Senescence 2026-03-20T17:24:46Z Human aging is marked by a steady rise in the risk of dying with age, a process demographers call senescence. 
Over the past century, life expectancy has risen dramatically, but is this because we are aging slower, or simply starting it later? Vaupel hypothesizes that the pace at which individuals age may be constant, with gains in longevity coming from the delayed onset of senescence rather than its slowing down. We test this idea using a new framework that decomposes the pace of senescence into three components: a biological baseline, a long-term trend, and the cumulative impact of period shocks. Applying this to cohort mortality data above age 80 from 12 countries, we find that once period shocks are accounted for, there is no statistical evidence of a long-term trend, consistent with Vaupel's hypothesis. Analyses using lower starting ages yield the same qualitative conclusion. Rather than indicating a change in the process that drives senescence, these variations are consistent with echoes of shared historical events. These results suggest that while longevity has shifted, the rhythm of human aging may be conserved. 2025-04-05T11:31:02Z Silvio Cabral Patricio