https://arxiv.org/api/fiQ3qsiRoIAFVcfD/QlhfcKh8+E
2026-06-18T23:18:18Z
2357146515

http://arxiv.org/abs/2505.05670v3
Estimation and Inference in Boundary Discontinuity Designs: Location-Based Methods
Updated: 2026-05-14T14:08:17Z
Boundary discontinuity designs are used to learn about causal treatment effects along a continuous assignment boundary that splits units into control and treatment groups according to a bivariate location score. We analyze location-based local polynomial treatment effect estimators that directly employ the bivariate score of each unit. We develop pointwise and uniform estimation and inference methods for the \textit{Boundary Average Treatment Effect Curve} (BATEC), as well as for two aggregated causal parameters: the \textit{Weighted Boundary Average Treatment Effect} (WBATE) and the \textit{Largest Boundary Average Treatment Effect} (LBATE). Our results cover both sharp and fuzzy (imperfect compliance) designs. We illustrate the methods with an empirical application, and provide companion general-purpose software. The supplemental appendix includes additional substantive theoretical results, methodological details, and simulation evidence.
Published: 2025-05-08T21:59:05Z
Authors: Matias D. Cattaneo, Rocio Titiunik, Ruiqi Rae Yu

http://arxiv.org/abs/2506.12296v3
Finite-sample bias-variance tradeoff with variables related to trial participation inserted into causal forest models for ensuring generalizability
Updated: 2026-05-14T13:59:31Z
Estimating conditional average treatment effects (CATE) from randomized controlled trials (RCTs) and generalizing them to broader populations is essential for personalizing treatment rules but is complicated by selection bias due to trial participation and potentially high dimensional covariates. We evaluated finite sample bias variance tradeoff for Causal Forest based CATE estimation strategies to address the selection bias. 
Identification theory suggests unbiased CATE estimation is possible when covariates related to trial participation are included in CATE estimating models. However, simulation studies demonstrated that, under realistic RCT sample sizes, variance inflation from high dimensional covariates often outweighed modest bias reduction. In our data generating process that defines individual treatment effect (ITE) in source population and selected trial samples, including more than 3 covariates related to participation in causal forest substantially degraded precision unless sample sizes were large. In contrast, inverse probability weighting (IPW) based methods consistently improved performance across scenarios. Application to an RCT of omega 3 fatty acids and coronary heart disease illustrated how IPW shifts CATE estimates toward source population effects and refines heterogeneity assessments. Our findings highlight that including trial-selection variables for CATE estimating models may inflate estimator variance and reduce ITE prediction performance in applications using medical RCTs. Addressing selection bias separately (e.g. through IPW) would be a reasonable strategy.
Published: 2025-06-14T01:17:59Z
Comment: 4 figures
Authors: Rikuta Hamaya, Etsuji Suzuki, Konan Hara

http://arxiv.org/abs/2510.11177v2
Policy Robustness & Uncertainty in Model-based Decision Support for the Energy Transition
Updated: 2026-05-14T12:52:14Z
Climate policy modelling is a key tool for assessing mitigation strategies in complex systems, where uncertainty is inherent and unavoidable. We present a general methodology for extensive uncertainty analysis in this field. While other studies have performed uncertainty analyses, few apply methods from the field of Uncertainty Quantification, which are commonly used in other modelling disciplines. We show how emulators can identify key uncertainties in modelling frameworks and demonstrate a novel policy analysis previously restricted by computational cost and limited representation of uncertainty. 
We apply this methodology to FTT:Power to explore uncertainties in the electricity system transition both globally and in India to assess the robustness of mitigation strategies to a wide range of policy and techno-economic scenarios. This approach results in much larger uncertainties in transition outcomes than commonly represented, but policy design can be shaped to mitigate this. Globally, our results indicate transition uncertainty is dominated by average rates of renewables cannibalisation, construction times and grid connection lead times, outweighing regional price policies, including policy reversals in the US. Solar PV appears most resilient due to low costs, though still sensitive to infrastructure constraints and cannibalisation. Onshore wind is more exposed to a range of uncertainties. In India, we find evidence that policy packages including partial phase-out instruments have greater robustness to key uncertainties, although longer lead times still hinder policy goals. Our results suggest that enabling policy and regulating fossil fuels are critical for robust power sector transitions.
Published: 2025-10-13T09:09:47Z
Authors: Ian J. Burton, Femke J. M. M. Nijsse, James M. Salter

http://arxiv.org/abs/2605.14647v1
Multiscale Topological Inference for Marked Point Processes via Euler Characteristic Envelopes
Updated: 2026-05-14T10:05:25Z
The statistical analysis of marked point processes requires disentangling complex spatial arrangements from attribute-dependent interactions. While classical summary statistics are effective for second-order dependencies, they frequently fail to capture higher-order topological structures and non-linear interactions between marks and space. In this work, we propose a novel multiscale topological inference framework for marked point processes by integrating mark-weighted filtrations with Euler Characteristic envelopes. 
We redefine the underlying metric space using an exponential mark-weighted distance, which modulates connectivity based on attribute similarity, effectively accelerating the merger of connected components among homophilic neighbors. To ensure rigorous statistical inference, we apply non-parametric global envelope tests to the resulting Euler Characteristic Curves, allowing for formal hypothesis testing against the null model of random labeling. Furthermore, we introduce a local decomposition of the topological signal via Z-scores at the critical filtration scale to identify and localize structural hubs and topological barriers. Systematic simulations across various scenarios demonstrate the framework's high specificity and sensitivity to attribute-space dependencies while remaining robust against purely geometric effects. This methodology provides a comprehensive and interpretable toolkit for identifying, quantifying, and localizing complex structural dependencies in marked spatial data, bridging the gap between topological data analysis and classical point process statistics.
Published: 2026-05-14T10:05:25Z
Authors: Matthias Eckardt, Mehdi Moradi

http://arxiv.org/abs/2605.14632v1
DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes
Updated: 2026-05-14T09:44:11Z
Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and cross-sequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. 
Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates discrete hidden states using reinforcement learning, reducing the reliance on predefined transition structures and enabling flexible adaptation to diverse temporal dynamics. In particular, DRL-STAF mitigates the state-space explosion encountered by typical multivariate HMM-based methods. Extensive experiments demonstrate that DRL-STAF outperforms HMM variants, standalone deep learning models, and existing DL-HMM hybrids in most cases, while also providing reliable hidden-state estimates.
Published: 2026-05-14T09:44:11Z
Authors: Manrui Jiang, Jingru Huang, Yong Chen, Chen Zhang

http://arxiv.org/abs/2510.15141v4
Manifold Dimension Estimation via Local Graph Structure
Updated: 2026-05-14T09:32:37Z
Most existing manifold dimension estimators rely on the assumption that the underlying manifold is locally flat within the neighborhoods under consideration. More recently, curvature-adjusted principal component analysis (CA-PCA) has emerged as a powerful alternative by explicitly accounting for the manifold's curvature. Motivated by these ideas, we propose a manifold dimension estimation framework that captures the local graph structure of the manifold through regression on local PCA coordinates. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art approaches.
Published: 2025-10-16T20:59:46Z
Authors: Zelong Bi, Pierre Lafaye de Micheaux

http://arxiv.org/abs/2605.30360v1
Polynomial Histograms for Memory-Efficient Representation of Long-tailed System Distributions
Updated: 2026-05-14T01:19:06Z
Distributed systems must frequently keep track of many different types of performance metrics across many different computers. 
For example, the latency distribution of certain operations may be computed for a large combination of computers, users, and operations. These empirical distributions need to be collected at minimal expense on the individual software components, efficiently aggregated across multiple dimensions, and stored in a compact representation for a variety of downstream data analysis applications.
We describe an information loss metric for binned data that allows us to optimize cost of information loss from different histogram representations. We explore the use of polynomial histograms where each bin of a histogram is annotated with moments of the underlying distribution in that bin. These polynomial histograms are compared to traditional histograms using the same storage cost for additional bins instead of annotations in each bin. We describe an application of these techniques for file system metrics for a large production system, and analytically characterize when polynomial histograms offer more information at lower cost.
Published: 2026-05-14T01:19:06Z
Comment: 2014 Preprint
Authors: Murray Stokely, Tim Hesterberg, Arif Merchant, Nate Coehlo

http://arxiv.org/abs/2605.14056v1
An MCMC-Based Method for Dynamic Causal Modeling of Effective Connectivity in Functional MRI
Updated: 2026-05-13T19:28:46Z
Effective connectivity analysis in functional magnetic resonance imaging (fMRI) studies directional interactions among brain regions and experimental stimuli. Dynamic causal modeling (DCM) is a widely used method to estimate effective connectivity, based on a state-space representation consisting of a latent neural signal model and an observation model transforming the neural signal into the observed blood-oxygen-level-dependent (BOLD) response. A standard DCM combines ordinary differential equation (ODE) dynamics for the latent signal with a complex neural-hemodynamic system for the observation model, and typically uses variational Bayes for parameter estimation. While physically well-motivated, this approach can lead to practical challenges such as inexact solutions and underestimated uncertainty. We introduce Canonical DCM (CDCM), a Markov chain Monte Carlo (MCMC)-based method that adopts a simpler observation model and the No-U-Turn Sampler for posterior sampling. 
The simpler observation model admits a piecewise analytic solution to the neural ODE, increasing computational efficiency and enabling explicit derivation of sufficient conditions for parameter identifiability. The results indicate that CDCM provides reliable uncertainty quantification and consistent estimation of parameters related to experimental inputs for simulated and real data. We use publicly available data from the Wellcome Centre for Human Neuroimaging and the Human Connectome Project (HCP) to benchmark CDCM against standard DCM methods and examine replicability of estimated connectivity patterns in small- and large-scale neuroimaging settings.
Published: 2026-05-13T19:28:46Z
Authors: Kaitlyn R. Fales, Hyebin Song, Nicole A. Lazar

http://arxiv.org/abs/2605.14000v1
Recent advances in statistical methodology applied to the Hjort liver index time series (1859-2012) and associated influential factors
Updated: 2026-05-13T18:08:27Z
Certain recent advances in statistical methodology have promising potential for fruitful use in general biology and the fisheries sciences. This paper reviews and discusses some of the relevant themes, including accurate modelling via focused model selection techniques, dynamic goodness-of-fit testing of processes evolving over time, finding break points for phenomena experiencing changes, prediction uncertainty, and optimal combination of information across diverse sources via confidence distributions. The methods are illustrated for the Hjort liver quality index time series. Its roots lie in the classic Hjort (`Fluctuations in the Great Fisheries of Northern Europe, Viewed in the Light of Biological Research', 1914), where liver quality of the Atlantic cod {\it (Gadus morhua)} for 1880--1912 is reported on and studied, along with related factors, making it one of the first teleost time series ever published. Diligent work by Kjesbu et al. 
(`Making use of Johan Hjort's `unknown' legacy: reconstruction of a 150-year coastal time-series on northeast Arctic cod (Gadus morhua) liver data reveals long-term trends in energy allocation patterns', 2014), involving both archival and calibration efforts, has extended the series both backwards and forwards in time, to 1859--2012, yielding one of the longest time series of marine science. Our study offers a detailed examination of this series and how it relates to and interacts with associated factors, including Kola winter temperatures, length distribution parameters, cod mortality, and a certain index related to availability of food.
Published: 2026-05-13T18:08:27Z
Comment: 16 pages, 19 figures. This is the authors' manuscript, 2016, published in modified form in Canadian Journal of Fisheries and Aquatic Sciences 2016, vol. 73, pages 279-295, part of the special issue based on the Johan Hjort Symposium on Recruitment Dynamics and Stock Variability, Bergen, Norway, October 2014
Journal reference: Canadian Journal of Fisheries and Aquatic Sciences, 2016, vol. 73, pages 279-295
Authors: Gudmund H. Hermansen, Nils Lid Hjort, Olav S. Kjesbu

http://arxiv.org/abs/2605.13742v1
Macroscopic Activity-Based Modeling of Urban Active Mobility
Updated: 2026-05-13T16:19:21Z
This paper develops a macroscopic, activity-based model of urban active mobility using nonintrusive sensor data. It introduces attendance functions to describe spatio-temporal travel patterns between activities and formulates the disaggregation of aggregated counts as a statistical inference problem. Counts are modeled as Poisson variables, and unknown subpopulation sizes are estimated via maximum likelihood, with theoretical guarantees and an efficient EM algorithm for computation. 
Grounded in a microscopic stochastic model, the framework offers a scalable and privacy-preserving approach to analyzing urban soft mobility dynamics.
Published: 2026-05-13T16:19:21Z
Comment: 29 pages
Authors: Romain Azaïs, Adrien Marion, Florian Patout

http://arxiv.org/abs/2605.13926v1
Optimising football transfer strategy under budget constraints: A weighted multi-criteria approach
Updated: 2026-05-13T15:27:51Z
The football transfer market is a complex, dynamic environment in which clubs compete to acquire players who strengthen their squads. While several frameworks estimate a player's worth, a comprehensive approach that captures both squad optimisation and transfer market dynamics remains limited. In this paper, we propose a quantitative framework for optimising football transfer strategy under budget constraints, integrated with a competitive bidding paradigm. Using data from professional football leagues, we construct player performance and transfer price models using linear mixed-effects frameworks that incorporate player characteristics, recent performance, team context, and league effects. The predicted ratings and estimated transfer prices are then integrated into a weighted multi-criteria constrained optimisation framework that determines a club's transfer activities at the end of the season. Finally, these optimal transfer decisions are embedded within an independent private-value auction model with a random reserve price to analyse market behaviour when multiple teams compete for the same player. 
We illustrate our approach using the 2018-19 season of the English Premier League to demonstrate its ability to capture transfer-market dynamics.
Published: 2026-05-13T15:27:51Z
Authors: Tathagata Basu, Soudeep Deb, Rishideep Roy

http://arxiv.org/abs/2605.13660v1
Improving ecological inference and uncertainty quantification from camera trap data through the fusion of AI confidences and manual annotations
Updated: 2026-05-13T15:18:03Z
Camera traps have become a core tool in ecological research, enabling large-scale, noninvasive monitoring of wildlife populations and behavior. By automatically recording animals as they pass within view, these devices generate massive image datasets with minimal field effort. Yet this data richness introduces a new bottleneck when translating the images into usable information due to the time and effort required for human annotation. Recently, artificial intelligence (AI) has been integrated into the workflow to improve this efficiency. However, the data procured from AI approaches are of a different nature, necessitating new statistical methods in order to obtain inference, make predictions, and quantify uncertainty. We propose a new Bayesian hierarchical data-fusion model which combines the strengths of human annotations and AI predictions. The benefits of our approach are an ability to provide uncertainty quantification as well as improved inference and prediction power, which we demonstrate using a simulation study. We apply our model to an AI analysis of the body condition of white-tailed deer (Odocoileus virginianus) from camera trap images from North Carolina to study the relationship between health and their environment. We find that bucks in rut have higher body condition than other deer and that green, open habitats are correlated with high body condition. Our new model derived novel ecological inference compared to a traditional approach using the same data.
Published: 2026-05-13T15:18:03Z
Comment: 38 pages, 8 figures
Authors: Adira Cohen, Erin M. Schliep, Roland Kays, Mohammad Alyetama, Matthew Snider

http://arxiv.org/abs/2605.13446v1
Scenario generation of intraday electricity price paths for optimal trading in continuous markets
Updated: 2026-05-13T12:41:07Z
Continuous intraday electricity markets play an increasingly important role in short-term trading and balancing, yet decision-making under rapidly evolving price dynamics remains challenging. This paper proposes a comprehensive framework for ensemble forecasting of intraday electricity price trajectories and their translation into adaptive trading decisions. Building on a corrected Support Vector Regression model, the approach extends point predictions to probabilistic trajectory forecasts by introducing scenario generation based on forecast errors of fundamental variables and proposing a novel Support Vector Sorting procedure for the efficient selection of representative scenarios. The framework is evaluated using transaction level data from the German intraday continuous market. Empirical results show improvements over benchmark methods in both statistical and economic terms. Fundamental scenarios enhance median trajectory accuracy but produce more concentrated predictive distributions, while historical simulation with scenario selection better captures tail risk. From an economic perspective, ensemble-based forecasts outperform naive benchmarks across most of the trading strategies. Dynamic updating through scenario reweighting further improves profitability with limited impact on downside risk. 
Overall, the results demonstrate that combining kernel-based learning with scenario driven uncertainty and adaptive updating provides a flexible and effective approach for forecasting and trading in continuous electricity markets.
Published: 2026-05-13T12:41:07Z
Authors: Andrzej Puć, Joanna Janczura

http://arxiv.org/abs/2507.09983v2
Gradient boosted multi-population mortality modelling with high-frequency data
Updated: 2026-05-13T12:23:22Z
High-frequency mortality data have attracted growing attention, but their use has largely been confined to specific applications rather than general modelling and forecasting. Such data pose new challenges to traditional mortality models due to pronounced seasonal patterns and short-term fluctuations. To address these challenges and produce more accurate forecasts with the high-frequency mortality data, this paper introduces a novel integration of gradient boosting techniques into traditional stochastic mortality models under a multi-population setting. Our key innovation lies in using the Li and Lee model as the weak learner within the gradient boosting framework, replacing conventional decision trees. Empirical studies are conducted using weekly mortality data from 30 countries (Human Mortality Database, 2015-2019). Empirical evidence highlights that the proposed methodology not only enhances model fit by accurately capturing underlying mortality trends and seasonal patterns, but also achieves superior forecast accuracy, compared to the benchmark models. We also investigate a key challenge in multi-population mortality modelling: how to select appropriate sub-populations with sufficiently similar mortality experiences. A comprehensive clustering exercise is conducted based on mortality improvement rates and seasonal strength. 
The empirical results demonstrate that our proposed model maintains strong forecast accuracy across different clustering configurations, thereby reducing the need for extensive data preprocessing.
Published: 2025-07-14T07:00:27Z
Authors: Ziting Miao, Han Li, Yuyu Chen

http://arxiv.org/abs/2605.13388v1
Toward a practical handbook for choosing among causal inference methods in non-randomized studies with binary outcomes: A simulation study for applied researchers
Updated: 2026-05-13T11:46:47Z
Applied researchers in biomedicine and related fields are often interested in estimating the causal effect of a treatment or intervention. Although randomized clinical trials are considered the gold standard for establishing causal effects, they are not always feasible, and real-world data may represent the only available source of evidence. In such settings, causal effects must be estimated using statistical methods applied to observational data. Over the last few decades, modern causal inference methods based on the potential outcomes framework have emerged as useful tools in this field. However, many such techniques exist, and their performance depends on factors such as sample size, the proportion of treated patients, the proportion of patients experiencing the outcome, the magnitude of the treatment effect, the target estimand, and potential violations of the fundamental assumptions of causal inference. Given the wide range of available methods, selecting an appropriate approach can be challenging for applied researchers. This study uses a large-scale simulation experiment to address this issue and provide researchers with a guide in the form of a handbook for a binary treatment and a binary outcome. In particular, we test four popular statistical techniques: propensity score matching (full matching), inverse probability weighting, G-computation, and targeted maximum likelihood estimation. 
The proposed handbook is applied to two real-world datasets to assess its practical utility: one comprising vulnerable patients with mild COVID-19 (n=534 patients and more than 50% treated), and another of patients undergoing colorectal surgery (n=3635 patients and about 20% treated).
Published: 2026-05-13T11:46:47Z
Comment: 21 pages, 4 figures. Code available at https://github.com/aaurensanz/code-causal-inference-comparison
Authors: Adrián Aurensanz-Crespo, Cristóbal M Rodríguez-Leal, Rosario Susi, Jorge Castillo-Mateo, Jesús Asín, José M Ramírez, Teresa Pérez
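
Several of the abstracts above rely on inverse probability weighting for a binary treatment and a binary outcome. The following is a minimal illustrative sketch of that general technique, not the code of any listed paper. It assumes the propensity scores are already known (in practice they are estimated, for example by logistic regression); the function name `ipw_ate` and the toy data are hypothetical.

```python
from typing import Sequence


def ipw_ate(
    treated: Sequence[int],
    outcome: Sequence[float],
    propensity: Sequence[float],
) -> float:
    """Normalized (Hajek) inverse-probability-weighted estimate of the
    average treatment effect. With a binary outcome this is a risk difference.

    Assumption: propensity[i] is P(treated = 1 | covariates of unit i),
    supplied by the caller rather than estimated here.
    """
    if not (len(treated) == len(outcome) == len(propensity)):
        raise ValueError("inputs must have equal length")

    w1_sum = w1_y = 0.0  # weight total and weighted outcome total, treated units
    w0_sum = w0_y = 0.0  # same quantities for control units
    for t, y, e in zip(treated, outcome, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        if t == 1:
            w = 1.0 / e  # treated units are weighted by 1 / e
            w1_sum += w
            w1_y += w * y
        else:
            w = 1.0 / (1.0 - e)  # control units are weighted by 1 / (1 - e)
            w0_sum += w
            w0_y += w * y

    if w1_sum == 0.0 or w0_sum == 0.0:
        raise ValueError("need at least one treated and one control unit")
    # Difference of the two weighted outcome means.
    return w1_y / w1_sum - w0_y / w0_sum


if __name__ == "__main__":
    # Hypothetical toy data: four units with made-up propensity scores.
    t = [1, 1, 0, 0]
    y = [1, 0, 1, 0]
    e = [0.8, 0.2, 0.8, 0.2]
    print(ipw_ate(t, y, e))
```

The normalized form divides by the sum of the weights within each arm rather than by the sample size, which keeps the two weighted means inside the range of the observed outcomes when some propensity scores are close to 0 or 1.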