http://arxiv.org/abs/2605.02326v2 Large-Scale Asset Selection via Metric Dependence with Enriched High Frequency Information 2026-05-10T11:29:59Z

Large-scale portfolio choice is highly sensitive to estimation error, making the preliminary asset selection essential in empirical implementation. Existing selection rules typically rely on scalar returns or low dimensional high frequency summaries, and thus discard intraday risk dynamics that may be relevant for risk adjusted allocation. We propose Metric Dependence Screening (MDS), an asset selection procedure that incorporates high frequency information as object valued data. Each asset day observation is represented as a point-curve object combining daily return with an intraday risk state curve, equipped with a weighted product metric that preserves both reward information and within day risk dynamics. MDS ranks assets by a Fréchet variation based dependence score, measuring how much a risk adjusted target explains the metric dispersion of the asset representations. This yields a simple two stage portfolio procedure: MDS first reduces the investable universe, and standard mean-variance or minimum variance allocation is then applied. We develop a target slicing estimator and establish concentration, sure selection, and rank consistency guarantees under $α$-mixing time series dependence and ultrahigh dimensionality. Simulations show that MDS performs well across both Euclidean and non-Euclidean settings. Using high frequency data for $2938$ Chinese A-share stocks from July 2023 to December 2025, we demonstrate that MDS improves out of sample portfolio performance over return based and scalar dependence based benchmarks, highlighting the value of preserving intraday risk dynamics.

2026-05-04T08:26:39Z Yangzhou Chen Shuaida He Xin Chen

http://arxiv.org/abs/2605.09291v1 dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models 2026-05-10T03:36:49Z

Discrete flow models (DFMs) are a class of flexible generative models for discrete data, and diffusion large language models (dLLMs) can be viewed as a special case with a specific choice of mixture path and a masked source distribution. While several recent works have explored reinforcement learning for dLLMs, its application to more general discrete flow models remains underexplored. In this work, we present discrete Flow-GRPO (dFlowGRPO), a unified reinforcement learning framework for discrete flow models that supports a broad family of probability paths and non-masked source distributions. We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model during reinforcement learning. We apply dFlowGRPO to FUDOKI, a recent multimodal discrete flow model, and evaluate it on both image generation and multimodal understanding tasks. Empirical results show that dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and achieves performance competitive with continuous flow-based models trained using FlowGRPO, while also demonstrating strong capabilities on understanding tasks.

2026-05-10T03:36:49Z Zhengyan Wan Yidong Ouyang Panwen Hu Qiang Sun

http://arxiv.org/abs/2604.08789v2 Quantifying the resilience benefits of undergrounding a circuit with utility data 2026-05-10T02:42:23Z

We leverage historical outage data to quantify the resilience benefits of undergrounding a circuit. The historical performance of the overhead circuit is compared to the performance if the circuit had been undergrounded in the past. The number of outages, customers affected, outage duration, and customer hours lost are used as metrics to quantify the benefits of undergrounding. Results show 75% and 78% reductions in customer hours lost per year for two selected circuits, as well as a significant reduction in the average number of outages and customers affected per year, highlighting the advantages of undergrounding. The benefits of investments that result in 10% faster outage restoration are also calculated by rerunning history with the faster restoration included.
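
The customer-hours-lost metric and the "rerun history" idea can be illustrated with a minimal Python sketch. The outage records below are hypothetical, and reading "10% faster restoration" as each outage duration being shortened by 10% is an assumption; the paper may define it differently.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    customers_affected: int
    duration_hours: float


def customer_hours_lost(outages, restoration_speedup=0.0):
    """Sum customers * duration over outages.

    restoration_speedup is the assumed fractional reduction in every
    outage duration (0.10 means each outage is 10% shorter).
    """
    if not 0.0 <= restoration_speedup < 1.0:
        raise ValueError("restoration_speedup must be in [0, 1)")
    factor = 1.0 - restoration_speedup
    return sum(o.customers_affected * o.duration_hours * factor for o in outages)


# Hypothetical outage history for one circuit (not data from the paper).
history = [Outage(120, 2.0), Outage(40, 5.5), Outage(300, 1.0)]

baseline = customer_hours_lost(history)
faster = customer_hours_lost(history, restoration_speedup=0.10)
reduction = 1.0 - faster / baseline  # fractional reduction in customer hours
```

Because the same historical outages are replayed under a different assumption, any change in the metric is attributable to the intervention rather than to differences in weather or exposure.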

2026-04-09T21:56:42Z Arslan Ahmad Ian Dobson Anne Kimber

http://arxiv.org/abs/2605.09193v1 Quantifying Time-Varying Physical Activity Intervention Effects via Functional Regression 2026-05-09T22:16:57Z

Physical activity (PA) intervention studies often collect repeated intensity measurements over long observation periods. Quantifying the variation in intervention effects over the study period is critical to evaluating and improving intervention strategies, yet many analyses reduce PA data into scalar summary measures, resulting in limited insights. We propose a functional regression framework, which captures time-varying intervention effects by modeling the entire PA trajectory as a functional observation. From both methodological and practical perspectives, we demonstrate the advantages of function-on-scalar regression (FoSR) over the traditional two-step approach of applying functional principal components analysis (FPCA) followed by regressing scores on covariates. The FoSR is further extended to a function-on-function regression (FoFR) for studying the association of PA across time periods. Methods are applied to daily step counts from the Social incentives to Encourage Physical Activity and Understand Predictors (STEP UP) study, revealing distinct and highly interpretable time-varying effects of three intervention strategies on PA and differences in their sustainability. Our case study highlights the feasibility of functional data analysis techniques for uncovering novel insights in intervention studies with high-dimensional endpoints.

2026-05-09T22:16:57Z Nidhi Pai Yu Lu Kristin A. Linn Erjia Cui

http://arxiv.org/abs/2605.09147v1 From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages 2026-05-09T20:15:18Z

Part-of-speech (POS) tagging for Medieval Romance languages remains challenging due to orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs under zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning settings. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings provide empirical insights into the applicability of modern neural methods to medieval text processing and offer practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.

2026-05-09T20:15:18Z Accepted at NLP4DH @ ACL 2026 Matthias Schöffel Esteban Garces Arias

http://arxiv.org/abs/2503.16027v2 Deep Gaussian Process Emulation with gradient Information and Sequential Design for Simulators with Sharp Variations 2026-05-09T19:35:34Z

Deep Gaussian Processes (DGPs) compose GP layers to warp inputs, enabling improved emulation of computer models with nonstationary input-output behavior compared with ordinary GPs. In contrast to GPs, the predictive uncertainty for DGP gradients remains relatively underexplored. Quantifying DGP gradient uncertainty can support gradient-based tasks in complex, nonstationary settings where ordinary GPs may struggle. While GP gradient posteriors are analytically tractable, extending such constructions to DGPs is challenging due to their hierarchical composition. In this paper, we propose an efficient approximation to the gradient distribution of a two-layer DGP emulator. Using the chain rule with local linearization, we derive closed-form expressions for the gradient mean and covariance, enabling fast gradient evaluation with uncertainty quantification (UQ). Empirically, our approach delivers promising performance while uniquely providing UQ of gradients. We then use the gradient uncertainties to guide sequential design for models with sharp variations: we define sharp variation regions as those where the gradient norm exceeds a threshold. We subsequently introduce an entropy-based acquisition rule that selects new samples in locations where the classification of points as inside versus outside the sharp-variation region is most uncertain. Experiments on synthetic benchmarks and a real-world application show that the resulting sequential design more accurately emulates functions with sharp variations than existing design methods.
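
The idea of sampling where the inside-versus-outside classification is most uncertain can be sketched with a binary entropy score. This is only an illustration: the candidate locations and the probabilities that the gradient norm exceeds the threshold are hypothetical inputs, and the paper's actual acquisition rule and the way it derives those probabilities from the DGP gradient posterior are not reproduced here.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) classification; zero at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def pick_next_point(exceed_prob):
    """Return the candidate whose sharp-region membership is most uncertain.

    exceed_prob maps a candidate location to an assumed probability that
    the gradient norm exceeds the threshold at that location.
    """
    return max(exceed_prob, key=lambda x: binary_entropy(exceed_prob[x]))


# Hypothetical candidates: location -> P(gradient norm > threshold).
candidates = {0.1: 0.02, 0.4: 0.55, 0.7: 0.97}
next_x = pick_next_point(candidates)
```

The entropy peaks at a probability of one half, so candidates that are confidently inside or confidently outside the sharp-variation region are passed over in favor of those near its boundary.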

2025-03-20T10:48:56Z Yiming Yang Deyu Ming Serge Guillas

http://arxiv.org/abs/2605.09116v1 Fit CATE Once: Model-Assisted Randomization Tests Without Sample Splitting 2026-05-09T19:03:33Z

Randomization tests and flexible treatment-effect models offer complementary strengths for analyzing data from randomized panel experiments: the former provide valid inference under the known assignment mechanism, while the latter can capture complex patterns of effect heterogeneity. We develop model-assisted randomization tests that combine these strengths without sample splitting. The key idea is to estimate an unsigned version of the conditional average treatment effect (CATE) from the covariance structure of residualized outcomes, while leaving the realized assignments for randomization inference. The remaining sign can be chosen to best fit the observed outcomes. We establish identification and consistency for the proposed unsigned CATE estimators, as well as validity for the CATE-assisted randomization tests. Across synthetic and semi-synthetic experiments, the CATE-assisted randomization tests control Type I error and achieve higher power than covariate-adjusted and sample-split alternatives. Finally, we show that the assignment-free CATE estimates can be used to discover heterogeneous subgroups and test subgroup-specific treatment effects.
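
For readers unfamiliar with randomization inference, the following minimal Python sketch shows a plain Monte Carlo randomization test with a difference-in-means statistic. It is not the CATE-assisted test proposed in the paper: the statistic, the complete-randomization design, and the toy data are all assumptions made for illustration.

```python
import random


def diff_in_means(outcomes, assignment):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [y for y, z in zip(outcomes, assignment) if z == 1]
    control = [y for y, z in zip(outcomes, assignment) if z == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def randomization_p_value(outcomes, assignment, n_draws=2000, seed=0):
    """Two-sided Monte Carlo p-value under the sharp null of no effect.

    Assignments are redrawn by permuting the observed labels, which matches
    a completely randomized design with a fixed number of treated units.
    """
    rng = random.Random(seed)
    observed = abs(diff_in_means(outcomes, assignment))
    shuffled = list(assignment)
    exceed = 0
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        if abs(diff_in_means(outcomes, shuffled)) >= observed:
            exceed += 1
    # The +1 terms count the observed assignment itself, keeping the test valid.
    return (1 + exceed) / (1 + n_draws)


# Toy data (hypothetical): the first five units are treated.
y = [5.1, 4.9, 5.3, 5.0, 5.2, 1.0, 1.2, 0.9, 1.1, 0.8]
z = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_effect = randomization_p_value(y, z)
p_null = randomization_p_value([1.0] * 6, [1, 1, 1, 0, 0, 0])
```

Validity here rests only on the known assignment mechanism, which is why any test statistic, including one built from a fitted model, can be plugged in as long as it is computed without using the realized assignments.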

2026-05-09T19:03:33Z 48 pages, 7 figures Fangnan Zheng Yao Zhang

http://arxiv.org/abs/1808.09448v2 Estimating the distribution of marks of a homogeneous marked Poisson process 2026-05-09T18:01:51Z

In this paper we propose an estimator of the distribution of events of different kinds in a homogeneous Poisson process. We give an explicit solution for the maximum likelihood estimator of the distribution and derive its strong consistency and asymptotic normality. We also provide an order restricted estimator of the distribution and derive its consistency and asymptotic distribution. The inference problem gives rise to a Sylvester-Ramanujan system of equations. We discuss application of the estimator to the detection of neutrons in a novel detector developed at the European Spallation Source in Lund, Sweden.

2018-08-28T14:32:58Z Dragi Anevski Vladimir Pastukhov

http://arxiv.org/abs/2605.06135v2 Linked-Tucker Factorized Individualized Regression for Paired Multivariate Categorical Outcomes 2026-05-09T17:55:53Z

We propose a joint individualized hurdle-ordinal regression model for paired zero-inflated ordinal outcomes with subject-specific, spatially varying, and time-varying covariate effects, motivated by the Iowa Fluoride Study (IFS). The two outcomes, dental caries and dental fluorosis, are measured repeatedly across ages at fine spatial resolution, yielding nested longitudinal data with substantial zero inflation, ordinality, and heterogeneity across individuals and locations. For each outcome, a hurdle component models disease presence, while a proportional-odds component models severity among positive observations. To parsimoniously represent the high-dimensional coefficient arrays, we introduce a linked Tucker tensor factorization. Shared subject-mode factors induce dependence between the caries and fluorosis coefficient tensors, while separate spatial factors accommodate the distinct measurement grids of tooth surfaces and tooth zones. A horseshoe prior on the core tensor elements encourages sparsity, and posterior computation is performed using the No-U-Turn Sampler in NumPyro. Population-level effect summaries are obtained by projecting individualized posterior linear predictors onto the design space, and Wasserstein barycenters aggregate these summaries across tooth locations and anatomical classes. Applied to the IFS, the model reveals spatially heterogeneous associations between early-life fluoride and dietary exposures and both outcomes. Fluoride exposure is associated with increased odds and severity of fluorosis, while soda intake consistently increases caries risk. These associations differ between presence and severity components and vary across tooth locations, ages, and subpopulations defined by prior caries status, highlighting the importance of the joint hurdle-ordinal framework for disentangling disease occurrence from disease progression in multilevel dental data.

2026-05-07T12:34:33Z Arkaprava Roy Jeremy T. Gaskins Steven Levy Somnath Datta

http://arxiv.org/abs/2605.09064v1 Bayesian decision theory for wildlife management under uncertainty: from inference to action 2026-05-09T17:14:54Z

Ecologists are increasingly expected to inform management decisions under uncertainty, yet most analytical workflows stop at statistical inference. This disconnect limits the practical impact of ecological modelling, particularly in high-stakes contexts such as wildlife management, where decisions must balance ecological, economic and social objectives. Bayesian decision theory provides a coherent framework to bridge this gap. It propagates uncertainty from posterior distributions to quantify the consequences of alternative actions through utility functions. Despite its strong theoretical foundations, it remains underused in ecology. Here, we present a practical workflow for implementing Bayesian decision theory using standard Bayesian tools. We illustrate the approach with two case studies. First, wolf management in France, where the decision consists of selecting the number of wolves that can be removed under uncertainty about population dynamics. Second, invasive muskrat management in the Netherlands, where the decision involves allocating a fixed control effort across space. In both cases, expected utility is computed from posterior simulations, explicitly accounting for uncertainty and trade-offs. Results show that optimal decisions emerge as a compromise between competing objectives. In the wolf case, optimal harvest balances removal benefits and population risk. In the muskrat case, optimal effort increases with the importance of population reduction and is unevenly allocated across provinces. These examples show that Bayesian decision theory can be implemented as a direct extension of standard inference. By making trade-offs explicit, it enhances transparency, reproducibility, and relevance for management. More broadly, it provides a flexible basis for integrating ecological modelling with decision-making.
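
Computing expected utility from posterior simulations can be sketched in a few lines of Python. Everything specific below is hypothetical and chosen only for illustration: the simulated "posterior" draws of a growth rate, the population size, the utility function, and the penalty weight are not taken from the wolf or muskrat case studies.

```python
import random


def expected_utility(action, posterior_draws, utility):
    """Average the utility of one action over posterior draws."""
    return sum(utility(action, theta) for theta in posterior_draws) / len(posterior_draws)


def best_action(actions, posterior_draws, utility):
    """Return the action with the highest Monte Carlo expected utility."""
    return max(actions, key=lambda a: expected_utility(a, posterior_draws, utility))


# Hypothetical setup: current population, and a floor we prefer not to cross.
N0 = 1000
FLOOR = 1000
PENALTY = 5.0  # assumed cost per animal below the floor next year


def utility(removed, growth):
    """Removal benefit minus a penalty if next year's population falls below FLOOR."""
    next_pop = (N0 - removed) * growth
    return removed - PENALTY * max(0.0, FLOOR - next_pop)


# Stand-in for MCMC output: draws of the annual growth rate.
rng = random.Random(0)
growth_draws = [rng.gauss(1.10, 0.05) for _ in range(2000)]

actions = range(0, 201, 10)  # candidate numbers of animals to remove
optimal = best_action(actions, growth_draws, utility)
```

In this toy setting the optimum is an interior value: removing nothing forgoes the benefit, while removing too many makes the penalty likely under plausible growth rates, which mirrors the compromise between competing objectives described in the abstract.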

2026-05-09T17:14:54Z 3 figures Olivier Gimenez Abby Keller Cyril Milleret

http://arxiv.org/abs/2605.08873v1 CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization 2026-05-09T10:51:58Z

Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.
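
The sparse-reward failure mode can be seen in the group-relative advantage used by standard GRPO, sketched below in Python. This shows only the commonly used normalization (reward minus group mean, divided by group standard deviation); it is not the CoDistill-GRPO objective, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields informative advantages of about +1 and -1.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# On a task too hard for the model, every rollout scores zero, so every
# advantage is zero and the policy gradient carries no signal.
sparse = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

A dense auxiliary reward, such as the knowledge-distillation reward described in the abstract, breaks ties within such groups so that the advantages are no longer identically zero.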

2026-05-09T10:51:58Z Soo Min Kwon Ziteng Sun Ananda Theertha Suresh Himanshu Jain Sanjiv Kumar

http://arxiv.org/abs/2511.14091v2 State-Space Representation of INGARCH Models and Their Application in Insurance 2026-05-09T10:45:20Z

Integer-valued generalized autoregressive conditional heteroskedastic (INGARCH) models are a popular framework for modeling serial dependence in count time-series. While convenient for modeling, prediction, and estimation, INGARCH models lack a clear theoretical justification for the evolution step. This limitation not only makes interpretation difficult and complicates the inclusion of covariates, but can also make the handling of missing data computationally burdensome. Consequently, applying such models in an insurance context, where covariates and missing observations are common, can be challenging. In this paper, we first introduce the marginalized state-space model (M-SSM), defined solely through the marginal distribution of the observations, and show that INGARCH models arise as special cases of this framework. The M-SSM formulation facilitates the natural incorporation of covariates and missing data mechanisms, and this representation in turn provides a coherent way to incorporate these elements within the INGARCH model as well. We then demonstrate that an M-SSM can admit an observation-driven state-space model (O-SSM) representation when suitable assumptions are imposed on the evolution of its conditional mean. This lifting from an M-SSM to an O-SSM provides a natural setting for establishing weak stationarity, even in the presence of heterogeneity and missing observations. The proposed ideas are illustrated through the Poisson and the Negative-Binomial INGARCH(1,1) models, highlighting their applicability in predictive analysis for insurance data.
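
As background, the "evolution step" of the standard Poisson INGARCH(1,1) model is the conditional-mean recursion lambda_t = omega + alpha * y_{t-1} + beta * lambda_{t-1}. The Python sketch below implements that textbook recursion only; the parameter values, the initial value `lam0`, and the count series are hypothetical, and the M-SSM and O-SSM constructions of the paper are not shown.

```python
def ingarch_conditional_means(counts, omega, alpha, beta, lam0):
    """Return [lambda_1, ..., lambda_{n+1}] for a Poisson INGARCH(1,1) model.

    lambda_1 is set to lam0; each later value follows
    lambda_t = omega + alpha * y_{t-1} + beta * lambda_{t-1}.
    The final entry is the one-step-ahead forecast mean.
    """
    if omega <= 0 or alpha < 0 or beta < 0:
        raise ValueError("require omega > 0 and alpha, beta >= 0")
    lams = [lam0]
    for y in counts:
        lams.append(omega + alpha * y + beta * lams[-1])
    return lams


def stationary_mean(omega, alpha, beta):
    """Unconditional mean omega / (1 - alpha - beta), valid when alpha + beta < 1."""
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be below 1 for a stationary mean")
    return omega / (1.0 - alpha - beta)


# Hypothetical claim counts and parameters.
lams = ingarch_conditional_means([2, 0, 3], omega=1.0, alpha=0.3, beta=0.5, lam0=2.0)
long_run = stationary_mean(1.0, 0.3, 0.5)
```

The recursion needs every past count, which hints at why a missing observation is awkward to handle directly in this form.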

2025-11-18T03:20:24Z Jae Youn Ahn Hong Beng Lim Mario V. Wüthrich

http://arxiv.org/abs/2605.08532v1 Accounting for variable detection functions in temporal abundance modeling via transfer learning 2026-05-08T22:38:51Z

Relative abundance, measured as the number of animals caught per unit of sampling effort (CPUE), is commonly used to monitor fish and wildlife populations, largely because sampling methods are cost-effective to implement. Modeling relative abundance, however, requires the assumption that the detection probability is constant across sampling events. This assumption is likely not valid, as the probability of detection often varies as a function of several factors, including the characteristics of individual animals and environmental conditions at the time of sampling. In contrast, methods to estimate absolute abundance, such as capture-recapture (CR), account for variable detection, but are often infeasible to implement across large spatiotemporal scales. Despite this, CR data are sometimes available for species of interest, albeit at smaller spatiotemporal extents. Leveraging information on detection probabilities from CR data to help inform estimates from widely available CPUE data could strengthen inferences about the status of fish and wildlife populations. We propose an approach to (i) learn the effect of environmental covariates on detection probabilities from CR data and (ii) transfer these detection functions to CPUE models for improved inference. A simulation study shows that this approach improves estimates of abundance and the ability to detect temporal trends. We apply our transfer learning method using CR and CPUE data to recreationally important smallmouth bass (\textit{Micropterus dolomieu}) fisheries in rivers in Pennsylvania, USA.

2026-05-08T22:38:51Z Kevin M. Collins Erin M. Schliep Tyler Wagner Christopher K. Wikle

http://arxiv.org/abs/2505.23113v3 Valid F-screening in linear regression 2026-05-08T22:03:56Z

Suppose that a data analyst wishes to report the results of a least squares linear regression only if the overall null hypothesis, $H_0^{1:p}: β_1= β_2 = \ldots = β_p=0$, is rejected. This practice, which we refer to as F-screening (since the overall null hypothesis is typically tested using an $F$-statistic), is in fact common across a number of applied fields. Unfortunately, it poses a problem: standard guarantees for the inferential outputs of linear regression, such as Type 1 error control of hypothesis tests and nominal coverage of confidence intervals, hold unconditionally, but fail to hold conditional on rejection of the overall null hypothesis. In this paper, we develop an inferential toolbox for the coefficients in a least squares model that is valid conditional on rejection of the overall null hypothesis. We develop selective p-values that lead to tests that are consistent and control the selective Type 1 error, i.e., the Type 1 error conditional on having rejected the overall null hypothesis. Furthermore, they can be computed without access to the raw data, i.e., using only the standard outputs of a least squares linear regression, and therefore are suitable for use in a retrospective analysis of a published study. We also develop confidence intervals that attain nominal selective coverage, and point estimates that account for having rejected the overall null hypothesis. We derive an expression for the Fisher information about the coefficients resulting from the proposed approach, and compare this to the Fisher information that results from an alternative approach that relies on sample splitting. We investigate the proposed approach in simulation and via re-analysis of two datasets from the biomedical literature.

2025-05-29T05:32:45Z Olivia McGough Daniela Witten Daniel Kessler

http://arxiv.org/abs/2605.08395v1 Statistical Design of Pragmatic Trials Using Electronic Health Record Data when Outcome Assessments are Uncontrolled and Irregular 2026-05-08T19:03:33Z

Pragmatic trials increasingly define outcomes using real-world data such as electronic health records, where assessments are collected during routine care rather than at fixed timepoints. Consequently, these uncontrolled assessments may be irregular, sparse, and affected by the intervention (intervention-dependent assessments), which can lead to biased treatment effect estimates. We developed a simulation study to inform the statistical approach for trials with uncontrolled assessments, which we applied to the MI-CARE pragmatic trial. Using a pre-trial cohort mimicking eligibility and outcome measurement, we estimated assessment frequency and timing and combined these estimates with assumptions about how the intervention effects might impact assessment. We simulated sparse and intervention-dependent assessments and compared single-measure approaches with longitudinal models using all scores. Under intervention-dependent assessments, we found that naive methods such as using the best score or using a randomly selected score without adjusting for measurement timing produced substantial bias. Models that adjusted flexibly for the follow-up timing estimated time-point specific or time-averaged treatment effects without bias. Simulation results informed the selection of the statistical approach for the MI-CARE trial. Among unbiased methods, the most powerful was a linear mixed model with exponential correlation structure, adjustment for time since baseline, and a time-varying intervention effect to estimate the intervention effect at the end of the intervention window. Future studies can use pre-trial data to conduct a simulation study tailored to the trial's data features to inform the analytic approach. Trials with uncontrolled assessments should consider the potential for intervention-dependent assessments and select an appropriate method to avoid bias.

2026-05-08T19:03:33Z 24 pages, 2 figures; includes supplementary material Jennifer F. Bobb Sungtaek Son Melissa L. Anderson Noorie Hyun Lynn L. DeBar Katharine A. Bradley