https://arxiv.org/api/7dsNkQEAN7Zmg8ImaCkn+6lkxVs
2026-06-21T20:41:24Z
2358275015

http://arxiv.org/abs/2604.21647v1
Exploring climate change effects on concurrent floods and concurrent droughts via statistical deep learning
2026-04-23T13:09:44Z
Concurrent floods and concurrent droughts in nearby catchments pose challenges to risk assessment and water management. Climate change is affecting extremely high and low discharge, but the complex interplay between changes in individual catchments and in the dependence across catchments makes it difficult to provide accurate assessments of the occurrence probabilities of concurrent extremes. In this work, we use a contemporary statistical deep learning model (the deep SPAR framework) to capture concurrent river floods and droughts in four catchments in the Upper Danube basin, based on discharge simulated by a hydrological model driven with large ensemble climate model output. The statistical model is able to accurately capture the multivariate extremes of the simulated discharge, which we assess by making use of the large available sample size. We subsequently use our statistical model to study changes in joint tail behaviour of discharge over time, finding that both compound flooding and drought-like conditions are becoming increasingly likely towards the end of the 21st century under a high-emission scenario. In particular, our results highlight that changes in the dependence structure of extremes strongly contribute to the detected changes, an aspect that would be difficult to capture with traditional approaches. This work paves the way for highly flexible, general inference on compound extremes in hydrological applications, and demonstrates key advantages of using statistical deep learning in this setting.
2026-04-23T13:09:44Z
C. J. R. Murphy-Barltrop, J. Richards, B. Poschlod, A. Sasse, J. Zscheischler

http://arxiv.org/abs/2604.21545v1
Informed Asymmetric Dirichlet Priors for Multivariate Bernoulli Mixture Models
2026-04-23T11:11:41Z
Clustering multivariate binary data is of interest in many scientific fields, including ecology, biomedicine, and social policy. Beyond heuristic clustering algorithms, such data can be modelled using multivariate Bernoulli mixture models. Many Bayesian implementations of these models involve a trade-off between computational efficiency and full posterior inference. We propose instead a Bayesian approach able to provide both aspects. The method fixes the total number of components to a large value and employs an asymmetric Dirichlet prior on the mixture weights. The asymmetric Dirichlet hyperparameters are elicited using the popular Penalized Complexity prior framework, which provides an intuitive way for users to inform the induced distribution of the number of clusters. An efficient MCMC algorithm is then developed to fit the model. Simulations and real-world applications demonstrate that the method is competitive with existing alternatives and can outperform them in certain settings. The proposal is illustrated using an ecological dataset about presence-absence of species across multiple sites, where cluster-specific parameters are modelled on the basis of environmental conditions. Overall, the proposed method provides a computationally efficient, fully Bayesian, and interpretable framework for clustering multivariate binary data, with potential applications across diverse scientific domains.
2026-04-23T11:11:41Z
44 pages, 11 figures
Luisa Ferrari, Maria Franco Villoria, Garritt L. Page, Alex Laini

http://arxiv.org/abs/2604.21498v1
Analyzing directional errors in spatial orientation using nonparametric circular regression with mixed covariates
2026-04-23T10:01:03Z
Spatial orientation is a fundamental cognitive skill that relies on sensory information to update perceived direction.
Understanding how sensory conditions influence directional accuracy is important for both cognitive science and the design of assistive technologies. We analyze experimental data in which blind, low-vision, and sighted participants performed spatial updating tasks under five sensory conditions, with signed angular error as the response. To model these data, we propose a nonparametric circular regression framework that accommodates both continuous and categorical predictors via a product-kernel estimator. Bandwidth selection is crucial in this setting, yet developing practical data-driven methods remains challenging. We derive asymptotic bias and variance expressions for the estimator, though these results do not directly lead to a feasible plug-in bandwidth selector. To address this, we develop a bootstrap bandwidth selection criterion tailored to the cosine loss and compare it with cross-validation and rule-of-thumb approaches in simulation studies. Applied to the spatial updating data, the proposed framework reveals nonlinear, condition-specific patterns and quantifies uncertainty via simultaneous bootstrap confidence bands. Across the scenarios considered, the proposed bootstrap selector achieves a favorable bias-variance trade-off and yields stable inference relative to the competing methods. An implementation is available in the R package circMixedReg.
2026-04-23T10:01:03Z
33 pages, 13 figures, 3 tables
Mario Francisco-Fernández, Andrea Meilán-Vila

http://arxiv.org/abs/2604.21491v1
Benchmarking the Utility of Privacy-Preserving Cox Regression Under Data-Driven Clipping Bounds: A Multi-Dataset Simulation Study
2026-04-23T09:53:15Z
Differential privacy (DP) is a mathematical framework that guarantees individual privacy; however, systematic evaluation of its impact on statistical utility in survival analyses remains limited.
In this study, we systematically evaluated the impact of DP mechanisms (Laplace mechanism and Randomized Response) with data-driven clipping bounds on the Cox proportional hazards model, using 5 clinical datasets ($n = 168$--$6{,}524$), 15 levels of $\varepsilon$ (0.1--1000), and $B = 1{,}000$ Monte Carlo iterations. The data-driven clipping bounds used here are observed min/max and therefore do not provide formal $\varepsilon$-DP guarantees; the results represent an optimistic lower bound on utility degradation under formal DP. We compared three types of input perturbations (covariates only, all inputs, and the discrete-time model) with output perturbations (dfbeta-based sensitivity), using loss of significance rate (LSR), C-index, and coefficient bias as metrics. At standard DP levels ($\varepsilon \leq 1$), approximately 90% (90--94%) of the significant covariates lost significance, even in the largest dataset ($n = 6{,}524$), and the predictive performance approached random levels (test C-index $\approx 0.5$) under many conditions. Among the input perturbation approaches, perturbing only covariates preserved the risk-set structure and achieved the best recovery, whereas output perturbation (dfbeta-based sensitivity) maintained near-baseline performance at $\varepsilon \geq 5$. At $n \approx 3{,}000$, the significance recovered rapidly at $\varepsilon = 3$--10; however, in practice, $\varepsilon \geq 10$ (for predictive performance) to $\varepsilon \geq 30$--60 (for significance preservation) is required. In the moderate-to-high $\varepsilon$ range, false-positive rates increased for variables whose baseline $p$-values were near the significance threshold.
2026-04-23T09:53:15Z
11 pages, 6 figures, 5 tables. Supplementary material (5 pages, 2 figures, 3 tables) included as ancillary file.
Submission to IEEE Journal of Biomedical and Health Informatics (J-BHI)
Keita Fukuyama, Yukiko Mori, Tomohiro Kuroda, Hiroaki Kikuchi

http://arxiv.org/abs/2506.04292v3
GARG-AML against Smurfing: A Scalable and Interpretable Graph-Based Framework for Anti-Money Laundering
2026-04-23T09:21:51Z
Purpose: We introduce GARG-AML, a fast and transparent graph-based method to catch `smurfing', a common money-laundering tactic. It assigns a single, easy-to-understand risk score to every account in both directed and undirected networks. Unlike overly complex models, it balances detection power with the speed and clarity that investigators require.
Methodology: The method maps an account's immediate and secondary connections (its second-order neighbourhood) into an adjacency matrix. By measuring the density of specific blocks within this matrix, GARG-AML flags patterns that mimic smurfing behaviour. We further boost the model's performance using decision trees and gradient-boosting classifiers, testing the results against the current state of the art on both synthetic and open-source data.
Findings: GARG-AML matches or beats state-of-the-art performance across all tested datasets. Crucially, it easily processes the massive transaction graphs typical of large financial institutions. By leveraging only the adjacency matrix of the second-order neighbourhood and basic network features, this work highlights the potential of fundamental network properties towards advancing fraud detection.
Originality: The originality lies in the translation of human expert knowledge of smurfing directly into a simple network representation, rather than relying on uninterpretable deep learning. Because GARG-AML is built expressly for the real-world business demands of scalability and interpretability, banks can easily incorporate it in their existing AML solutions.
2025-06-04T11:30:37Z
Bruno Deprez, Bart Baesens, Tim Verdonck, Wouter Verbeke

http://arxiv.org/abs/2604.21457v1
Context-Aware Displacement Estimation from Mobile Phone Data: A Methodological Framework
2026-04-23T09:14:11Z
Timely population displacement estimates are critical for humanitarian response during disasters, but traditional surveys and field assessments are slow. Mobile phone data enables near real-time tracking, yet existing approaches apply uniform displacement definitions regardless of individual mobility patterns, misclassifying regular commuters as displaced. We present a methodological framework addressing this through three innovations: (1) mobility profile classification distinguishing local residents from commuter types, (2) context-aware between-municipality displacement detection accounting for expected location by user type and day of week, and (3) operational uncertainty bounds derived from baseline coefficient of variation with a disaster adjustment factor, intended for humanitarian decision support rather than formal statistical inference. The framework produces three complementary metrics scaled to population with uncertainty bounds: displacement rates, origin-destination flows, and return dynamics. An Aparri case study following Super Typhoon Nando (2025, Philippines) applies the framework to vendor-provided daily locations from Globe Telecom. Context-aware detection reduced estimated between-municipality displacement by 1.6-2.7 percentage points on weekdays versus naive methods, attributable to the commuter exception but not independently validated.
The method captures between-municipality displacement only. Within-municipality evacuation falls outside scope. The single-case demonstration establishes proof of concept. External validity requires application across multiple events and locations. The framework provides humanitarian actors with operational displacement information while preserving individual privacy through aggregation.
2026-04-23T09:14:11Z
24 pages, 4 figures, 14 tables. Case study: Super Typhoon Nando, Philippines (2025)
Rajius Idzalika, Muhammad Rheza Muztahid, Radityo Eko Prasojo

http://arxiv.org/abs/2604.21372v1
Optimal basis risk weighting in expectile-based parametric insurance
2026-04-23T07:35:29Z
Parametric insurance contracts translate index measurements to compensation for policyholders' losses using predefined payment schemes. These need to be designed carefully to keep basis risk, i.e. the disparity between payouts and true damages, small. Previous research has motivated the use of conditional expectiles as payment schemes, whose compensation is impacted by the policyholder's potentially unknown attitude towards basis risk. To alleviate this model uncertainty and to investigate the impact of (hidden) influencing factors, we characterize existence and uniqueness of the optimal basis risk weighting in a utility-maximization framework through a set of boundary conditions. In the absence of an optimal solution, we provide comparisons to the utility of no insurance and full indemnity coverage. We establish a link between location-scale distributions and separability of conditional expectiles' derivatives, thus improving the understanding of these statistical functionals.
A simulation study on parametric hurricane insurance visualizes our results, investigates the influence of premium loading and risk aversion on the optimal weighting, and comments on the challenge of (spatial) loss dependence.
2026-04-23T07:35:29Z
Markus Johannes Maier, Matthias Scherer

http://arxiv.org/abs/2604.21292v1
Large values in time series and additive combinatorics
2026-04-23T05:11:05Z
It is well-known in industrial data science that large values of real-life time series tend to be structured and often follow concrete and visible patterns. In this paper, we use ideas from additive combinatorics and discrete Fourier analysis to give this heuristic a mathematical foundation. Our main tool is the Fourier ratio, a complexity measure previously used in compressed sensing, combined with a generalized version of Chang's lemma from additive combinatorics. Together, these yield a precise prediction: when the Fourier ratio of a time series is small, the set of its largest values can be additively generated by a very small set using only $\{-1,0,1\}$ coefficients. We test this prediction on US inflation data and Delhi climate data, both in their original form and after mean-centering. The numerical results confirm the predicted structure: a generating set of size $4$--$7$ suffices to span large spectra containing dozens of points, even when the Fourier ratio is large enough that our theoretical bounds become loose. These findings provide a rigorous explanation for why extreme values in real-world data are information-rich and structurally significant.
2026-04-23T05:11:05Z
13 pages, 6 figures
Alex Iosevich, Vishal Gupta

http://arxiv.org/abs/2604.21115v1
Complex Approximate Message Passing with Non-separable Denoising
2026-04-22T21:58:28Z
Approximate Message Passing (AMP) is a general framework for iterative algorithms, originally developed for compressed sensing and later extended to a wide range of high-dimensional inference problems.
Although recent work has advanced matrix AMP, complex AMP, and AMP for non-separable functions independently, a unified state evolution theory for complex AMP with non-separable denoisers has been lacking. This article fills that gap by establishing state evolution in the setting of complex, non-separable denoising functions. The proposed approach constructs an augmented real-valued system that lifts the problem to a higher-dimensional space, then recovers the complex domain through a many-to-one canonical transformation. Under this construction, the Onsager correction naturally involves Wirtinger derivatives, and the resulting state evolution reduces to scalar complex recursions despite the non-separable structure of the denoisers. The framework extends to the matrix-valued setting, accommodating multiple feature vectors simultaneously. This generalization enables AMP to exploit joint structural constraints, such as simultaneous group and element sparsity, in complex-valued recovery problems. The complex sparse group least absolute shrinkage and selection operator (LASSO) serves as a key instantiation, motivated by preamble detection in Orthogonal Time-Frequency Space (OTFS)-based unsourced random access. Numerical experiments confirm that state evolution accurately predicts performance and show that complex non-separable denoising can produce significant gains over separable and real-valued alternatives.
2026-04-22T21:58:28Z
Vishnu Teja Kunde, Alessandro Mirri, Jean-Francois Chamberland, Enrico Paolini

http://arxiv.org/abs/2604.21067v1
The geometry of conflict : 3D Spatio-temporal patterns in fatalities prediction
2026-04-22T20:20:58Z
Understanding how conflict events spread over time and space is crucial for predicting and mitigating future violence. However, progress in this area has been limited by the lack of methods capable of capturing the intricate, dynamic patterns of conflict diffusion.
The complex nature of these trends calls for flexible models to untangle them. This study addresses this gap by analyzing spatio-temporal conflict fatality data using an innovative approach that transforms the data into three-dimensional patterns at the Prio-Grid level. In this paper, a shape-based model called ShapeFinder is adapted. By applying the Earth Mover's Distance (EMD) algorithm, we detect and classify these patterns, allowing us to compare and match patterns with high adaptive capacity in all dimensions. Using historically similar patterns, we generate predictions of conflict fatalities and compare these with forecasts from the Views ensemble model, a leading benchmark. Our findings demonstrate that recognizing and analyzing conflict diffusion patterns significantly improves predictive accuracy, outperforming the benchmark model. This research contributes to the study of conflict dynamics by introducing a novel pattern recognition framework that enhances the analysis of spatio-temporal data and offers practical applications for early warning systems.
2026-04-22T20:20:58Z
68 Pages, 34 figures
Thomas Schincariol

http://arxiv.org/abs/2409.07609v2
Survival of the Cheapest: Cost-Aware Hardware Adaptation for Adversarial Robustness
2026-04-22T17:36:44Z
Deploying adversarially robust machine learning systems requires continuous trade-offs between robustness, cost, and latency. We present an autonomic decision-support framework providing a quantitative foundation for adaptive hardware selection and hyper-parameter tuning in cloud-native deep learning. The framework applies accelerated failure time (AFT) models to quantify the effect of hardware choice, batch size, epochs, and validation accuracy on model survival time.
This framework can be naturally integrated into an autonomic control loop (monitor--analyse--plan--execute, MAPE-K), where system metrics such as cost, robustness, and latency are continuously evaluated and used to adapt model configurations and hardware selection. Experiments across three GPU architectures confirm the framework is both sound and cost-effective: the Nvidia L4 yields a 20% increase in adversarial survival time while costing 75% less than the V100, demonstrating that expensive hardware does not necessarily improve robustness. The analysis further reveals that model inference latency is a stronger predictor of adversarial robustness than training time or hardware configuration.
2024-09-11T20:43:59Z
Charles Meyers, Mohammad Reza Saleh Sedghpour, Tommy Löfstedt, Erik Elmroth

http://arxiv.org/abs/2604.20625v1
Dynamic Prediction of the Target Survival Time in Metastatic Solid Tumor Cancer Clinical Trials
2026-04-22T14:40:03Z
Overall survival (OS) is the gold standard for assessing patient benefit and cost-effectiveness of new cancer drugs. However, it is often difficult to use OS as the primary endpoint in randomized clinical trials (RCTs) for patients with metastatic cancer due to multiple reasons. In recent years, progression-free survival (PFS) has increasingly been used as the primary endpoint in metastatic cancer RCTs to accelerate development. However, regulatory authorities often seek mature OS data for approval. Therefore, it is critical to determine the target time when OS data are expected to be mature for reliable statistical inference. Motivated by an advanced renal cell carcinoma (RCC) clinical trial, we develop and investigate different prediction models leveraging information from disease progression to improve target OS prediction times. We propose a multivariate joint modeling approach considering components of progression and OS and extend three models commonly used for association to be used for OS prediction.
To the best of our knowledge, this is the first comprehensive statistical study exploring the prediction of OS using different levels of information on disease progression and illustrating these models using a real, complex dataset. Our findings have significant implications for OS prediction.
2026-04-22T14:40:03Z
Sidi Wang, Kelley Kidwell, Bo Huang, Satrajit Roychoudhury

http://arxiv.org/abs/2604.20611v1
Bayesian Inference for Incomplete 2x2 Diagnostic Tables
2026-04-22T14:26:32Z
Incomplete reporting of diagnostic accuracy data remains a persistent problem in medical research. In many studies, only part of the 2x2 diagnostic table is reported, leaving denominators for diseased and non-diseased groups unknown and preventing direct calculation of sensitivity, specificity, predictive values, and related operating characteristics. To address this limitation, we develop hierarchical Bayesian models for reconstructing incomplete 2x2 diagnostic tables from such partial information. Two motivating scenarios are considered: one in which only a single test-outcome row is observed, and another in which true positives, false positives, and the total sample size are reported but the remaining cells are missing. The proposed models are illustrated on a benchmark breast MRI study with complete counts, treated as partially observed in order to assess reconstruction performance under controlled missingness. The framework yields posterior inference for the missing cell counts and associated diagnostic measures, together with uncertainty quantification in weakly identified settings.
2026-04-22T14:26:32Z
21 pages, 10 tables.
Supplementary materials and reproducible code available at https://github.com/saraantonijevic/bayesian_diagnostic_table-reconstruction
Sara Antonijevic, Danielle Sitalo, Brani Vidakovic

http://arxiv.org/abs/2505.13106v5
How to optimise tournament draws: The case of the FIFA World Cup
2026-04-22T13:12:46Z
The organisers of major sports competitions use different policies with respect to constraints in the group draw. Our paper aims to rationalise these choices by analysing the trade-off between attractiveness (the number of games played by teams from the same geographic zone) and fairness (the departure of the draw mechanism from a uniform distribution). A parametric optimisation model is formulated and applied to the 2018 and 2022 FIFA World Cup draws. A flaw of the draw procedure is identified: the pre-assignment of the host to a group unnecessarily increases the distortions. All Pareto efficient sets of draw constraints are determined via simulations. The proposed framework can be used to find the optimal draw rules and justify the non-uniformity of the draw procedure for the stakeholders.
2025-05-19T13:36:00Z
32 pages, 8 figures, 6 tables
International Transactions in Operational Research, 2026, forthcoming
László Csató

http://arxiv.org/abs/2503.16744v3
Modeling and forecasting subnational age distribution of death counts
2026-04-22T12:18:54Z
Existing mortality forecasting methods focus on age-specific mortality rates, which lie in an unconstrained space and overlook the distributional nature of life-table death counts. Few studies have developed and compared forecasting methods that model the shape and dynamics of the age distribution of deaths, especially at the subnational level, where data quality varies greatly. This paper presents several forecasting methods to model and forecast the subnational age distribution of death counts.
The age distribution of death counts has many similarities to probability density functions, which are non-negative and have a constrained integral, and thus live in a constrained nonlinear space. To address the nonlinear nature of objects, we implement a cumulative distribution function transformation that is scale-free and has additional monotonicity. Using subnational Japanese life-table death counts from the Japanese Mortality Database (2025), we evaluate the forecast accuracy of the transformation and forecasting methods. The improved forecast accuracy of life-table death counts implemented here will be of great interest to demographers in estimating regional age-specific survival probabilities and life expectancy, and to actuaries as a foundation for exploring potential applications in determining annuity prices for various ages and maturities.
2025-03-20T23:11:50Z
45 pages, 9 figures, 7 tables
Han Lin Shang, Cristian F. Jiménez-Varón