https://arxiv.org/api/maAnk+RS89c4iunsM0307uPNyK02026-06-14T08:38:09Z7835430015http://arxiv.org/abs/2509.10825v6CUBE: Contrastive Understanding by Balanced Experiments2026-06-04T12:25:50ZPost-hoc explanation depends on how model queries are organized. We propose CUBE, a design-based framework that explains a trained predictive model through balanced low--high probes. Selected variables define factors, designed feature-level combinations define query conditions, and model predictions are summarized as factorial contrasts. CUBE reports main effects and pairwise interactions as controlled readings of average and conditional response changes over a declared design space. Experiments on synthetic and real tabular tasks show that CUBE recovers dominant learned effect structure, clarifies query-efficient identifiability, and supports screening--follow-up refinement.2025-09-13T14:44:45ZThe core framework and main claims remain unchanged; the manuscript has been revised for clarity, presentation, and consistencyDongseok KimHyoungsun ChoiMohamed Jismy Aashik RasoolGisung Ohhttp://arxiv.org/abs/2605.08318v2When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains2026-06-04T12:22:50ZWe study the problem of \emph{architecture selection} for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the \textbf{Multi-Scale Attention Transformer} (\msat{}), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines -- including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO) -- across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. \msat{} achieves state-of-the-art generalization on complex geometry problems ($L^2_\mathrm{rel} = 0.0101$ on Heat2D-CG, a $3.7\times$ improvement over FNO) at $34\,\mathrm{s}$ total inference vs.\ $120{,}812\,\mathrm{s}$ for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity $κ$ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.2026-05-08T15:23:13ZSubstantial Revision RequiredBrandon YeePairie KohJack RodriguezMihir Tekalhttp://arxiv.org/abs/2405.05097v9Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional propagation of values and densities2026-06-04T12:12:01ZRecently a million of biological neurons (BNN) has turned out better from modern RL methods in playing Pong~\cite{RL}, reminding they are still qualitatively superior e.g. in learning, flexibility and robustness - suggesting to try to improve current artificial e.g. MLP/KAN for better agreement with biological. There is proposed extension of KAN approach to neurons containing model of local joint distribution: $ρ(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ for $\mathbf{x} \in [0,1]^d$, adding interpretation and information flow control to KAN, and allowing to gradually add missing 3 basic properties of biological: 1) biological axons propagate in both directions~\cite{axon}, while current artificial are focused on unidirectional propagation - joint distribution neurons can repair by substituting some variables to get conditional values/distributions for the remaining. 2) Animals show risk avoidance~\cite{risk} requiring to process variance, and generally real world rather needs probabilistic models - the proposed can predict and propagate also distributions as vectors of moments: (expected value, variance) or higher. 3) biological neurons require local training, and beside backpropagation, the proposed allows many additional ways, like direct training, through tensor decomposition, or finally local and promising: information bottleneck. Proposed approach is very general, can be also used as extension of softmax in embeddings of e.g. transformer, JEPA, Mamba, suggesting interpretation that features are mixed moments of joint density of real-world properties.2024-05-08T14:49:27Z12 pages, 17 figuresJarek Dudahttp://arxiv.org/abs/2606.06043v1Adaptive Learning Rates with Surrogate Probability for Follow-the-Perturbed-Leader2026-06-04T11:36:08ZFollow-the-regularized-leader framework has shown effectiveness and flexibility in online learning problems, where the choice of learning rates are known to be crucial. Recently, adaptive learning rates defined in terms of the arm-selection probabilities, obtained by solving convex optimization, have achieved improved best-of-both-worlds (BOBW) guarantees in various bandit problems. In contrast, BOBW guarantees for its computationally efficient alternative, follow-the-perturbed-leader (FTPL), remain relatively limited since its optimization-free nature ironically makes the design of adaptive, probability-dependent learning rates non-trivial. To address this challenge, we propose an adaptive learning rate for FTPL by introducing surrogate probability functions that can be computed only from the available quantities, without requiring the exact probabilities. Based on these learning rates with surrogate functions, we provide the BOBW guarantee for FTPL with Pareto perturbations for any shape parameter $α>1$, generalizing prior results restricted to specific choices of $α=2$. We further show the BOBW guarantees for FTPL with adaptive learning rates in the bandit problem with expert advices. Our approach preserves the computational simplicity of FTPL while enabling probability-dependent adaptivity, and the surrogate-based methodology may be of independent interest in other algorithmic frameworks beyond FTPL and learning rate designs.2026-06-04T11:36:08ZTBA COLT2026Jongyeong LeeJunya HondaShinji ItoChansoo Kimhttp://arxiv.org/abs/2505.11006v6Is Supervised Learning Really That Different from Unsupervised?2026-06-04T10:01:10ZWe demonstrate how supervised learning can be decomposed into a two-stage procedure, where (1) all model parameters are selected in an unsupervised manner, and (2) the outputs y are added to the model, without changing the parameter values. This is achieved by a new model selection criterion that - in contrast to cross-validation - can be used also without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate that versions of linear and kernel ridge regression, smoothing splines, k-nearest neighbors, random forests, and neural networks, trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.2025-05-16T08:51:44ZPaper accepted at AISTATS 2026Oskar AllerboThomas B. Schönhttp://arxiv.org/abs/2606.05957v1Dead Directions: Geometric Singular Learning2026-06-04T09:54:08ZSingular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity $m$, the singular fluctuation $ν$ (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient $Θ/G$ under gradient flow on a $G$-invariant metric; SGD qualifies, standard Adam does not, and we construct a $G$-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe's triple $(λ, m, ν)$ from one checkpoint's forward and backward passes, without posterior sampling.2026-06-04T09:54:08Z139 pages, 13 figures, 13 tablesTejas Pradeep Shirodkarhttp://arxiv.org/abs/2601.16195v3Pushing the limits of unconstrained machine-learned interatomic potentials2026-06-04T09:53:42ZMachine-learned interatomic potentials (MLIPs) are increasingly used to replace computationally demanding electronic-structure calculations to model matter at the atomic scale. The most commonly used model architectures are constrained to fulfill a number of physical laws exactly, from geometric symmetries to energy conservation. Evidence is mounting that relaxing some of these constraints can be beneficial to the efficiency and (somewhat surprisingly) accuracy of MLIPs, even though care should be taken to avoid qualitative failures associated with the breaking of physical symmetries. Given the recent trend of scaling up models to larger numbers of parameters and training samples, a very important question is how unconstrained MLIPs behave in this limit. Here we investigate this issue, showing that -- when trained on large datasets -- unconstrained models can be superior in accuracy and speed when compared to physically constrained models. We assess these models both in terms of benchmark accuracy and in terms of usability in practical scenarios, focusing on static simulation workflows such as geometry optimization and lattice dynamics. We conclude that accurate unconstrained models can be applied with confidence, especially since simple inference-time modifications can be used to recover observables that are consistent with the relevant physical symmetries.2026-01-22T18:46:58Z21 pages, 8 figuresFilippo BigiPaolo PegoloArslan MazitovJonathan SchmidtMichele Ceriottihttp://arxiv.org/abs/2606.05942v1EML-CD: Causal Mechanism Recovery via EML Symbolic Trees in Structure Learning2026-06-04T09:45:42ZNeural network (NN)-based nonlinear causal discovery methods recover DAG structure but leave each causal mechanism as a black box. Waxman et al. argued that extracting causal mechanisms from NN weights is ill-posed. We propose EML-CD, a framework that integrates the EML operator (capable of composing elementary functions from a single binary operator) into causal structure learning, with interpretable mechanism recovery as the primary objective. EML-CD represents each edge mechanism as a gated EML binary tree and automatically discovers closed-form causal equations. Analytical Jacobians can be directly computed from the output equations, enabling quantitative understanding of causal effects. On real data (Sachs protein signaling, d=11), EML-CD achieves SHD=11.2 +/- 0.4 (5-seed mean; baselines are single deterministic runs), on par with PC/GES within seed variance and below CAM, while attaching closed-form equations to each detected edge (precision 0.756, recall 0.365). In a controlled bivariate test with known mechanisms, EML-CD recovers 10 of 11 elementary function families faithfully (held-out shape correlation >= 0.96; only high-frequency sine is partial). On a symbolic synthetic benchmark, EML-CD attains a substantially lower and more stable held-out mechanism f-MSE than a fixed SINDy dictionary (mean 3.67 vs. 7644, the latter inflated by catastrophic extrapolation on one seed), although its structure recovery (SHD 14.0) only matches the dictionary and stays below specialized optimizers; on the Causal Chambers light-tunnel subset, a depth-2 model improves F1 over linear OLS-BIC (0.444 vs. 0.273).2026-06-04T09:45:42ZSota Asanumahttp://arxiv.org/abs/2509.20345v3General Synthetic-Powered Inference2026-06-04T08:42:19ZThe rapid proliferation of high-quality synthetic data -- generated by advanced AI models or collected as auxiliary data from related tasks -- presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around a broad class of statistical inference procedures to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard method using only real data when synthetic data are of low quality. The error rate of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction, and comparing large reasoning models on complex math problems.2025-09-24T17:37:14ZMeshi BashariYonghoon LeeRoy Maor LotanEdgar DobribanYaniv Romanohttp://arxiv.org/abs/2311.07565v3Exploration via linearly perturbed loss minimisation2026-06-04T08:26:53ZWe introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.2023-11-13T18:54:43ZUpdated with erratum note: Appendix I contains a gap in the proof; all main-paper claims remain valid via the corrected argument of Perneczky, Abeille & Janz (2026, arXiv:2606.00431)David JanzShuai LiuAlex AyoubCsaba Szepesvárihttp://arxiv.org/abs/2412.11800v4Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data2026-06-04T08:23:17ZExtracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a broader set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) comes with a significant computational burden that restrains the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data -- the meaning of state transition and data sparsity -- challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD), addressing the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. The AnomalyCD presents several strategies, such as anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment approaches. We validate the performance of the approach on two datasets: monitoring sensor data from the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset from an information technology monitoring system. The results on temporal GCMs demonstrate a considerable reduction of computation overhead and a moderate enhancement of accuracy on the binary anomaly datasets. Code: https://github.com/muleina/AnomalyCD .2024-12-16T14:11:28Z26 pages, 17 figures, 8 tables, published version at EPJ-C: Computing, Software and Data ScienceEur. Phys. J. C, 86, 585 (2026)Mulugeta Weldezgina AsresChristian Walter OmlinThe CMS-HCAL Collaboration10.1140/epjc/s10052-026-15611-5http://arxiv.org/abs/2412.04605v4Semiparametric Bayesian Difference-in-Differences2026-06-04T08:05:29ZThis paper studies semiparametric Bayesian inference for the average treatment effect on the treated (ATT) within the difference-in-differences (DiD) research design. We propose two new Bayesian methods with frequentist validity. The first one is the semiparametric Bayesian outcome regression, where we place a Gaussian process prior on the conditional mean function of the control group. The second method is a doubly robust Bayesian procedure that adjusts the prior distribution of the conditional mean function and subsequently corrects the posterior distribution of the resulting ATT. We prove new semiparametric Bernstein-von Mises (BvM) theorems for both proposals. Monte Carlo simulations and an empirical application demonstrate that the proposed Bayesian DiD methods exhibit strong finite-sample performance. We also present extensions of the canonical DiD approach, incorporating clustered data and staggered entry with multiple periods.2024-12-05T20:41:36ZChristoph BreunigRuixuan LiuZhengfei Yuhttp://arxiv.org/abs/2606.05779v1TinyML-Driven Cybersecurity for Autonomous Spacecraft: Latency-Accuracy Analysis for SPARTA RF and Cyber Threat Detection2026-06-04T07:08:28ZAutonomous spacecraft require rapid, lightweight, and reliable onboard detection of cyber-RF threats. Using the SPARTA attack model, we analyze the latency-accuracy trade-offs of TinyML-compatible classical models -- Random Forest, Logistic Regression, SVM, and MLP -- for detecting uplink jamming, Fake-NR spoofing, payload manipulation, ground-segment compromise, and unauthorized command injection. We present a physics-informed theoretical analysis of each model's computational complexity, VC dimension, Lipschitz continuity, and latency scaling, supported by empirical measurements on adversarial RF spectrograms generated via BandErasure, FakeNR, and NoiseBurst corruption modes. Results show that Logistic Regression achieves microsecond-level inference with only a 1\% accuracy drop relative to Random Forest, making it an effective TinyML baseline for onboard autonomy. The study also identifies opportunities for advancing spacecraft cybersecurity through richer feature encoders and multi-timescale learning architectures, building on recent progress in edge intelligence and trustworthy AI.2026-06-04T07:08:28ZTwenty Fifth International Conference on Security & Management (SAM'26)Van LeTrevor TranTan Lehttp://arxiv.org/abs/2511.16111v2Rotation-Parameterized Graph Fractional Fourier Transform: Definition, Properties, and Optimal Filtering2026-06-04T07:02:23ZGraph spectral representations are fundamental in graph signal processing, providing a rigorous frameworkforanalyzing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the graph Fourier transform (GFT) through a fractional-order parameter, enabling flexible spectral analysis with mathematical consistency. The angular graph Fourier transform (AGFT) further introduces angular control by rotating GFT eigenvectors; however, existing constructions may fail to reduce exactly to the GFT at zero angle, weakening theoretical consistency and interpretability. To address these complementary limitations, namely the lack of rotation-based basis control in GFRFT and the defective zero-angle degeneracy of AGFT, this paper proposes the rotation-parameterized graph fractional Fourier transform (RP-GFRFT), which unifies fractional order and rotation-parameterized spectral analysis. A degeneracy preserving rotation matrix family is constructed to guarantee exact GFT reduction at zero angle. TwoRP-GFRFTvariants,I-RP-GFRFTandII-RP-GFRFT,arethenformulated, with theoretical analyses confirming their unitarity, invertibility, reduction behavior, and smooth parameter dependence. The fractional order and rotation angle are jointly optimized for adaptive graph spectral filtering. Experiments on real-world signals, images, and point clouds demonstrate that RP-GFRFT improves denoising accuracy, reconstruction quality, and feature preservation over GFRFT, AGFT, and representative filtering baselines.2025-11-20T07:13:27ZFeiyue ZhaoMingzhi WangYangfan HeZhichao Zhanghttp://arxiv.org/abs/2606.05733v1Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs2026-06-04T05:48:56ZPer-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. To address this limitation, we introduce a heterogeneous Rust-Python streaming architecture that maps cross-company attention as a continuous-time graph driven directly from text. We show that on the ingestion side, a zero-copy Rust edge parses news records in $\sim$100 ns and scans the target equity universe in $\sim$1.2 $μ$s. On the inference end, a multivariate Neural Hawkes Process featuring per-node continuous-time LSTM states and a bilinear latent projection propagates directed excitation, while an adaptive pruning rule bounds the computational cost of dynamic neighborhood updates. Combining these stages, we demonstrate an end-to-end processing latency of $\sim$13 ms per incoming news record on a single commodity CPU. Evaluated on a one-month temporal holdout of the FNSPID corpus (638 articles across 47 tickers), the system delivers a $1.70\times$ precision lift over random at the 90th-percentile next-day return threshold, and $3.36\times$ over a same-sector baseline. Crucially, removing the graph topology collapses precision to zero, confirming that the dynamic attention network is the sole driver of cross-company signal in this architecture.2026-06-04T05:48:56ZAccepted to the 2026 ACM SIGMOD Workshop on Data Management for the Modern Financial Systems (FinDS). 10 pages, 4 figuresKabir Murjani