https://arxiv.org/api/rAyKnsTIjagM6yqY2XztpSRg/sk
Updated: 2026-03-26T13:11:57Z
765437515

http://arxiv.org/abs/2309.07250v2
All you need is spin: SU(2) equivariant variational quantum circuits based on spin networks
Updated: 2026-03-24T02:02:08Z
Variational algorithms require architectures that naturally constrain the optimization space to run efficiently. Geometric quantum machine learning achieves this goal by encoding group structure into parameterized quantum circuits to include the symmetries of a problem as an inductive bias. However, constructing such circuits is challenging as a concrete guiding principle has yet to emerge. In this paper, we propose the use of spin networks, a form of directed tensor network invariant under a group transformation, to devise SU(2) equivariant quantum circuit ansätze, i.e., circuits possessing spin-rotation symmetry. By changing to the basis that block diagonalizes the SU(2) group action, these networks provide a natural building block for constructing parameterized equivariant quantum circuits. We prove that our construction is mathematically equivalent to other known constructions, such as those based on twirling and generalized permutations, but more direct to implement on quantum hardware. The efficacy of our constructed circuits is tested by solving the ground state problem of SU(2) symmetric Heisenberg models on the one-dimensional triangular lattice and the Kagome lattice. Our results highlight that our equivariant circuits boost the performance of quantum variational algorithms, indicating broader applicability to other real-world problems.
Published: 2023-09-13T18:38:41Z
Comment: 19 + 7 pages, close to a version accepted to Quantum Science and Technology
Journal reference: Quantum Sci. Technol. 11 025025 (2026)
Authors: Richard D. P. East, Guillermo Alonso-Linaje, Chae-Yeun Park
DOI: 10.1088/2058-9565/ae4cff

http://arxiv.org/abs/2603.20655v2
Exponential Family Discriminant Analysis: Generalizing LDA-Style Generative Classification to Non-Gaussian Models
Updated: 2026-03-24T00:40:22Z
We introduce Exponential Family Discriminant Analysis (EFDA), a unified generative framework that extends classical Linear Discriminant Analysis (LDA) beyond the Gaussian setting to any member of the exponential family. Under the assumption that each class-conditional density belongs to a common exponential family, EFDA derives closed-form maximum-likelihood estimators for all natural parameters and yields a decision rule that is linear in the sufficient statistic, recovering LDA as a special case and capturing nonlinear decision boundaries in the original feature space. We prove that EFDA is asymptotically calibrated and statistically efficient under correct specification, and we generalise it to $K \geq 2$ classes and multivariate data. Through extensive simulation across five exponential-family distributions (Weibull, Gamma, Exponential, Poisson, Negative Binomial), EFDA matches the classification accuracy of LDA, QDA, and logistic regression while reducing Expected Calibration Error (ECE) by $2$-$6\times$, a gap that is structural: it persists for all $n$ and across all class-imbalance levels, because misspecified models remain asymptotically miscalibrated. We further prove and empirically confirm that EFDA's log-odds estimator approaches the Cramér-Rao bound under correct specification, and is the only estimator in our comparison whose mean squared error converges to zero. Complete derivations are provided for nine distributions. Finally, we formally verify all four theoretical propositions in Lean 4, using Aristotle (Harmonic) and OpenGauss (Math, Inc.)
as proof generators, with all outputs independently machine-checked by AXLE (Axiom).
Published: 2026-03-21T05:24:56Z
Comment: Preprint, 15 pages, 5 figures
Authors: Anish Lakkapragada

http://arxiv.org/abs/2603.22644v1
Overfitting and Generalizing with (PAC) Bayesian Prediction in Noisy Binary Classification
Updated: 2026-03-23T23:43:52Z
We consider a PAC-Bayes type learning rule for binary classification, balancing the training error of a randomized "posterior" predictor with its KL divergence to a pre-specified "prior". This can be seen as an extension of a modified two-part-code Minimum Description Length (MDL) learning rule to continuous priors and randomized predictions. With a balancing parameter of $λ=1$ this learning rule recovers an (empirical) Bayes posterior and a modified variant recovers the profile posterior, linking with standard Bayesian prediction (up to the treatment of the single-parameter noise level). However, from a risk-minimization prediction perspective, this Bayesian predictor overfits and can lead to non-vanishing excess loss in the agnostic case. Instead, a choice of $λ\gg 1$, which can be seen as using a sample-size-dependent prior, ensures uniformly vanishing excess loss even in the agnostic case. We precisely characterize the effect of under-regularizing (and over-regularizing) as a function of the balance parameter $λ$, understanding the regimes in which this under-regularization is tempered or catastrophic. This work extends previous work by Zhu and Srebro [2025] that considered only discrete priors to PAC-Bayes type learning rules and, through their rigorous Bayesian interpretation, to Bayesian prediction more generally.
Published: 2026-03-23T23:43:52Z
Authors: Xiaohan Zhu, Mesrob I. Ohannessian, Nathan Srebro

http://arxiv.org/abs/2202.05775v3
Inference of Multiscale Gaussian Graphical Model
Updated: 2026-03-23T20:48:52Z
Gaussian Graphical Models (GGMs) are widely used in high-dimensional data analysis to synthesize the interaction between variables.
In many applications, such as genomics or image analysis, graphical models rely on sparsity and clustering to reduce dimensionality and improve performance. This paper explores a slightly different paradigm where clustering is not knowledge-driven but performed simultaneously with the graph inference task. We introduce a novel Multiscale Graphical Lasso (MGLasso) to improve network interpretability by proposing graphs at different granularity levels. The method estimates clusters through a convex clustering approach, a relaxation of k-means and hierarchical clustering. The conditional independence graph is simultaneously inferred through a neighborhood selection scheme for undirected graphical models. MGLasso extends and generalizes the sparse group fused lasso problem to undirected graphical models. We use continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA) to propose a regularization path of solutions along the group fused Lasso penalty, while the Lasso penalty is kept constant. Extensive experiments on synthetic data compare the performance of our model to state-of-the-art clustering methods and network inference models. Applications to gut microbiome data and poplar's methylation mixed with transcriptomic data are presented.
Published: 2022-02-11T17:11:20Z
Comment: 31 pages
Journal reference: Computo, 2023
Authors: Do Edmond Sanou, Christophe Ambroise, Geneviève Robin
DOI: 10.57750/1f4p-7955

http://arxiv.org/abs/2603.22563v1
Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
Updated: 2026-03-23T20:45:17Z
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback.
We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
Published: 2026-03-23T20:45:17Z
Authors: Young Hyun Cho, Will Wei Sun

http://arxiv.org/abs/2512.04165v4
Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
Updated: 2026-03-23T20:25:04Z
Two pressing topics in the theory of deep learning are the interpretation of feature learning (FL) mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich FL often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of FL emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results.
In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principles theories of deep learning.
Published: 2025-12-03T19:00:03Z
Authors: Noa Rubin, Orit Davidovich, Zohar Ringel

http://arxiv.org/abs/2509.25802v2
Graph Distribution-valued Signals: A Wasserstein Space Perspective
Updated: 2026-03-23T19:14:50Z
We introduce a novel framework for graph signal processing (GSP) that models signals as graph distribution-valued signals (GDSs), which are probability distributions in the Wasserstein space. This approach overcomes key limitations of classical vector-based GSP, including the assumption of synchronous observations over vertices, the inability to capture uncertainty, and the requirement for strict correspondence in graph filtering. By representing signals as distributions, GDSs naturally encode uncertainty and stochasticity, while strictly generalizing traditional graph signals. We establish a systematic dictionary mapping core GSP concepts to their GDS counterparts, demonstrating that classical definitions are recovered as special cases. The effectiveness of the framework is validated through graph filter learning for prediction tasks, supported by experimental results.
Published: 2025-09-30T05:21:18Z
Comment: Accepted by IEEE ICASSP 2026
Authors: Yanan Zhao, Feng Ji, Xingchao Jian, Wee Peng Tay

http://arxiv.org/abs/2603.22468v1
SPDE Methods for Nonparametric Bayesian Posterior Contraction and Laplace Approximation
Updated: 2026-03-23T18:36:36Z
We derive posterior contraction rates (PCRs) and finite-sample Bernstein-von Mises (BvM) results for nonparametric Bayesian models by extending the diffusion-based framework of Mou et al. (2024) to the infinite-dimensional setting.
The posterior is represented as the invariant measure of a Langevin stochastic partial differential equation (SPDE) on a separable Hilbert space, which allows us to control posterior moments and obtain non-asymptotic concentration rates in Hilbert norms under various likelihood curvature and regularity conditions. We also establish a quantitative Laplace approximation for the posterior. The theory is illustrated in a nonparametric linear Gaussian inverse problem.
Published: 2026-03-23T18:36:36Z
Comment: 32 pages, under review
Authors: Enric Alberola-Boloix, Ioar Casado-Telletxea

http://arxiv.org/abs/2603.22465v1
A Theoretical Framework for Energy-Aware Gradient Pruning in Federated Learning
Updated: 2026-03-23T18:31:07Z
Federated Learning (FL) is constrained by the communication and energy limitations of decentralized edge devices. While gradient sparsification via Top-K magnitude pruning effectively reduces the communication payload, it remains inherently energy-agnostic. It assumes all parameter updates incur identical downstream transmission and memory-update costs, ignoring hardware realities. We formalize the pruning process as an energy-constrained projection problem that accounts for the hardware-level disparities between memory-intensive and compute-efficient operations during the post-backpropagation phase. We propose Cost-Weighted Magnitude Pruning (CWMP), a selection rule that prioritizes parameter updates based on their magnitude relative to their physical cost. We demonstrate that CWMP is the optimal greedy solution to this constrained projection and provide a probabilistic analysis of its global energy efficiency. Numerical results on a non-IID CIFAR-10 benchmark show that CWMP consistently establishes a superior performance-energy Pareto frontier compared to the Top-K baseline.
Published: 2026-03-23T18:31:07Z
Comment: 8 pages, 2 figures. This work has been submitted to the IEEE for possible publication
Authors: Emmanouil M. Athanasakos

http://arxiv.org/abs/2509.15197v2
Consistent Bayesian causal discovery for structural equation models with equal error variances
Updated: 2026-03-23T18:09:26Z
We consider the problem of recovering the true causal structure among a set of variables, generated by a linear acyclic structural equation model (SEM) with the error terms being independent, not necessarily Gaussian, and having equal variances. It is well-known that the true underlying directed acyclic graph (DAG) encoding the causal structure is uniquely identifiable under this assumption. Interestingly, in this setting, it further holds that the sum of minimum expected squared errors for every variable, while predicted by the best linear combination of its parent variables, is minimised if and only if the causal structure is represented by any supergraph of the true DAG. In this work, we propose a Bayesian DAG selection method, where the working model assumes Gaussian SEM with equal error variances, and employ independent g-priors on each set of SEM coefficients. Furthermore, we utilise the aforementioned key property to establish that the proposed method recovers the true graph consistently without any additional distributional assumption, and illustrate it with a simulation study.
Published: 2025-09-18T17:52:26Z
Authors: Anamitra Chaudhuri, Yang Ni, Anirban Bhattacharya

http://arxiv.org/abs/2603.22276v1
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Updated: 2026-03-23T17:57:24Z
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved.
We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice.
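The factored norm described in this paragraph can be illustrated with a minimal sketch. It assumes the shapes W: [d_out, d_in], B: [d_out, r], A: [r, d_in] and uses the identity ||W_i + s B_i A||^2 = ||W_i||^2 + 2s B_i · (W A^T)_i + s^2 B_i (A A^T) B_i^T for each row i. The function names and the plain-Python list representation are illustrative assumptions, not the authors' code; the fused Triton kernels and the numerically stable form mentioned in the abstract are not reproduced here.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product of X [n, k] and Y [k, m]."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def dense_row_norms(W, B, A, s):
    """Reference path: materializes the dense [d_out, d_in] product BA."""
    BA = matmul(B, A)
    return [math.sqrt(sum((w + s * d) ** 2 for w, d in zip(w_row, d_row)))
            for w_row, d_row in zip(W, BA)]

def factored_row_norms(W, B, A, s):
    """Row norms of W + s*BA using only [d_out, r] and [r, r] intermediates."""
    At = transpose(A)        # [d_in, r]
    WAt = matmul(W, At)      # [d_out, r], feeds the cross term
    G = matmul(A, At)        # [r, r] Gram matrix A A^T
    BG = matmul(B, G)        # [d_out, r], feeds the Gram term
    norms = []
    for w_row, b_row, wat_row, bg_row in zip(W, B, WAt, BG):
        base = sum(w * w for w in w_row)                      # ||W_i||^2
        cross = sum(b * c for b, c in zip(b_row, wat_row))    # W_i . (B_i A)
        gram = sum(b * g for b, g in zip(b_row, bg_row))      # ||B_i A||^2
        # Clamp at zero as a guard against tiny negative rounding errors
        # (an assumption of this sketch, not taken from the paper).
        norms.append(math.sqrt(max(base + 2 * s * cross + s * s * gram, 0.0)))
    return norms

# Tiny hypothetical example with d_out = 2, d_in = 3, r = 1.
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, -1.0]]
s = 2.0
print(factored_row_norms(W, B, A, s))
print(dense_row_norms(W, B, A, s))
```

Both paths should agree up to floating-point rounding; the factored one never allocates a [d_out, d_in] array beyond W itself, which is the memory saving the abstract refers to.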
Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
Published: 2026-03-23T17:57:24Z
Comment: 30 pages, 15 figures, 15 tables, including appendices. Code and data at https://github.com/sockeye44/dorafactors
Authors: Alexandra Zelenin, Alexandra Zhuravlyova

http://arxiv.org/abs/2603.22248v1
Confidence-Based Decoding is Provably Efficient for Diffusion Language Models
Updated: 2026-03-23T17:43:21Z
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the decoding strategy, which determines the order and number of tokens generated at each iteration, critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited.
In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.
Published: 2026-03-23T17:43:21Z
Authors: Changxiao Cai, Gen Li

http://arxiv.org/abs/2603.22219v1
Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
Updated: 2026-03-23T17:14:11Z
Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations.
To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
Published: 2026-03-23T17:14:11Z
Authors: Qilin Wang

http://arxiv.org/abs/2603.22208v1
Identification of physiological shock in intensive care units via Bayesian regime switching models
Updated: 2026-03-23T17:03:07Z
Detection of occult hemorrhage (i.e., internal bleeding) in patients in intensive care units (ICUs) can pose significant challenges for critical care workers. Because blood loss may not always be clinically apparent, clinicians rely on monitoring vital signs for specific trends indicative of a hemorrhage event. The inherent difficulties of diagnosing such an event can lead to late intervention by clinicians, which has catastrophic consequences. Therefore, a methodology for early detection of hemorrhage has wide utility. We develop a Bayesian regime switching model (RSM) that analyzes trends in patients' vitals and labs to provide a probabilistic assessment of the underlying physiological state that a patient is in at any given time. This article is motivated by a comprehensive dataset we curated from Mayo Clinic of 33,924 real ICU patient encounters. Longitudinal response measurements are modeled as a vector autoregressive process conditional on all latent states up to the current time point, and the latent states follow a Markov process.
We present a novel Bayesian sampling routine to learn the posterior probability distribution of the latent physiological states, as well as develop an approach to account for pre-ICU-admission physiological changes. A simulation and real case study illustrate the effectiveness of our approach.
Published: 2026-03-23T17:03:07Z
Authors: Emmett B. Kendall, Jonathan P. Williams, Curtis B. Storlie, Misty A. Radosevich, Erica D. Wittwer, Matthew A. Warner

http://arxiv.org/abs/2602.10273v2
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
Updated: 2026-03-23T17:01:31Z
Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $π_α(y\mid x)\propto p_θ(y\mid x)^α$ ($α>1$), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis-Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature $τ=1/α$ is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned Rényi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$-$28\times$ to $1.4$-$3.3\times$ over baseline decoding.
The code is available at https://github.com/ArminAzizi98/Power-SMC.
Published: 2026-02-10T20:31:40Z
Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram
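
The particle scheme outlined in the Power-SMC abstract above can be sketched on a toy model. The sketch rests on several assumptions that are not taken from the paper: the intermediate targets are taken to be $p_θ(y_{1:t}\mid x)^α$, the proposal is the temperature-$1/α$ distribution $q(v) \propto p(v)^α$, resampling is multinomial and triggered by an effective-sample-size threshold, and `toy_next_token_probs` is a hypothetical three-token stand-in for a language model. Under these assumptions the incremental weight $p(v)^α / q(v)$ equals the normaliser $Z = \sum_v p(v)^α$ for every token $v$, so it depends on the prefix only. The exponent-bridging schedule is not included.

```python
import math
import random

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for p_theta(. | prefix) over a 3-token vocabulary."""
    if prefix and prefix[-1] == 0:
        return [0.7, 0.2, 0.1]
    return [0.2, 0.5, 0.3]

def power_proposal(probs, alpha):
    """Return q(v) = p(v)^alpha / Z (temperature 1/alpha) and the normaliser Z."""
    powered = [p ** alpha for p in probs]
    z = sum(powered)
    return [x / z for x in powered], z

def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably from log-weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)

def power_smc(num_particles, length, alpha, rng, ess_fraction=0.5):
    """Toy SMC targeting p(y)^alpha over sequences of a fixed length."""
    particles = [[] for _ in range(num_particles)]
    log_w = [0.0] * num_particles
    for _ in range(length):
        for i, prefix in enumerate(particles):
            q, z = power_proposal(toy_next_token_probs(prefix), alpha)
            token = rng.choices(range(len(q)), weights=q)[0]
            prefix.append(token)
            # Incremental weight p(token)^alpha / q(token) equals Z for any token.
            log_w[i] += math.log(z)
        if effective_sample_size(log_w) < ess_fraction * num_particles:
            # Multinomial resampling, then reset to uniform weights.
            m = max(log_w)
            weights = [math.exp(lw - m) for lw in log_w]
            idx = rng.choices(range(num_particles), weights=weights, k=num_particles)
            particles = [list(particles[j]) for j in idx]
            log_w = [0.0] * num_particles
    return particles, log_w

rng = random.Random(0)
particles, log_w = power_smc(num_particles=8, length=5, alpha=2.0, rng=rng)
for seq, lw in zip(particles, log_w):
    print(seq, round(lw, 3))
```

In this toy setting all particles advance in lockstep, which mirrors the batched decode described in the abstract; a real implementation would replace the hypothetical `toy_next_token_probs` with batched model logits.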