https://arxiv.org/api/OieUENMWqCu+21I4e/O078ADO0Q2026-06-21T09:42:06Z7851164515http://arxiv.org/abs/2410.15761v6Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees2026-06-03T09:17:37ZLarge Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.2024-10-21T08:21:00Z25 pages, 17 main paperYannis MontreuilShu Heng YeoAxel CarlierLai Xing NgWei Tsang Ooihttp://arxiv.org/abs/2601.07144v3Optimal Transport under Group Fairness Constraints2026-06-03T09:10:56ZEnsuring fairness in matching algorithms is a key challenge in allocating scarce resources and positions. Focusing on Optimal Transport (OT), we introduce a novel notion of group fairness requiring that the probability of matching two individuals from any two given groups in the OT plan satisfies a predefined target. We first propose a modified Sinkhorn algorithm to compute perfectly fair transport plans efficiently. Since exact fairness can significantly degrade matching quality in practice, we then develop two relaxation strategies. The first one involves solving a penalized OT problem, for which we derive novel finite-sample complexity guarantees. 
Our second strategy leverages bilevel optimization to learn a ground cost that induces a fair OT solution, and we establish a bound on the deviation of fairness when matching unseen data. Finally, we present empirical results illustrating the performance of our approaches and the trade-off between fairness and transport cost.2026-01-12T02:26:32ZAccepted at ICML 2026 (spotlight)Linus BleisteinMathieu DagréouFrancisco AndradeThomas BoudouAurélien Bellethttp://arxiv.org/abs/2503.18721v3Differentially Private Joint Independence Test2026-06-03T09:01:24ZIdentification of joint dependence among several random vectors plays an important role in many statistical applications, where the data may contain sensitive or confidential information. In this paper, we consider the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) in the context of differential privacy. Given that the limiting distribution of the empirical estimate of dHSIC is a complicated Gaussian chaos, constructing tests in the non-private regime is typically based on permutation and bootstrap methods. To detect joint dependence under privacy constraints, we propose a dHSIC-based testing procedure employing a differentially private permutation methodology. We show that our method enjoys privacy guarantees, a valid level, and pointwise consistency, whereas the bootstrap counterpart suffers from inconsistent power. We further investigate the uniform power of the proposed test under the dHSIC and $L_2$ metrics, showing that the proposed test attains the minimax optimal power across different privacy regimes. As a byproduct, we show that the non-private permutation dHSIC test proposed in Pfister et al. (2018) is a special case of our differentially private permutation test, and our results also establish its pointwise and uniform power--thus resolving an open problem from that work. 
Both numerical simulations and real data analysis in causal inference suggest that our proposed test performs well empirically.2025-03-24T14:32:05Z57 pages, 7 figuresXingwei LiuYuexin ChenJin-Ting ZhangWangli Xuhttp://arxiv.org/abs/2606.04603v1Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval2026-06-03T08:41:27ZApproximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to ``relevance'' -- ignoring the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content.
We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure.
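The two-sided sampling described above can be illustrated with a minimal sketch. It is not the author's implementation: the diagonal Gaussian uncertainty model, the inner-product score, the exact brute-force scan standing in for an ANN index, and the names `build_sampled_index` and `retrieve` are all assumptions made for illustration.

```python
import numpy as np


def build_sampled_index(means, stds, samples_per_item, rng):
    """Draw samples_per_item[i] embeddings for item i (assumed Gaussian).

    Returns the stacked sample vectors and, for each row, the id of the
    item it was drawn from.
    """
    vectors, owners = [], []
    for item_id, (mu, sd, s) in enumerate(zip(means, stds, samples_per_item)):
        draws = mu + sd * rng.standard_normal((s, mu.shape[0]))
        vectors.append(draws)
        owners.append(np.full(s, item_id))
    return np.vstack(vectors), np.concatenate(owners)


def retrieve(user_mean, user_std, vectors, owners, k, rng):
    """Sample one user embedding and return up to k distinct item ids.

    Exact inner-product search is used here in place of an ANN index
    (an assumption made to keep the sketch self-contained).
    """
    query = user_mean + user_std * rng.standard_normal(user_mean.shape[0])
    scores = vectors @ query
    result, seen = [], set()
    for idx in np.argsort(-scores):
        item = int(owners[idx])
        if item not in seen:  # several samples may belong to the same item
            seen.add(item)
            result.append(item)
            if len(result) == k:
                break
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = rng.standard_normal((5, 4))      # hypothetical item embeddings
    stds = np.full((5, 4), 0.3)              # hypothetical per-dimension uncertainty
    vectors, owners = build_sampled_index(means, stds, [3, 3, 3, 3, 3], rng)
    print(retrieve(means[0], np.full(4, 0.3), vectors, owners, k=2, rng=rng))
```

In this sketch, setting every standard deviation to zero makes all samples of an item coincide with its mean, so retrieval reduces to ordinary point-estimate top-k search.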
On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.2026-06-03T08:41:27ZOlivier Jeunenhttp://arxiv.org/abs/2606.04576v1ReSGA: A Large Tail Risk Model for Learning Value-at-Risk and Expected Shortfall2026-06-03T08:11:45ZLearning Value-at-Risk (VaR) and Expected Shortfall (ES) is important for managing financial risks effectively. Existing approaches with limited parameters are vulnerable to model misspecification in the era of big data. To address this limitation, we propose a large tail risk model, the retrieval-enhanced self-grouping autoencoder (ReSGA), which is designed with millions of parameters to exploit the rich cross-sectional dependence and long-term temporal dynamics of assets using their characteristics. Applied to monthly US equity returns from 1926 to 2023 with 153 firm characteristics, ReSGA outperforms twelve econometric and machine learning competitors in terms of out-of-sample loss and statistical backtesting. In addition, its forecast advantages can translate into significant economic gains from long-short decile portfolios that are constructed by a new size-enhanced left-side momentum strategy. To clarify the role of complexity, we further conduct a systematic scaling analysis and demonstrate that improvements in joint VaR-ES forecasting are primarily driven by data complexity rather than model complexity. 
Finally, our analyses of group-importance and transfer-learning exhibit the interpretability and cross-market generalizability of ReSGA.2026-06-03T08:11:45ZYichi ZhangKe ZhuZhoufan Zhuhttp://arxiv.org/abs/2606.04574v1Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning2026-06-03T08:10:33ZThis study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical "Filter-then-Rank" pair selection methodology and a proprietary "Fixed Risk, Adaptive Mean" execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies. 
Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.2026-06-03T08:10:33Z61 pages, 37 figures, 16 tablesDamian LebiedźRobert Ślepaczukhttp://arxiv.org/abs/2606.09885v1TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts2026-06-03T07:49:31ZMixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address these limitations, we propose TENP, a structured Trapezoidal Expert Neuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, reserving model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. 
Moreover, it outperforms the full-parameter model by 10% on code generation tasks.2026-06-03T07:49:31ZJiangyang HeShaolin ZhuDeyi Xionghttp://arxiv.org/abs/2606.05242v1Deterministic Envelopes for Tamed SGLD: Decoupling Stochastic-Gradient Noise and Localizing Taming2026-06-03T07:23:16ZStochastic-gradient Langevin algorithms often use tamed denominators to stabilize non-globally Lipschitz drifts. This paper shows that when the denominator depends on the same stochastic-gradient realization as the numerator, the taming step changes the stochastic oracle itself and can create a stationary bias even if the original stochastic gradient is unbiased. We propose a structure-preserving framework for designing tamed denominators. It fixes the denominator before the oracle noise is sampled and uses localized deterministic envelopes to avoid unnecessary taming in typical regions. These kernels keep the stabilizing effect of taming while avoiding the bias introduced by a gradient-dependent denominator. Our theory explains how the stationary error splits into the bias caused by oracle-dependent taming and the remaining error introduced by deterministic stabilization. Within this deterministic-envelope family, the analysis identifies a far-tail condition that explains the limitation of local soft envelopes and motivates a hybrid member: soft in the typical region, but protected by hard-tail control on rare excursions. Experiments confirm the predicted stationary distortions of random denominators, the bias reduction of deterministic-envelope designs, and the stabilizing effect of the hybrid construction.2026-06-03T07:23:16Z40 pages, 11 tables, 2 figuresYiwei ZhouZiheng Chenhttp://arxiv.org/abs/2606.04486v1Global Sketch-Based Watermarking for Diffusion Language Models2026-06-03T06:08:58ZWatermarking methods for language models have been studied extensively in the autoregressive setting, where tokens are generated sequentially. 
These works largely focus on local-context schemes that perturb the next token's distribution as a function of its preceding tokens. In diffusion language models, distributions over many unresolved positions are jointly sampled, allowing additive statistics of the entire sequence to be tractable during generation. We propose a watermark for masked diffusion language models that controls a global, vector-valued sketch representation of the text. Compared to context-dependent watermarking, the sketch formulation decouples detection from the local contexts seen during generation, resulting in an order-agnostic statistic and a watermarking rule which does not manifest as a simple token bias. We analyze the distortion, soundness, and robustness properties of the method.2026-06-03T06:08:58ZDaniel Zhaohttp://arxiv.org/abs/2606.05239v1HyFAD: Hybrid Time-Frequency Diffusion with Frequency-Aware Embedding for Time Series Imputation2026-06-03T06:03:37ZDiffusion models have demonstrated strong performance in time series modeling due to their ability to progressively capture complex data distributions through iterative denoising. However, existing approaches struggle with frequency-sensitive denoising, high-frequency reconstruction and balancing global trends with local dynamics. To address these limitations, we propose \textbf{HyFAD}, a \textbf{Hy}brid time-frequency \textbf{D}iffusion model with \textbf{F}requency-\textbf{A}ware embedding for time series imputation. Built upon the DDPM paradigm, HyFAD adopts a coupled time-frequency diffusion framework, in which the reverse denoising proceeds sequentially from the time domain to the frequency domain, enabling coarse-to-fine generation. Specifically, the time-domain diffusion process captures low-frequency global trends, while the frequency-domain diffusion process refines high-frequency spectral components. 
We further introduce a frequency-aware step embedding that exploits the relationship between diffusion steps and spectral components, providing step-dependent spectral guidance and facilitating more accurate band-wise reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that HyFAD achieves state-of-the-art performance. Our source code is available at https://github.com/hongfangao/HyFAD.2026-06-03T06:03:37ZHongfan GaoWangmeng ShenBin YangJilin Huhttp://arxiv.org/abs/2606.04476v1When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks2026-06-03T05:44:30ZIn this paper, we study the gradient descent dynamics for jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function. Concretely, we consider a realizable setting where inputs are drawn i.i.d. from a Gaussian distribution and labels follow a planted linear model. This stylized framework captures salient features of end-to-end training in inverse problems and certain auto-encoder models. Despite its apparent simplicity, the dynamics remain poorly understood, in part because the loss landscape contains multiple non-strict saddle points, making it unclear why gradient descent from random initialization reliably escapes bad stationary regions. We provide a detailed characterization of the optimization landscape and prove that gradient descent from a moderately small random initialization, simultaneously training both layers, converges to a global minimizer at a linear rate with order-wise optimal sample complexity. 
Our analysis tracks the trajectory through three phases: an alignment phase in which hidden weights progressively align with the planted direction while the output weights maintain the correct sign pattern; a growth phase in which the norms of both layers increase while preserving alignment; and a local refinement phase in which the aligned neurons rapidly converge to the planted direction, yielding fast local convergence. To rigorously show that GD avoids non-strict saddles, we develop trajectory-level control arguments for the end-to-end dynamics. In addition, we establish novel uniform concentration results that hold along the entire trajectory, and are essential for obtaining order-wise optimal sample complexity. We corroborate our theory with extensive experiments across a range of configurations.2026-06-03T05:44:30Z47 pages, 8 figures, published at the 39th Annual Conference on Learning Theory (COLT), 2026Berk TinazChangzhi XieMahdi Soltanolkotabihttp://arxiv.org/abs/2411.03383v4Near-Optimal and Tractable Estimation under Shift-Invariance2026-06-03T05:06:35ZHow hard is it to estimate a discrete-time signal $(x_{1}, ..., x_{n}) \in \mathbb{C}^n$ satisfying an unknown linear recurrence relation of order $s$ and observed in i.i.d. complex Gaussian noise? The class of all such signals is parametric but extremely rich: it contains all exponential polynomials over $\mathbb{C}$ with total degree $s$, including harmonic oscillations with $s$ arbitrary frequencies. Geometrically, this class corresponds to the projection onto $\mathbb{C}^{n}$ of the union of all shift-invariant subspaces of $\smash{\mathbb{C}^\mathbb{Z}}$ of dimension $s$. 
We show that the statistical complexity of this class, as measured by the squared minimax radius of the $(1-\delta)$-confidence $\ell_2$-ball, is nearly the same as for the class of $s$-sparse signals, namely $\smash{O\left(s\log(en) + \log(\delta^{-1})\right) \cdot \log^2(es) \cdot \log(en/s).}$ Moreover, the corresponding near-minimax estimator is tractable, and it can be used to build a test statistic with a near-minimax detection threshold in the associated detection problem. These statistical results rely upon a simple analytic observation: the interpretation of the Fourier coefficients of the Christoffel function of any shift-invariant subspace of $\smash{\mathbb{C}^\mathbb{Z}}$ as a reproducing filter with the smallest possible spectrum in all $\ell_p$-norms, $p \in [1,\infty]$, at once.2024-11-05T18:11:23Z28 pages. In the previous version (v2), our construction of the reproducing filter was erroneous. It is now replaced with an alternative construction using the Christoffel function. The only change from v3 is a typesetting correction in the abstractDmitrii M. Ostrovskiihttp://arxiv.org/abs/2506.01250v3Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration2026-06-03T04:29:50ZWe introduce the first variance-aware algorithms for contextual dueling bandits that leverage shallow exploration strategies with neural networks for nonlinear utility approximation. A key theoretical challenge is the absence of a closed-form estimator, which led prior work to require an extremely large network width $m$ (i.e., $m = \widetilde{\Omega}(T^{14})$). We address this constraint with a novel analytical approach that combines iterative self-improvement with spectral analysis. Our analysis significantly reduces the network width requirement to $m = \widetilde{\Omega}(T^{6})$, and shows that our algorithms achieve a sublinear regret of $\widetilde{\mathcal{O}}(d\sqrt{\sum_{t=1}^{T} \sigma_t^2} + \sqrt{dT})$ under both UCB and TS frameworks. 
Empirical results show that the proposed algorithms not only are computationally efficient and exhibit sublinear regret in practical settings, but also achieve state-of-the-art performance on both synthetic and real-world tasks.2025-06-02T01:58:48ZAccepted at AISTATS 2026; code at https://github.com/youngmin0oh/NVLDB-AISTATS2026Youngmin OhJinje ParkTaejin Paikhttp://arxiv.org/abs/2606.04429v1Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks2026-06-03T04:20:03ZA common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that "flat interpolators generalize well" (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that, using symmetry of the network that can change flatness while keeping the population and empirical losses unchanged, any interpolator can be made sharper or flatter. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with $2$-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the "flattest" interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be made closer to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. 
This establishes a direct link between flatness and generalization which applies to a large class of activations and realistic data distributions.2026-06-03T04:20:03ZHarsh VardhanHossein TaheriArya Mazumdarhttp://arxiv.org/abs/2606.04423v1The price of multi-group transductive learning2026-06-03T04:07:24ZWe show every multi-group learner in the transductive setting may incur a multiplicative penalty in its error rate on some group relative to the error rate achievable in the single-group setting, and the penalty can increase linearly with the number of groups, up to roughly the square root of the sample size. This stands in stark contrast to optimal multi-group learners in an analogous (group-realizable) statistical setting, where the penalty is always at most logarithmic in the sample size and independent of the number of groups.2026-06-03T04:07:24ZNoah BergamSamuel DengDaniel Hsu