https://arxiv.org/api/1yfbTIk2Zb6ZDjAZcZIhxQd4Cpk
2026-06-14T13:18:22Z

http://arxiv.org/abs/2606.05247v1
DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables
2026-06-03T11:58:06Z
Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational overhead for large-scale nonlinear problems. Here, we propose DiffSlack, a differentiable projection layer for nonlinear inequality-constrained neural prediction. DiffSlack reformulates inequalities as equalities with learnable slack variables, which are predicted as part of the augmented network output and provide a data-driven warm start for damped Gauss-Newton projection. The projection layer maps raw predictions onto the augmented feasible manifold while preserving end-to-end differentiability. A two-stage curriculum further stabilizes training and improves constraint satisfaction. We evaluate DiffSlack on vehicle path planning with 200 nonlinear inequality constraints from collision avoidance, curvature limits, and waypoint spacing. Compared with existing learning-based baselines, DiffSlack achieves a higher planning success rate and stronger geometric constraint satisfaction under a comparable inference budget. Ablation studies further show that the hard projection layer reduces sensitivity to supervision quality. Closed-loop tracking in CARLA and real-world vehicle experiments confirms the executability of the generated trajectories. These results demonstrate that DiffSlack provides a practical and scalable approach to embedding hard inequality constraints into neural networks for engineering applications.
2026-06-03T11:58:06Z
Ziqian Wang, Chenxi Fang, Zhen Zhang

http://arxiv.org/abs/2504.12988v7
Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts
2026-06-03T09:17:46Z
Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.
2025-04-17T14:50:40Z
Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

http://arxiv.org/abs/2410.15761v6
Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees
2026-06-03T09:17:37Z
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical.
We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.
2024-10-21T08:21:00Z
25 pages, 17 main paper
Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

http://arxiv.org/abs/2601.07144v3
Optimal Transport under Group Fairness Constraints
2026-06-03T09:10:56Z
Ensuring fairness in matching algorithms is a key challenge in allocating scarce resources and positions. Focusing on Optimal Transport (OT), we introduce a novel notion of group fairness requiring that the probability of matching two individuals from any two given groups in the OT plan satisfies a predefined target. We first propose a modified Sinkhorn algorithm to compute perfectly fair transport plans efficiently. Since exact fairness can significantly degrade matching quality in practice, we then develop two relaxation strategies. The first one involves solving a penalized OT problem, for which we derive novel finite-sample complexity guarantees. Our second strategy leverages bilevel optimization to learn a ground cost that induces a fair OT solution, and we establish a bound on the deviation of fairness when matching unseen data.
Finally, we present empirical results illustrating the performance of our approaches and the trade-off between fairness and transport cost.
2026-01-12T02:26:32Z
Accepted at ICML 2026 (spotlight)
Linus Bleistein, Mathieu Dagréou, Francisco Andrade, Thomas Boudou, Aurélien Bellet

http://arxiv.org/abs/2503.18721v3
Differentially Private Joint Independence Test
2026-06-03T09:01:24Z
Identification of joint dependence among several random vectors plays an important role in many statistical applications, where the data may contain sensitive or confidential information. In this paper, we consider the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) in the context of differential privacy. Given that the limiting distribution of the empirical estimate of dHSIC is a complicated Gaussian chaos, constructing tests in the non-private regime is typically based on permutation and bootstrap methods. To detect joint dependence under privacy constraints, we propose a dHSIC-based testing procedure employing a differentially private permutation methodology. We show that our method enjoys privacy guarantees, a valid level, and pointwise consistency, whereas the bootstrap counterpart suffers from inconsistent power. We further investigate the uniform power of the proposed test under the dHSIC and $L_2$ metrics, showing that the proposed test attains the minimax optimal power across different privacy regimes. As a byproduct, we show that the non-private permutation dHSIC test proposed in Pfister et al. (2018) is a special case of our differentially private permutation test, and our results also establish its pointwise and uniform power, thus resolving an open problem from that work.
Both numerical simulations and real data analysis in causal inference suggest that our proposed test performs well empirically.
2025-03-24T14:32:05Z
57 pages, 7 figures
Xingwei Liu, Yuexin Chen, Jin-Ting Zhang, Wangli Xu

http://arxiv.org/abs/2606.04603v1
Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval
2026-06-03T08:41:27Z
Approximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to "relevance", ignoring the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content.
We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure.
On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
2026-06-03T08:41:27Z
Olivier Jeunen

http://arxiv.org/abs/2606.04576v1
ReSGA: A Large Tail Risk Model for Learning Value-at-Risk and Expected Shortfall
2026-06-03T08:11:45Z
Learning Value-at-Risk (VaR) and Expected Shortfall (ES) is important for managing financial risks effectively. Existing approaches with limited parameters are vulnerable to model misspecification in the era of big data. To address this limitation, we propose a large tail risk model, the retrieval-enhanced self-grouping autoencoder (ReSGA), which is designed with millions of parameters to exploit the rich cross-sectional dependence and long-term temporal dynamics of assets using their characteristics. Applied to monthly US equity returns from 1926 to 2023 with 153 firm characteristics, ReSGA outperforms twelve econometric and machine learning competitors in terms of out-of-sample loss and statistical backtesting. In addition, its forecast advantages can translate into significant economic gains from long-short decile portfolios that are constructed by a new size-enhanced left-side momentum strategy. To clarify the role of complexity, we further conduct a systematic scaling analysis and demonstrate that improvements in joint VaR-ES forecasting are primarily driven by data complexity rather than model complexity.
Finally, our analyses of group-importance and transfer-learning exhibit the interpretability and cross-market generalizability of ReSGA.
2026-06-03T08:11:45Z
Yichi Zhang, Ke Zhu, Zhoufan Zhu

http://arxiv.org/abs/2606.04574v1
Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning
2026-06-03T08:10:33Z
This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical "Filter-then-Rank" pair selection methodology and a proprietary "Fixed Risk, Adaptive Mean" execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies.
Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.
2026-06-03T08:10:33Z
61 pages, 37 figures, 16 tables
Damian Lebiedź, Robert Ślepaczuk

http://arxiv.org/abs/2606.09885v1
TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts
2026-06-03T07:49:31Z
Mixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address these limitations, we propose TENP, a structured Trapezoidal Expert Neuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, reserving model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model.
Moreover, it outperforms the full-parameter model by 10% on code generation tasks.
2026-06-03T07:49:31Z
Jiangyang He, Shaolin Zhu, Deyi Xiong

http://arxiv.org/abs/2510.05013v6
Curiosity-Driven Development of Action and Language in Robots Through Self-Exploration
2026-06-03T07:36:06Z
Infants acquire language with generalization from minimal experience, whereas large language models require billions of training tokens. What underlies efficient development in humans? We investigated this problem through experiments wherein robotic agents learn to perform actions associated with imperative sentences (e.g., push red cube) via curiosity-driven self-exploration. Our approach amortizes active inference using Q-learning, enabling intrinsically motivated developmental learning. The simulations reveal key findings corresponding to observations in developmental psychology. i) Generalization improves drastically as the scale of compositional elements increases. ii) Curiosity-driven exploration enables faster learning. iii) Rote pairing of sentences and actions precedes compositional generalization. iv) Exception-handling induces U-shaped developmental performance, a pattern like representational redescription in child language learning. These results suggest that curiosity-driven active inference accounts for how intrinsically motivated sensorimotor-linguistic learning supports scalable compositional generalization and exception handling in humans and artificial agents.
2025-10-06T16:53:39Z
27 pages, 22 pages of supplementary material
Theodore Jerome Tinker, Kenji Doya, Jun Tani

http://arxiv.org/abs/2606.05242v1
Deterministic Envelopes for Tamed SGLD: Decoupling Stochastic-Gradient Noise and Localizing Taming
2026-06-03T07:23:16Z
Stochastic-gradient Langevin algorithms often use tamed denominators to stabilize non-globally Lipschitz drifts.
This paper shows that when the denominator depends on the same stochastic-gradient realization as the numerator, the taming step changes the stochastic oracle itself and can create a stationary bias even if the original stochastic gradient is unbiased. We propose a structure-preserving framework for designing tamed denominators. It fixes the denominator before the oracle noise is sampled and uses localized deterministic envelopes to avoid unnecessary taming in typical regions. These kernels keep the stabilizing effect of taming while avoiding the bias introduced by a gradient-dependent denominator. Our theory explains how the stationary error splits into the bias caused by oracle-dependent taming and the remaining error introduced by deterministic stabilization. Within this deterministic-envelope family, the analysis identifies a far-tail condition that explains the limitation of local soft envelopes and motivates a hybrid member: soft in the typical region, but protected by hard-tail control on rare excursions. Experiments confirm the predicted stationary distortions of random denominators, the bias reduction of deterministic-envelope designs, and the stabilizing effect of the hybrid construction.
2026-06-03T07:23:16Z
40 pages, 11 tables, 2 figures
Yiwei Zhou, Ziheng Chen

http://arxiv.org/abs/2606.04486v1
Global Sketch-Based Watermarking for Diffusion Language Models
2026-06-03T06:08:58Z
Watermarking methods for language models have been studied extensively in the autoregressive setting, where tokens are generated sequentially. These works largely focus on local-context schemes that perturb the next token's distribution as a function of its preceding tokens. In diffusion language models, distributions over many unresolved positions are jointly sampled, allowing additive statistics of the entire sequence to be tractable during generation. We propose a watermark for masked diffusion language models that controls a global, vector-valued sketch representation of the text.
Compared to context-dependent watermarking, the sketch formulation decouples detection from the local contexts seen during generation, resulting in an order-agnostic statistic and a watermarking rule which does not manifest as a simple token bias. We analyze the distortion, soundness, and robustness properties of the method.
2026-06-03T06:08:58Z
Daniel Zhao

http://arxiv.org/abs/2606.05239v1
HyFAD: Hybrid Time-Frequency Diffusion with Frequency-Aware Embedding for Time Series Imputation
2026-06-03T06:03:37Z
Diffusion models have demonstrated strong performance in time series modeling due to their ability to progressively capture complex data distributions through iterative denoising. However, existing approaches struggle with frequency-sensitive denoising, high-frequency reconstruction and balancing global trends with local dynamics. To address these limitations, we propose HyFAD, a Hybrid time-frequency Diffusion model with Frequency-Aware embedding for time series imputation. Built upon the DDPM paradigm, HyFAD adopts a coupled time-frequency diffusion framework, in which the reverse denoising proceeds sequentially from the time domain to the frequency domain, enabling coarse-to-fine generation. Specifically, the time-domain diffusion process captures low-frequency global trends, while the frequency-domain diffusion process refines high-frequency spectral components. We further introduce a frequency-aware step embedding that exploits the relationship between diffusion steps and spectral components, providing step-dependent spectral guidance and facilitating more accurate band-wise reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that HyFAD achieves state-of-the-art performance.
Our source code is available at https://github.com/hongfangao/HyFAD.
2026-06-03T06:03:37Z
Hongfan Gao, Wangmeng Shen, Bin Yang, Jilin Hu

http://arxiv.org/abs/2606.04476v1
When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks
2026-06-03T05:44:30Z
In this paper, we study the gradient descent dynamics for jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function. Concretely, we consider a realizable setting where inputs are drawn i.i.d. from a Gaussian distribution and labels follow a planted linear model. This stylized framework captures salient features of end-to-end training in inverse problems and certain auto-encoder models. Despite its apparent simplicity, the dynamics remain poorly understood, in part because the loss landscape contains multiple non-strict saddle points, making it unclear why gradient descent from random initialization reliably escapes bad stationary regions. We provide a detailed characterization of the optimization landscape and prove that gradient descent from a moderately small random initialization, simultaneously training both layers, converges to a global minimizer at a linear rate with order-wise optimal sample complexity. Our analysis tracks the trajectory through three phases: an alignment phase in which hidden weights progressively align with the planted direction while the output weights maintain the correct sign pattern; a growth phase in which the norms of both layers increase while preserving alignment; and a local refinement phase in which the aligned neurons rapidly converge to the planted direction, yielding fast local convergence. To rigorously show that GD avoids non-strict saddles, we develop trajectory-level control arguments for the end-to-end dynamics. In addition, we establish novel uniform concentration results that hold along the entire trajectory, and are essential for obtaining order-wise optimal sample complexity.
We corroborate our theory with extensive experiments across a range of configurations.
2026-06-03T05:44:30Z
47 pages, 8 figures, published at the 39th Annual Conference on Learning Theory (COLT), 2026
Berk Tinaz, Changzhi Xie, Mahdi Soltanolkotabi

http://arxiv.org/abs/2411.03383v4
Near-Optimal and Tractable Estimation under Shift-Invariance
2026-06-03T05:06:35Z
How hard is it to estimate a discrete-time signal $(x_{1}, \ldots, x_{n}) \in \mathbb{C}^n$ satisfying an unknown linear recurrence relation of order $s$ and observed in i.i.d. complex Gaussian noise? The class of all such signals is parametric but extremely rich: it contains all exponential polynomials over $\mathbb{C}$ with total degree $s$, including harmonic oscillations with $s$ arbitrary frequencies. Geometrically, this class corresponds to the projection onto $\mathbb{C}^{n}$ of the union of all shift-invariant subspaces of $\mathbb{C}^{\mathbb{Z}}$ of dimension $s$. We show that the statistical complexity of this class, as measured by the squared minimax radius of the $(1-\delta)$-confidence $\ell_2$-ball, is nearly the same as for the class of $s$-sparse signals, namely $O\left(s\log(en) + \log(\delta^{-1})\right) \cdot \log^2(es) \cdot \log(en/s).$ Moreover, the corresponding near-minimax estimator is tractable, and it can be used to build a test statistic with a near-minimax detection threshold in the associated detection problem. These statistical results rely upon a simple analytic observation: the interpretation of the Fourier coefficients of the Christoffel function of any shift-invariant subspace of $\mathbb{C}^{\mathbb{Z}}$ as a reproducing filter with the smallest possible spectrum in all $\ell_p$-norms, $p \in [1,\infty]$, at once.
2024-11-05T18:11:23Z
28 pages. In the previous version (v2), our construction of the reproducing filter was erroneous. It is now replaced with an alternative construction using the Christoffel function.
The only change from v3 is a typesetting correction in the abstract
Dmitrii M. Ostrovskii