https://arxiv.org/api/B/unV8iLg6xMWTtuj5PcxU7xZDQ
2026-06-11T14:42:03Z
27245331515

http://arxiv.org/abs/2502.07027v4
Representational Alignment with Chemical Induced Fit for Molecular Relational Learning
2026-06-09T17:37:19Z
Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance in chemical space (\textit{e.g.}, functional group, scaffold) shifted data. With theoretical justification, we propose the \textbf{Re}presentational \textbf{Align}ment with Chemical Induced \textbf{Fit} (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during the fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings.
Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances model stability in both rule-shifted and scaffold-shifted data distributions.
2025-02-07T09:29:21Z
Accepted by SIGKDD 2026 AI for Science Track
Peiliang Zhang, Jingling Yuan, Qing Xie, Yongjun Zhu, Chao Che, Lin Li

http://arxiv.org/abs/2606.11149v1
Efficiently Learning Drifting Halfspaces with Massart Noise
2026-06-09T17:35:18Z
We study the problem of learning a drifting concept in the presence of Massart noise. In this framework, an online learner has access to a history of independent samples whose labels are noisy versions of a target concept that may change from round to round. The goal is to output, in each round, a hypothesis with small prediction error. We study the complexity of this learning problem for the fundamental class of margin-separable linear classifiers (halfspaces). On the positive side, we give a computationally efficient learner achieving error $η+ \tilde O(Δ^{1/3}/γ)$, where $η$ upper bounds the Massart noise rate, $Δ$ is the drift rate, and $γ$ is the margin. Interestingly, in the realizable setting, an adaptation of our techniques yields an efficient learner with an improved error rate over prior work. On the lower-bound side, we provide formal evidence of an information-computation tradeoff, strongly suggesting that our algorithm's performance is essentially optimal.
Specifically, while the information-theoretically optimal error scales with $Δ^{1/2}$, we prove that $Δ^{1/3}$-scaling is unavoidable for low-degree polynomial tests, even in the special case of random classification noise.
2026-06-09T17:35:18Z
To appear at ICML 2026
Mingchen Ma, Guyang Cao, Jelena Diakonikolas, Ilias Diakonikolas

http://arxiv.org/abs/2606.11144v1
OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
2026-06-09T17:33:24Z
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm.
The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
2026-06-09T17:33:24Z
24 pages, 7 figures, 4 tables. Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj. Python package: pip install oncotraj. Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1
Abhijoy Sarkar, Aarchi Singh Thakur

http://arxiv.org/abs/2606.11140v1
Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques
2026-06-09T17:29:47Z
Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional geological model space to a low-dimensional latent variable, reducing the dimensionality of the inverse problem while maintaining plausibility in posterior geomodels. However, the high nonlinearity in the LDM mapping may degrade the performance of Kalman-gain-based ensemble updates. We present a systematic comparison of DA algorithms applied to large-scale 3D channelized geomodels with hierarchical geological uncertainty. We compare model-space and latent-space DA using the ensemble smoother with multiple data assimilation (ESMDA), and demonstrate a key trade-off: model-space updates achieve significant uncertainty reduction but produce geologically unrealistic posterior models, while latent-space updates preserve realism but exhibit limited uncertainty reduction. Motivated by this, we explore rigorous Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) algorithms in the 3D-LDM latent space.
To accommodate their high computational demands, we develop a fast surrogate flow model that approximates well-rate responses. MCMC and SMC are evaluated against ESMDA across three synthetic test cases, with DA performed in the LDM latent space. All models maintain geological realism due to the LDM parameterization. MCMC and SMC are consistent with one another and achieve lower data mismatch and more uncertainty reduction than latent-space ESMDA. Our overall results demonstrate that ensemble Kalman methods may provide overestimated posterior uncertainty with highly nonlinear parameterizations, while rigorous Monte Carlo sampling, enabled by fast surrogate models, can provide a more reliable alternative.
2026-06-09T17:29:47Z
Guido Di Federico, Wenchao Teng, Louis J. Durlofsky

http://arxiv.org/abs/2606.11138v1
First-Order Trajectory Matching: Fast Ensemble Predictions of Chaotic, Turbulent, Stochastic Systems
2026-06-09T17:29:05Z
We introduce First-Order Trajectory Matching (FTM), a surrogate-modeling method that learns the first-order local transport of probability mass from trajectories of stochastic systems. By matching the symmetric first-order motion of trajectories, FTM learns the probability current velocity, whose flow preserves time marginals to match ensemble averages, while also capturing current-like trajectory quantities such as fluxes, circulations, and barrier-crossing currents. FTM learns the current velocity directly from trajectories, avoiding drift, diffusion, and score estimation. Our stability analysis separates discretization error from sampling variance and shows that the one-step simulation-free FTM loss is stable when temporal resolution and sample size are properly balanced.
Across stochastic dynamical systems and PDE examples, we empirically demonstrate that FTM provides trajectory-aware ensemble predictions at low, deterministic-rollout cost.
2026-06-09T17:29:05Z
Shreya Jha, Timo Schorlepp, Nicholas Geissler, Jules Berman, Benjamin Peherstorfer

http://arxiv.org/abs/2606.11130v1
Robust Regression of General ReLUs with Queries
2026-06-09T17:24:47Z
We study the task of agnostically learning general (as opposed to homogeneous) ReLUs under the Gaussian distribution with respect to the squared loss. In the passive learning setting, recent work gave a computationally efficient algorithm that uses $poly(d,1/ε)$ labeled examples and outputs a hypothesis with error $O(opt)+ε$, where $opt$ is the squared loss of the best fit ReLU. Here we focus on the interactive setting, where the learner has some form of query access to the labels of unlabeled examples. Our main result is the first computationally efficient learner that uses $d polylog(1/ε)+\tilde{O}(\min\{1/p, 1/ε\})$ black-box label queries, where $p$ is the bias of the target function, and achieves error $O(opt)+ε$. We complement our algorithmic result by showing that its query complexity bound is qualitatively near-optimal, even ignoring computational constraints. Finally, we establish that query access is essentially necessary to improve on the label complexity of passive learning. Specifically, for pool-based active learning, any active learner requires $\tilde{Ω}(d/ε)$ labels, unless it draws a super-polynomial number of unlabeled examples.
2026-06-09T17:24:47Z
Appeared at NeurIPS 2025
Ilias Diakonikolas, Daniel M. Kane, Mingchen Ma

http://arxiv.org/abs/2606.11125v1
DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals
2026-06-09T17:23:07Z
Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation.
However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signal, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.
2026-06-09T17:23:07Z
Yidan Shen, Neville Mathew, Maham Rahimi, Deependra Dhakal, George Zouridakis, Xin Fu, Renjie Hu

http://arxiv.org/abs/2606.11123v1
Overcoming Rank Collapse in Feedback Alignment
2026-06-09T17:21:25Z
Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP.
This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigated differences between BP and FA models, trained on CIFAR10, specifically focusing on the effective rank of the signal. We found that the FA error has a considerably lower rank and hence is constrained to a lower-dimensional subspace compared to BP, limiting exploration of the parameter space. Motivated by this observation, we evaluated two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, we find that these methods consistently improve over FA baselines, for example, on CIFAR100 with a Resnet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.
2026-06-09T17:21:25Z
9 pages and 4 figures, 1 table for main text. Total of 28 pages and 13 figures with appendix
Gauthier Boeshertz, Razvan Pascanu, Claudia Clopath

http://arxiv.org/abs/2606.11119v1
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
2026-06-09T17:16:03Z
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout.
Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
2026-06-09T17:16:03Z
32 pages, 12 figures, 6 tables
Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji

http://arxiv.org/abs/2606.09026v2
Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI
2026-06-09T17:11:01Z
We ask whether structural properties of intermediate grid states predict whether a symbolic ARC-AGI solver will succeed, framed as a test of conditional mutual information I(X;Y|task) > 0.
Across 44,800 runs spanning two architecturally distinct solvers (beam search and Stochastic DFS), 400 ARC tasks, 28 configurations per solver, and both training and evaluation splits, hand-crafted grid descriptors measured at 50% trajectory completion discriminate successful from failed runs within the same task (mean within-task best-feature AUC = 0.885, p < 0.001 under within-task label permutation). Most predictive content lies along a single grid-complexity axis. The result generalizes across solver architectures: a feature selected on one solver predicts success on the other with AUC 0.747-0.762 in all four transfer directions (p < 0.001, leakage controlled). On a pre-registered held-out set of 41 reliable tasks, the frozen feature n_components_final achieves AUC = 0.765 (95% CI [0.717, 0.810], p < 0.001), robust under task-clustered bootstrap resampling and cross-solver task collapsing. The signal is not explained by solver capacity (configuration-residualized AUC = 0.927 and 0.896 for beam search and SDFS, p < 0.001) and is only weakly coupled to score trajectories (R^2 approximately 0). Early stopping at 50% completion reduces beam-search compute by 33.6% while retaining 98.9% of solves; degenerate-trajectory detection reduces SDFS compute by 65.3% with no solve loss. Finally, on 229 of 400 evaluation tasks the DSL primitive library produces no valid transition from the input grid. This 0-step collapse is invariant to search budget and universally failed by beam search, indicating a DSL coverage limitation rather than a search-budget effect.
2026-06-08T04:51:24Z
Ayan Pendharkar

http://arxiv.org/abs/2512.18531v2
Pushing the limits of one-dimensional NMR spectroscopy for automated structure elucidation using artificial intelligence
2026-06-09T17:07:47Z
One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products.
For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from $10^{20}$ to $10^{60}$. The task of determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional $^1$H and/or $^{13}$C NMR spectrum, i.e. de novo structure generation, thus appears completely intractable. Here we show how it is possible to achieve this task for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, thus covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture predicts the correct molecule with 60.4% accuracy within the first 15 predictions using only the $^1$H and $^{13}$C NMR spectra, thus overcoming the combinatorial growth of the chemical space while also being extensible to experimental data via fine-tuning.
2025-12-20T22:56:49Z
Frank Hu, Jonathan M. Tubb, Dimitris Argyropoulos, Sergey Golotvin, Mikhail Elyashberg, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland

http://arxiv.org/abs/2606.09079v2
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
2026-06-09T17:04:45Z
Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy.
By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory.
We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.
2026-06-08T06:25:54Z
Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4
Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

http://arxiv.org/abs/2606.11104v1
Limitations of Learning Tanh Neural Networks with Finite Precision
2026-06-09T17:02:27Z
We investigate limitations of learning $\tanh$ neural networks from point evaluations under finite-precision computations and $L^p$ accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated $\tanh$ activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on $m$ samples can achieve a convergence rate higher than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture.
The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the $\tanh$ setting.
2026-06-09T17:02:27Z
Philipp Grohs, Matěj Trödler

http://arxiv.org/abs/2603.04689v4
Generalizing Fair Top-$k$ Selection: An Integrative Approach
2026-06-09T17:00:32Z
Fair top-$k$ selection, which ensures appropriate proportional representation of members from minority or historically disadvantaged groups among the top-$k$ selected candidates, has drawn significant attention. We study the problem of finding a fair (linear) scoring function with multiple protected groups while also minimizing the disparity from a reference scoring function. This generalizes the prior setup, which was restricted to the single-group setting without disparity minimization. Previous studies imply that the number of protected groups may have a limited impact on the runtime efficiency. However, driven by the need for experimental exploration, we find that this implication overlooks a critical issue that may affect the fairness of the outcome. Once this issue is properly considered, our hardness analysis shows that the problem may become computationally intractable even for a two-dimensional dataset and small values of $k$. However, our analysis also reveals a gap in the hardness barrier, enabling us to recover the efficiency for the case of small $k$ when the number of protected groups is sufficiently small. Furthermore, beyond measuring disparity as the "distance" between the fair and the reference scoring functions, we introduce an alternative disparity measure, utility loss, that may yield a more stable scoring function under small weight perturbations.
Through careful engineering trade-offs that balance implementation complexity, robustness, and performance, our augmented two-pronged solution demonstrates strong empirical performance on real-world datasets, with experimental observations also informing algorithm design and implementation decisions.
2026-03-05T00:06:47Z
Guangya Cai

http://arxiv.org/abs/2510.08906v2
Gradient-Guided Furthest Point Sampling for Robust Training Set Selection
2026-06-09T16:59:34Z
Training set sampling methods are used to improve model performance and lower data costs in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy system (the Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Our numerical results indicate superior data efficiency and model robustness when using GGFPS compared to FPS and uniform random sampling (URS), as well as established supervised FPS-style selectors, PCov-FPS and PCov-CUR. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to twofold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinski-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces.
These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
2025-10-10T01:41:06Z
41 pages, 43 figures, 2 algorithms; journal article with supplementary information appended
Machine Learning: Science and Technology 7, 035047 (2026)
Morris Trestman, Stefan Gugler, Felix A. Faber, O. A. von Lilienfeld
10.1088/2632-2153/ae68b8
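
The Gradient-Guided Furthest Point Sampling entry above describes GGFPS as an extension of plain Furthest Point Sampling (FPS). As a rough illustration of the baseline only, the sketch below implements greedy FPS in pure Python. The function name, the starting index, and the toy points are assumptions made for this example; the force-norm guidance that distinguishes GGFPS is not shown, since the abstract does not give its exact form.

```python
import math


def furthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point furthest from the chosen set.

    points: list of equal-length coordinate tuples (hypothetical toy data).
    Returns the indices of the selected points, in selection order.
    """
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_samples:
        # Next sample: the point whose nearest selected neighbour is furthest away.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


toy = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
picked = furthest_point_sampling(toy, 3)
print(picked)  # [0, 2, 3]
```

On the toy data the near-duplicate point at index 1 is skipped in favour of the far corners, which is the spread-maximising behaviour that the abstract says can under-sample densely populated (equilibrium) regions.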