https://arxiv.org/api/JMOnLYi3nsQB0pmzeG0k9DVNyPQ 2026-06-13T22:21:35Z 272646 525 15 http://arxiv.org/abs/2606.11599v1 When is Your LLM Steerable? 2026-06-10T02:55:34Z Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost. 2026-06-10T02:55:34Z Chenrui Fan Yize Cheng Ming Li Soheil Feizi Tianyi Zhou http://arxiv.org/abs/2510.07750v3 Calibrating Decision Robustness via Inverse Conformal Risk Control 2026-06-10T02:44:06Z Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet their effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage--regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost--risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance. This paper offers a principled data-driven methodology for guiding robustness selection and empowers practitioners to balance robustness and conservativeness in high-stakes decision-making. 2025-10-09T03:38:17Z Wenbin Zhou Shixiang Zhu http://arxiv.org/abs/2606.11585v1 Kuramoto Attention: Synchronizing Self-Attention on the Torus 2026-06-10T02:24:04Z We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(θ_u-θ_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization. 2026-06-10T02:24:04Z 13 pages, 2 figures, 3 tables Joshua Nunley http://arxiv.org/abs/2605.00545v2 Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots 2026-06-10T02:19:11Z Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution. 2026-05-01T09:57:13Z Junda Ying Yuxuan Wang Bowen Yang Peijie Zhou Lei Zhang http://arxiv.org/abs/2606.11583v1 Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching 2026-06-10T02:15:56Z Text-attributed graphs (TAGs) underlie real-world applications such as citation networks, social media, and e-commerce. Few-shot graph learning on TAGs is hard: with only a handful of labels per class and the rest of the graph unannotated, neither GNNs nor LLMs can learn well on their own. GNNs read topology and fail on cold nodes; LLMs read text and fail on text-ambiguous nodes. Existing LLM-GNN methods all follow the same recipe: designate one model as the golden teacher and use its outputs (e.g., features or pseudo-labels) to supervise the other. We argue this golden-teacher assumption breaks under sparse supervision: neither model is golden, and treating either as such transfers its blind spots into the student. We therefore ask: can we avoid designating either model as the golden teacher, and still perform effective graph learning? We answer with LLM-GNN Co-Teaching, a bidirectional co-teaching framework in which neither model is fixed as teacher. The GNN and LLM exchange their most confident pseudo-labels under an architecture-specific small-loss criterion, and both update every round. Supervision is then mined from the trajectory: whenever a node moves from cross-model contradiction at round t to cross-model agreement at round t+1, the LLM's two answers on the same input form a preference pair (old contradicting self < new peer-endorsed self) for DPO training. We call this Round-based Pseudo-Label Preference Optimization (RPL-PO). On six benchmarks, LLM-GNN Co-Teaching consistently outperforms GNN-as-Judge and all prior methods, with absolute 3-shot gains of 7.86% on Cora and 7.73% on ogbn-arxiv; improvements carry over to 5-shot and to zero-shot cross-dataset transfer. Error-structure analysis further shows that abandoning the golden-teacher assumption substantially improves the LLM's graph learning capability on challenging samples. 2026-06-10T02:15:56Z Code: https://github.com/llmgnncoteaching/LLM-GNN-Coteaching Zhuoyi Peng Hanlin Gu Lixin Fan Yi Yang http://arxiv.org/abs/2505.08784v2 PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework 2026-06-10T02:07:51Z As machine learning (ML) enters high-stakes domains, trustworthy uncertainty quantification (UQ) is essential for safety. In this paper we introduce PCS-UQ, a framework based on the Predictability, Computability, and Stability (PCS) principles for veridical data science. Starting with a candidate set of models or algorithms, PCS-UQ integrates a rigorous prediction-check to screen out unsuitable models in the set and utilizes bootstrap samples, in order to capture both inter-sample variability and algorithmic instability for the prediction-checked algorithms. We then introduce a novel multiplicative calibration scheme to enhance local adaptivity, which basically corresponds to a new score in conformal prediction. Moreover, we produce a compilation of 17 real-world regression datasets with manually-constructed subgroups. On this benchmark, PCS-UQ maintains the target coverage while outperforming or matching conformal methods equipped with oracle-selected algorithms in interval width. PCS-UQ achieves consistent subgroup coverage, outperforming these oracle-selected conformal methods. Notably, PCS-UQ stands out in achieving both competitive interval widths and consistent subgroup coverage.Across 6 classification datasets, PCS-UQ reduces prediction set sizes by 20\%. To scale the framework for deep learning, we propose computationally efficient variants that bypass expensive retraining. On three computer vision benchmarks, these variants reduce prediction set sizes by 20\% over conformal baselines. Finally, we provide theoretical proof that a modified PCS-UQ algorithm preserves valid coverage under exchangeability as a form of split conformal inference. 2025-05-13T17:58:16Z Abhineet Agarwal Fange Xiao Rebecca Barter Omer Ronen Boyu Fan Bin Yu http://arxiv.org/abs/2606.07591v2 ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research 2026-06-10T02:02:36Z AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research. 2026-05-28T16:27:40Z Wanghan Xu Shuo Li Tianlin Ye Qinglong Cao Yixin Chen Hengjian Gao Yiheng Wang Qi Li Kun Li Sheng Xu Shengdu Chai Fangchen Yu Xiangyu Zhao Zhangrui Zhao Weijie Ma Zijie Guo Haoyu Zhou Haoxiang Yin Lixue Cheng Chaofan Hu Haoxuan Li Lu Mi Xuxuan Xie Yifan Zhou Ruizhe Chen Zhiwang Zhou Xingjian Guo Yuhao Zhou Xuming He Shengyuan Xu Xinyu Gu Jiamin Wu Mianxin Liu Chunfeng Song Fenghua Ling Dongzhan Zhou Shixiang Tang Yuqiang Li Mao Su Peng Ye Siqi Sun Bin Wang Xue Yang Zhenfei Yin Tianfan Fu Guangtao Zhai Wanli Ouyang Bo Zhang Lei Bai Wenlong Zhang http://arxiv.org/abs/2606.11574v1 Range-Aware Bayesian Optimization for Discovering Diverse Designs within Target Property Windows 2026-06-10T01:58:30Z In many materials and product design problems, desirable candidates exhibit properties that fall within an acceptable range rather than achieve a single optimum. Recovering multiple, distinct solutions that satisfy such specifications is also practically valuable, as some candidates may be preferred for reasons of cost, processability, or robustness that are difficult to encode directly in an objective function. Here, we develop a range-aware Bayesian optimization (BO) framework in which the acquisition function directly scores the posterior probability that a candidate satisfies a target range. The framework naturally extends to parallel pursuit of multiple distinct specifications over a shared candidate space. Across benchmark tasks, range-aware acquisition consistently recovers larger and more diverse sets of valid designs than standard BO baselines and recent goal-seeking methods. Its utility is further demonstrated in two practically motivated design case studies involving optimizing reaction conditions for polymer synthesis and sequence-defined oligomer discovery for prescribed optical absorption bands, supported by quantum chemical calculations. These results suggest that range-aware BO can provide a practical and sample-efficient foundation for specification-driven design, particularly when design flexibility and solution diversity are important considerations. 2026-06-10T01:58:30Z 64 pages, 6 main text figures, 17 supporting figures, 6 supporting tables Shengli Jiang Jason Wu Charles M. Schroeder Michael A. Webb http://arxiv.org/abs/2606.11570v1 Enhancing Spectral Embedding through Robust and Flexible Knowledge Transfer in Electronic Health Records 2026-06-10T01:51:51Z We propose a spectral-based, unsupervised representation learning framework to derive low-dimensional embeddings for clinical concepts and patients in rare disease cohorts from electronic health records, where data are high-dimensional but sample sizes are limited. To overcome this challenge, we incorporate a knowledge matrix extracted from a broader population that shares a partially overlapping subspace with the rare-disease cohort. Our method departs from existing approaches by relaxing restrictive one-to-one signal-alignment assumptions between the latent data matrix and knowledge matrix, allowing more flexible and realistic forms of structured sharing. We introduce a novel two-step spectral embedding procedure: first, we identify and remove irrelevant components from the knowledge matrix; then, we apply a projection-based method to separately recover shared and heterogeneous components. Simulations and an analysis of a real-world multiple sclerosis cohort show that the proposed method outperforms competing approaches, particularly in challenging scenarios where shared signals are weak and only partially aligned, as is common in rare-disease data. 2026-06-10T01:51:51Z Feiqing Huang Zongqi Xia Rong Ma Tianxi Cai http://arxiv.org/abs/2606.11562v1 GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs 2026-06-10T01:41:53Z Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture. 2026-06-10T01:41:53Z Code: https://github.com/graphinfer/GraphInfer-Bench ; Dataset: https://huggingface.co/datasets/graphinfer/graphinfer Zhuoyi Peng Jingzhou Jiang Hanlin Gu Lixin Fan Yi Yang http://arxiv.org/abs/2411.10959v5 Program Evaluation with Remotely Sensed Outcomes 2026-06-10T01:37:08Z We study causal inference in experiments and quasi-experiments, where the economic outcome is imperfectly measured by a remotely sensed variable. The remotely sensed variable is low-cost, scalable, and predictive of the economic outcome in observational data; examples include satellite imagery and mobile phone activity. We model the remotely sensed variable as post-outcome: variation in the economic outcome causes variation in the remotely sensed variable. For example, changes in environmental quality cause changes in satellite imagery, not vice versa. Under this assumption, we propose a formula to nonparametrically identify the causal parameter by combining experimental and observational data. We develop a method for n^{-1/2} inference that is robust to misspecification and that does not restrict the algorithms used to process remotely sensed variables. 2024-11-17T04:43:04Z Ashesh Rambachan Rahul Singh Davide Viviano http://arxiv.org/abs/2606.11556v1 Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices 2026-06-10T01:33:20Z Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, $0.782$), and an $\varepsilon$ sweep identifies $\varepsilon=4$ as the recommended clinical operating point. INT8 quantization roughly halves model size and cuts Pi 4 latency by up to $44%$ with $<0.12%$ AUROC loss. Crucially, DP and quantization penalties are empirically independent, so practitioners need not trade a strong privacy guarantee for a compact edge footprint. To our knowledge, this is the first system combining federated learning, formal $(\varepsilon,δ)$-DP, unsupervised reconstruction-based detection, and quantized AArch64 deployment. 2026-06-10T01:33:20Z 9 pages, 4 figures, 6 tables. Preprint prepared in IEEE conference format. Submitted to: FLTA 2026 Kaan Arda Akyol Jakub Kacper Szeląg Aydin Abadi Maha Alghamdi Ghadah Albalawi Ghouse Ibrahim Kaleelullah Hilal Tutus Sarah Al Subaiei Shardul Kapse Syed Mohammed Raheeb Mujeeb Ahmed Rehmat Ullah http://arxiv.org/abs/2606.11555v1 End-to-End Machine Learning for Depressive State Classification via EEG and fNIRS 2026-06-10T01:31:07Z The escalating demand for mental healthcare, driven by rising societal stress, highlights the limitations of traditional psychiatric diagnostics. Conventional methods - relying primarily on clinical interviews and patient self-reports - are inherently vulnerable to subjective bias and the varying empirical judgment of practitioners. To address the need for quantitative evaluation, biological signal-based detection, including electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), has emerged as a promising objective alternative. Such technology is particularly vital for identifying latent depressive states that may be unrecognized by the subjects themselves. Furthermore, in aging populations, the high comorbidity between depression and dementia necessitates early differentiation to prevent mutual symptom exacerbation and maintain Quality of Life (QoL). This pilot study of eleven healthy students establishes a framework for biological signal-based depression detection, serving as a foundational step toward automated, objective diagnostic tools for clinical use. 2026-06-10T01:31:07Z 4 pages, 4 figures, Accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026 Riki Sakurai Simon Kojima Mihoko Otake-Matsuura Shin'ichiro Kanoh Tomasz M. Rutkowski http://arxiv.org/abs/2605.03065v2 OGPO: Sample Efficient Full-Finetuning of Generative Control Policies 2026-06-10T01:30:22Z Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagate policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with few task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms methods alternatives on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement. 2026-05-04T18:36:40Z Sarvesh Patil Mitsuhiko Nakamoto Manan Agarwal Shashwat Saxena Jesse Zhang Giri Anantharaman Cleah Winston Chaoyi Pan Douglas Chen Nai-Chieh Huang Zeynep Temel Oliver Kroemer Sergey Levine Abhishek Gupta Hongkai Dai Paarth Shah Max Simchowitz http://arxiv.org/abs/2510.06596v2 SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation 2026-06-10T01:28:46Z The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM 2025-10-08T03:01:26Z Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3 Journal of Electronic Imaging 35(3), 033014 (2026) Ayush Zenith Arnold Zumbrun Neel Raut Jing Lin 10.1117/1.JEI.35.3.033014