https://arxiv.org/api/ITcRdoCy8YEe2H3gfK7ulvNehTo 2026-06-11T08:04:09Z 272453 225 15 http://arxiv.org/abs/2606.07591v2 ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research 2026-06-10T02:02:36Z AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research. 2026-05-28T16:27:40Z Wanghan Xu Shuo Li Tianlin Ye Qinglong Cao Yixin Chen Hengjian Gao Yiheng Wang Qi Li Kun Li Sheng Xu Shengdu Chai Fangchen Yu Xiangyu Zhao Zhangrui Zhao Weijie Ma Zijie Guo Haoyu Zhou Haoxiang Yin Lixue Cheng Chaofan Hu Haoxuan Li Lu Mi Xuxuan Xie Yifan Zhou Ruizhe Chen Zhiwang Zhou Xingjian Guo Yuhao Zhou Xuming He Shengyuan Xu Xinyu Gu Jiamin Wu Mianxin Liu Chunfeng Song Fenghua Ling Dongzhan Zhou Shixiang Tang Yuqiang Li Mao Su Peng Ye Siqi Sun Bin Wang Xue Yang Zhenfei Yin Tianfan Fu Guangtao Zhai Wanli Ouyang Bo Zhang Lei Bai Wenlong Zhang http://arxiv.org/abs/2606.11574v1 Range-Aware Bayesian Optimization for Discovering Diverse Designs within Target Property Windows 2026-06-10T01:58:30Z In many materials and product design problems, desirable candidates exhibit properties that fall within an acceptable range rather than achieve a single optimum. Recovering multiple, distinct solutions that satisfy such specifications is also practically valuable, as some candidates may be preferred for reasons of cost, processability, or robustness that are difficult to encode directly in an objective function. Here, we develop a range-aware Bayesian optimization (BO) framework in which the acquisition function directly scores the posterior probability that a candidate satisfies a target range. The framework naturally extends to parallel pursuit of multiple distinct specifications over a shared candidate space. Across benchmark tasks, range-aware acquisition consistently recovers larger and more diverse sets of valid designs than standard BO baselines and recent goal-seeking methods. Its utility is further demonstrated in two practically motivated design case studies involving optimizing reaction conditions for polymer synthesis and sequence-defined oligomer discovery for prescribed optical absorption bands, supported by quantum chemical calculations. These results suggest that range-aware BO can provide a practical and sample-efficient foundation for specification-driven design, particularly when design flexibility and solution diversity are important considerations. 2026-06-10T01:58:30Z 64 pages, 6 main text figures, 17 supporting figures, 6 supporting tables Shengli Jiang Jason Wu Charles M. Schroeder Michael A. Webb http://arxiv.org/abs/2606.11570v1 Enhancing Spectral Embedding through Robust and Flexible Knowledge Transfer in Electronic Health Records 2026-06-10T01:51:51Z We propose a spectral-based, unsupervised representation learning framework to derive low-dimensional embeddings for clinical concepts and patients in rare disease cohorts from electronic health records, where data are high-dimensional but sample sizes are limited. To overcome this challenge, we incorporate a knowledge matrix extracted from a broader population that shares a partially overlapping subspace with the rare-disease cohort. Our method departs from existing approaches by relaxing restrictive one-to-one signal-alignment assumptions between the latent data matrix and knowledge matrix, allowing more flexible and realistic forms of structured sharing. We introduce a novel two-step spectral embedding procedure: first, we identify and remove irrelevant components from the knowledge matrix; then, we apply a projection-based method to separately recover shared and heterogeneous components. Simulations and an analysis of a real-world multiple sclerosis cohort show that the proposed method outperforms competing approaches, particularly in challenging scenarios where shared signals are weak and only partially aligned, as is common in rare-disease data. 2026-06-10T01:51:51Z Feiqing Huang Zongqi Xia Rong Ma Tianxi Cai http://arxiv.org/abs/2606.11562v1 GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs 2026-06-10T01:41:53Z Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture. 2026-06-10T01:41:53Z Code: https://github.com/graphinfer/GraphInfer-Bench ; Dataset: https://huggingface.co/datasets/graphinfer/graphinfer Zhuoyi Peng Jingzhou Jiang Hanlin Gu Lixin Fan Yi Yang http://arxiv.org/abs/2411.10959v5 Program Evaluation with Remotely Sensed Outcomes 2026-06-10T01:37:08Z We study causal inference in experiments and quasi-experiments, where the economic outcome is imperfectly measured by a remotely sensed variable. The remotely sensed variable is low-cost, scalable, and predictive of the economic outcome in observational data; examples include satellite imagery and mobile phone activity. We model the remotely sensed variable as post-outcome: variation in the economic outcome causes variation in the remotely sensed variable. For example, changes in environmental quality cause changes in satellite imagery, not vice versa. Under this assumption, we propose a formula to nonparametrically identify the causal parameter by combining experimental and observational data. We develop a method for n^{-1/2} inference that is robust to misspecification and that does not restrict the algorithms used to process remotely sensed variables. 2024-11-17T04:43:04Z Ashesh Rambachan Rahul Singh Davide Viviano http://arxiv.org/abs/2606.11556v1 Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices 2026-06-10T01:33:20Z Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, $0.782$), and an $\varepsilon$ sweep identifies $\varepsilon=4$ as the recommended clinical operating point. INT8 quantization roughly halves model size and cuts Pi 4 latency by up to $44%$ with $<0.12%$ AUROC loss. Crucially, DP and quantization penalties are empirically independent, so practitioners need not trade a strong privacy guarantee for a compact edge footprint. To our knowledge, this is the first system combining federated learning, formal $(\varepsilon,δ)$-DP, unsupervised reconstruction-based detection, and quantized AArch64 deployment. 2026-06-10T01:33:20Z 9 pages, 4 figures, 6 tables. Preprint prepared in IEEE conference format. Submitted to: FLTA 2026 Kaan Arda Akyol Jakub Kacper Szeląg Aydin Abadi Maha Alghamdi Ghadah Albalawi Ghouse Ibrahim Kaleelullah Hilal Tutus Sarah Al Subaiei Shardul Kapse Syed Mohammed Raheeb Mujeeb Ahmed Rehmat Ullah http://arxiv.org/abs/2606.11555v1 End-to-End Machine Learning for Depressive State Classification via EEG and fNIRS 2026-06-10T01:31:07Z The escalating demand for mental healthcare, driven by rising societal stress, highlights the limitations of traditional psychiatric diagnostics. Conventional methods - relying primarily on clinical interviews and patient self-reports - are inherently vulnerable to subjective bias and the varying empirical judgment of practitioners. To address the need for quantitative evaluation, biological signal-based detection, including electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), has emerged as a promising objective alternative. Such technology is particularly vital for identifying latent depressive states that may be unrecognized by the subjects themselves. Furthermore, in aging populations, the high comorbidity between depression and dementia necessitates early differentiation to prevent mutual symptom exacerbation and maintain Quality of Life (QoL). This pilot study of eleven healthy students establishes a framework for biological signal-based depression detection, serving as a foundational step toward automated, objective diagnostic tools for clinical use. 2026-06-10T01:31:07Z 4 pages, 4 figures, Accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026 Riki Sakurai Simon Kojima Mihoko Otake-Matsuura Shin'ichiro Kanoh Tomasz M. Rutkowski http://arxiv.org/abs/2605.03065v2 OGPO: Sample Efficient Full-Finetuning of Generative Control Policies 2026-06-10T01:30:22Z Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagate policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with few task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms methods alternatives on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement. 2026-05-04T18:36:40Z Sarvesh Patil Mitsuhiko Nakamoto Manan Agarwal Shashwat Saxena Jesse Zhang Giri Anantharaman Cleah Winston Chaoyi Pan Douglas Chen Nai-Chieh Huang Zeynep Temel Oliver Kroemer Sergey Levine Abhishek Gupta Hongkai Dai Paarth Shah Max Simchowitz http://arxiv.org/abs/2510.06596v2 SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation 2026-06-10T01:28:46Z The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM 2025-10-08T03:01:26Z Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3 Journal of Electronic Imaging 35(3), 033014 (2026) Ayush Zenith Arnold Zumbrun Neel Raut Jing Lin 10.1117/1.JEI.35.3.033014 http://arxiv.org/abs/2606.10198v2 Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity 2026-06-10T01:25:04Z Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score beats on AUROC with 5-20 points gain, while demonstrating tempered degradation under calibration-label scarcity. 2026-06-08T21:36:12Z Nina I. Shamsi http://arxiv.org/abs/2606.11553v1 APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations 2026-06-10T01:23:24Z Generic time-series foundation models transfer poorly to wireless network telemetry whose signals are bursty, zero-inflated, and coupled across protocol layers. We present APEX, a network-native, decoder-only transformer for forecasting enterprise AP telemetry, and evaluate it on DHCP degradation as a representative network task. APEX is pre-trained on 10-channel multivariate telemetry from ~4,500 production wireless networks (~100K AP time series, 34 metrics per AP), and is available as APEX-Large (269M, cloud) and APEX-Edge (10.5M, edge). On a 192-step (4-day) DHCP degradation benchmark, APEX-Large reduces MAE by 18% over the strongest foundation-model baseline (Toto) and 38% over SARIMA, with anomaly-detection F1 = 0.93, while APEX-Edge enables sub-second, privacy-preserving inference on AP-class edge hardware. These results suggest network-native pre-training is a practical foundation for proactive wireless operations. 2026-06-10T01:23:24Z 5 pages, 1 figure, 4 tables. Discusses a network-native time-series foundation model for wireless edge operations Swadhin Pradhan Niloo Bahadori Peiman Amini http://arxiv.org/abs/2606.11552v1 Teaching Diffusion to Speculate Left-to-Right 2026-06-10T01:21:56Z Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract. 2026-06-10T01:21:56Z 13 pages, technical report Lexington Whalen Yuki Ito Ryo Sakamoto http://arxiv.org/abs/2512.22219v2 MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs 2026-06-10T00:43:42Z We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, \rev{fine-grained overlap of computation and communication, and other optimizations that are infeasible under the conventional kernel-per-operator execution model}. The MPK compiler lowers tensor programs into optimized SM-level task graphs and generates fast CUDA implementations for each task, while the MPK in-kernel parallel runtime executes these tasks within a single persistent mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems, achieving up to 1.7$\times$ lower end-to-end inference latency and pushing LLM inference performance close to the limits of the underlying hardware. MPK is publicly available at https://github.com/mirage-project/mirage. 2025-12-22T14:18:20Z 14 pages Xinhao Cheng Zhihao Zhang Yu Zhou Jianan Ji Jinchen Jiang Zepeng Zhao Ziruo Xiao Zihao Ye Yingyi Huang Ruihang Lai Hongyi Jin Bohan Hou Mengdi Wu Yixin Dong Anthony Yip Zihao Ye Songting Wang Wenqin Yang Xupeng Miao Tianqi Chen Zhihao Jia http://arxiv.org/abs/2606.11534v1 Urban Heat MiniCubes: An AI-Ready dataset for urban heat research 2026-06-10T00:34:20Z Urban heat is amplified by impermeable surfaces and heterogeneous built environments, yet street-level variability remains difficult to quantify because multi-sensor observations are rarely available in consistent, analysis-ready form at the necessary spatiotemporal scales. We present "Urban Heat MiniCubes," a publicly available, FAIR-oriented dataset designed for machine learning applications in urban heat research. The dataset provides harmonized 90 x 90 km gridded data cubes for 48 cities in the Western Hemisphere spanning 2022-2023, with variables reprojected and collocated to a common grid to reduce preprocessing (e.g., reprojection, resampling, and spatiotemporal alignment). Urban Heat MiniCubes includes two complementary modalities: (i) higher-spatial-resolution, lower-frequency observations from Landsat 8/9 (e.g., surface reflectances) and Sentinel-1 (e.g., synthetic aperture radar backscatter), and (ii) higher-temporal-frequency, coarser observations from GOES-R (e.g., longwave infrared brightness temperatures) and a microwave land surface temperature product. We document variables and metadata and provide technical assessment using inter-variable analyses and autoencoder-based reconstruction-error summaries across pixel classes (e.g., water and cloud). Potential use cases and limitations are also discussed. 2026-06-10T00:34:20Z 53 pages, 26 figures, Submitted to Nature Scientific Data Jonathan Starfeldt Maria J. Molina Alexander Kerr Adam Yang Thomas R. H. Holmes Christopher R. Hain http://arxiv.org/abs/2606.11533v1 AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks 2026-06-10T00:34:04Z The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models that plan to be integrated into defense applications, like military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications. 2026-06-10T00:34:04Z 9 pages, 1 figure, ICML 2026 Position Paper Ted Fujimoto Jacob Benz