https://arxiv.org/api/LUY4l+4pAg8Mq4MfCFxKf20/SHo2026-04-07T08:32:06Z17077210515http://arxiv.org/abs/2603.13842v2Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving2026-04-06T08:06:45ZEnd-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, such a paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often leads to a performance ceiling due to its dependence on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond prior knowledge of IL. Furthermore, we introduce a tree-structured trajectory neural sampler to group relative policy optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on NAVSIMv1 and v2 benchmark demonstrates that PaIR-Drive achieves Competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods, and could even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.2026-03-14T08:53:47Z11 pages, 7 figures, 6 tablesZhexi LianHaoran WangXuerun YanWeimeng LinXianhong ZhangYongyu ChenJia Huhttp://arxiv.org/abs/2509.11481v2RAPTOR: A Foundation Policy for Quadrotor Control2026-04-06T08:06:29ZHumans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through in-context learning is made possible by using a recurrence in the hidden layer. The policy is trained through our proposed Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using RL. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).2025-09-15T00:05:40ZJonas EschmannDario AlbaniGiuseppe Loiannohttp://arxiv.org/abs/2604.04503v1Memory Intelligence Agent2026-04-06T07:59:52ZDeep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate a reflection and an unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.2026-04-06T07:59:52ZJingyang QiaoWeicheng MengYu ChengZhihang LinZhizhong ZhangXin TanJingyu GongKun ShaoYuan Xiehttp://arxiv.org/abs/2512.03795v2MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving2026-04-06T07:53:28ZAutonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD's limited ability to interact with surrounding vehicles, largely due to a lack of understanding the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated into a discrete space-state representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-looped evaluation on NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approaches. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Close-looped experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.2025-12-03T13:43:33Z17 pages, 17 figuresJia HuZhexi LianXuerun YanRuiang BiDou ShenYu RuanChunlong XiaHaoran Wanghttp://arxiv.org/abs/2604.04497v1One Model for All: Multi-Objective Controllable Language Models2026-04-06T07:48:32ZAligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications requiring scalable and customizable LLMs.2026-04-06T07:48:32ZPublished in Transactions on Machine Learning Research (03/2026): https://openreview.net/forum?id=qAM5PmvFYYQiang HeYucheng YangTianyi ZhouMeng FangMykola PechenizkiySetareh Maghsudihttp://arxiv.org/abs/2510.09901v2Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics2026-04-06T07:37:55ZComputing has long served as a cornerstone of scientific discovery. Recently, a paradigm shift has emerged with the rise of large language models (LLMs), introducing autonomous systems, referred to as agents, that accelerate discovery across varying levels of autonomy. These language agents provide a flexible and versatile framework that orchestrates interactions with human scientists, natural language, computer language and code, and physics. This paper presents our view and vision of LLM-based scientific agents and their growing role in transforming the scientific discovery lifecycle, from hypothesis discovery, experimental design and execution, to result analysis and refinement. We critically examine current methodologies, emphasizing key innovations, practical achievements, and outstanding limitations. Additionally, we identify open research challenges and outline promising directions for building more robust, generalizable, and adaptive scientific agents. Our analysis highlights the transformative potential of autonomous agents to accelerate scientific discovery across diverse domains.2025-10-10T22:26:26ZLianhao ZhouHongyi LingCong FuYepeng HuangMichael SunWendi YuXiaoxuan WangXiner LiXingyu SuJunkai ZhangXiusi ChenChenxing LiangXiaofeng QianHeng JiWei WangMarinka ZitnikShuiwang Jihttp://arxiv.org/abs/2604.04493v1SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models2026-04-06T07:36:48ZThe rapid growth of large language models (LLMs) presents significant deployment challenges due to their massive computational and memory demands. While model compression, such as network pruning, offers potential solutions, most existing methods often fail to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB eliminates the need for retraining and leverages activation-aware pruning scores to guide the decomposition process. Experiments on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% compared to existing methods at 50% compression and improving accuracy by up to 8.98% over the baseline on zero-shot tasks.2026-04-06T07:36:48ZZiwei LiYuang MaYi Kanghttp://arxiv.org/abs/2604.04490v1RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation2026-04-06T07:30:45ZThis paper presents RAVEN, a computationally efficient deep learning architecture for FMCW radar perception. The method processes raw ADC data in a chirp-wise streaming manner, preserves MIMO structure through independent receiver state-space encoders, and uses a learnable cross-antenna mixing module to recover compact virtual-array features. It also introduces an early-exit mechanism so the model can make decisions using only a subset of chirps when the latent state has stabilized. Across automotive radar benchmarks, the approach reports strong object detection and BEV free-space segmentation performance while substantially reducing computation and end-to-end latency compared with conventional frame-based radar pipelines.2026-04-06T07:30:45ZCVPR submission / conference paperComputer Vision and Pattern Recognition Conference 2026Anuvab SenMir Sayeed MohammadSaibal Mukhopadhyayhttp://arxiv.org/abs/2604.02368v2Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation2026-04-06T07:26:20ZAs Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses detailed rubrics with mostly 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis.. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.2026-03-27T11:28:15ZXue LiuXin MaYuxin MaYongchang PengDuo WangZhoufutu WenGe ZhangKaiyuan ZhangXinyu ChenTianci HeJiani HouLiang HuZiyun HuangYongzhe HuiJianpeng JiaoChennan JuYingru KongYiran LiMengyun LiuLuyao MaFei NiYiqing NiYueyan QiuYanle RenZilin ShiZaiyuan WangWenjie YueShiyu ZhangXinyi ZhangKaiwen ZhaoZhenwei ZhuShanshan WuQi ZhaoWenhao Huanghttp://arxiv.org/abs/2601.17581v3How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests2026-04-06T07:24:28ZAI coding agents are increasingly acting as autonomous contributors by generating and submitting pull requests (PRs). However, we lack empirical evidence on how these agent-generated PRs differ from human contributions, particularly in how they modify code and describe their changes. Understanding these differences is essential for assessing their reliability and impact on development workflows. Using the MSR 2026 Mining Challenge version of the AIDev dataset, we analyze 24,014 merged Agentic PRs (440,295 commits) and 5,081 merged Human PRs (23,242 commits). We examine additions, deletions, commits, and files touched, and evaluate the consistency between PR descriptions and their diffs using lexical and semantic similarity. Agentic PRs differ substantially from Human PRs in commit count (Cliff's $δ= 0.5429$) and show moderate differences in files touched and deleted lines. They also exhibit slightly higher description-to-diff similarity across all measures. These findings provide a large-scale empirical characterization of how AI coding agents contribute to open source development.2026-01-24T20:27:04ZAccepted at the 23rd IEEE/ACM International Conference on Mining Software Repositories - Mining Challenge TrackDaniel OgenrwotJohn Businge10.1145/3793302.3793603http://arxiv.org/abs/2509.24186v2Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks2026-04-06T07:24:03ZAccuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated, and items vary meaningfully in difficulty and discrimination.2025-09-29T02:06:13ZZhimeng LuoLixin WuAdam FrischDaqing Hehttp://arxiv.org/abs/2604.04485v1ECG Biometrics with ArcFace-Inception: External Validation on MIMIC and HEEDB2026-04-06T07:20:34ZECG biometrics has been studied mainly on small cohorts and short inter-session intervals, leaving open how identification behaves under large galleries, external domain shift, and multi-year temporal gaps. We evaluated a 1D
Inception-v1 model trained with ArcFace on an internal clinical corpus of 164,440 12-lead ECGs from 53,079 patients and tested it on larger cohorts derived from MIMIC-IV-ECG and HEEDB. The study used a unified closed-set leave-one-out
protocol with Rank@K and TAR@FAR metrics, together with scale, temporal-stress, reranking, and confidence analyses. Under general comparability, the system achieved Rank@1 of 0.9506 on ASUGI-DB, 0.8291 on MIMIC-GC, and 0.6884 on
HEEDB-GC. In the temporal stress test at constant gallery size, Rank@1 declined from 0.7853 to 0.6433 on MIMIC and from 0.6864 to 0.5560 on HEEDB from 1 to 5 years. Scale analysis on HEEDB showed monotonic degradation as gallery size
increased and recovery as more examinations per patient became available. On HEEDB-RR, post-hoc reranking further improved retrieval, with AS-norm reaching Rank@1 = 0.8005 from a 0.7765 baseline. ECG identity information therefore
remains measurable under externally validated large-scale closed-set conditions, but its operational quality is strongly affected by domain heterogeneity, longitudinal drift, gallery size, and second-stage score processing.2026-04-06T07:20:34ZArjuna Scagnettohttp://arxiv.org/abs/2604.04482v1Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models2026-04-06T07:12:46ZLearners' use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors' ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.2026-04-06T07:12:46ZAccepted as long paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)Dominik GlandorfFares FawziTanja Käserhttp://arxiv.org/abs/2604.04475v1Discrete Prototypical Memories for Federated Time Series Foundation Models2026-04-06T06:57:08ZLeveraging Large Language Models (LLMs) as federated learning (FL)-based time series foundation models offers a promising way to transfer the generalization capabilities of LLMs to time series data while preserving access to private data. However, the semantic misalignment between time-series data and the text-centric latent space of existing LLMs often leads to degraded performance. Meanwhile, the parameter-sharing mechanism in existing FL methods model heterogeneous cross-domain time-series data into a unified continuous latent space, which contradicts the fact that time-series semantics frequently manifest as discrete and recurring regimes. To address these limitations, we propose \textsc{FeDPM}, a federated framework for time-series foundation models based on discrete prototypical memories. Specifically, we learn local prototypical memory priors for intra-domain time-series data. We then align cross-domain memories to promote a unified discrete latent space and introduce a domain-specific memory update mechanism to balance shared and personalized prototypical knowledge. Extensive experiments demonstrate the efficiency and effectiveness of \textsc{FeDPM}. The code is publicly available at https://anonymous.4open.science/r/FedUnit-64D1.2026-04-06T06:57:08Z13 pages,5 figuresLiwei DengQingxiang LiuXinhe NiuShengchao ChenSheng SunYuankai WuGuodong LongYuxuan Lianghttp://arxiv.org/abs/2604.04474v1MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation2026-04-06T06:53:30ZDeep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g., 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.2026-04-06T06:53:30ZZhe FengShilong TaoHaonan SunShaohan ChenZhanxing ZhuYunhuai Liu