https://arxiv.org/api/IYK0GQAjN78GM7/IGiS8491nF0w
2026-06-10T14:32:10Z
18383825515

http://arxiv.org/abs/2606.10286v1
Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling
2026-06-09T01:20:38Z
Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator-driven Large Language Model (LLM) scheduling framework in which the LLM acts as an autonomous decision-making agent, guided at each step by a custom simulator that encodes geotechnical precedence, extraction-processing coupling, and dynamic capacity constraints directly into the action generation mechanism. Operating entirely zero-shot within a closed, data-secure environment, the framework produces complete, interpretable extraction and processing schedules without cloud-based inference, domain-specific fine-tuning, or retraining. To provide a trustworthy performance benchmark, a novel MILP formulation is developed that incorporates realistic operational and geotechnical constraints. Evaluated across mining instances of varying scale and time periods, the LLM-based framework recovers between 94\% and 99\% of the MILP optimal NPV while scaling linearly in computation time. 
These results position simulator-constrained LLM agents as a practical and scalable alternative to classical optimization for long-horizon industrial scheduling under complex operational constraints.
2026-06-09T01:20:38Z
Mustavi Ibne Masum, Thiago Eustaquio Alves de Oliveira, Mahzabeen Emu

http://arxiv.org/abs/2602.12424v2
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
2026-06-09T01:12:23Z
Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
2026-02-12T21:28:46Z
32 pages, 9 figures. 
Accepted by ICLR 2026
Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

http://arxiv.org/abs/2606.10279v1
Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction
2026-06-09T01:00:04Z
Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.
2026-06-09T01:00:04Z
Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

http://arxiv.org/abs/2606.10278v1
Towards Robust Arabic Speech Emotion Recognition with Deep Learning
2026-06-09T00:59:43Z
Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. 
While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies.
To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations.
Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling.
The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings.
2026-06-09T00:59:43Z
21 pages, 16 figures, 11 tables. Submitted manuscript
Youcef Soufiane Gheffari, Samiya Silarbi

http://arxiv.org/abs/2604.15414v2
Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
2026-06-09T00:58:37Z
Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on \emph{single-model preservation}, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of \emph{loss of plasticity} that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce \textsc{TeLAPA} (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining \emph{skill-aligned neighborhoods} with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, \textsc{TeLAPA} learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. 
Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.
2026-04-16T17:06:54Z
Lute Lillo, Nick Cheney

http://arxiv.org/abs/2506.09171v2
Fact-Augmented Lookahead Planning for LLM Agents
2026-06-09T00:55:52Z
Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insufficient. We introduce LWM-Planner, a fact-augmented lookahead planning framework that improves agent behavior purely through in-context learning. After each episode, the agent extracts task-critical atomic facts from its trajectories, validates candidates with a lightweight predictive-consistency filter (and optionally compresses them), and uses the resulting fact set to condition action proposal, single-step latent world-model simulation, and state-value estimation. Planning then proceeds via recursive, depth-limited lookahead over candidate trajectories conditioned on the accumulated facts and recent history, enabling online improvement without parameter updates. We provide abstraction-style motivation: treating facts as reducing state aliasing (proxy $\epsilon_{\mathrm{sim}}$) and fact-conditioned simulation as lowering one-step error (proxy $\delta_{\mathrm{model}}$), without claiming formal guarantees. 
Empirically, on text FrozenLake variants, CrafterMini, and ALFWorld, the approach improves cumulative return over ReAct/Reflexion and search-only baselines, suggesting that additional test-time search is most useful when grounded by compact, experience-derived facts.
2025-06-10T18:36:31Z
Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026). Camera-ready version. 9-page main text plus appendices (63 pages total), 1 figure
Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

http://arxiv.org/abs/2606.10276v1
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction
2026-06-09T00:50:29Z
For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. 
In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.
2026-06-09T00:50:29Z
We provide video demos and code in: https://project-edith.github.io
Dongjun Lee, Juheon Choi, Dong Kyu Shin, Sinjae Kang, Kimin Lee

http://arxiv.org/abs/2506.08134v4
Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem
2026-06-09T00:39:28Z
Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. 
We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.
2025-06-09T18:37:14Z
18 pages, 3 figures. Accepted (Oral) at the ICML 2026 Position Paper Track
Qiyao Wei, Samuel Holt, Jing Yang, Markus Wulfmeier, Mihaela van der Schaar

http://arxiv.org/abs/2606.10267v1
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents
2026-06-09T00:24:00Z
Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. 
More information and video at jiahenghu.github.io/hi-vla.
2026-06-09T00:24:00Z
Jiaheng Hu, Mohit Shridhar, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jie Tan, Annie Xie

http://arxiv.org/abs/2606.10254v1
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
2026-06-08T23:40:34Z
While Large Language Models (LLMs) have achieved near-perfect performance in \emph{solving} high-school mathematics, their ability to \emph{evaluate} the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce \textbf{RealMath-Eval}, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error ($\sim$2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark ``Evaluation Gap'': judges are considerably more accurate and consistent on synthetic text (MSE $\sim$1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a ``structural collapse'' into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. 
Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.
2026-06-08T23:40:34Z
Code available at https://github.com/RicharMd/RealMath-Eval , Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval
Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

http://arxiv.org/abs/2605.16430v2
A Theory of Training Profit-Optimal LLMs
2026-06-08T23:38:28Z
Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). 
Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
2026-05-14T18:57:40Z
Minor edits for preprint
Sophie Hao, William Merrill

http://arxiv.org/abs/2606.10250v1
Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning
2026-06-08T23:36:29Z
Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical factor contributing to non-identically distributed data (non-IID). Building on several previous attempts, we define and analyze imbalance issues in FL at three levels: inter-case, inter-class, and inter-client. Inter-case imbalance addresses the imbalance in every single class; inter-class imbalance compares the number of samples between different classes. Inter-client imbalance represents different skewness of local data between clients. Based on these concepts, we propose FedBB, which consists of two main components: (1) The Positive Negative Balanced (PNB) loss function addresses the inter-case and inter-class imbalances in local training, enhancing generalization on highly skewed local client datasets. It optimizes both multi-label and multi-class classifications by assigning higher weights to minority cases or classes. (2) Client Balanced Reweighting (CBR) reweights clients based on inter-client imbalance during model aggregation, giving greater weight to models trained on less skewed datasets. Various experiments on X-ray and natural image datasets demonstrate that FedBB outperforms other algorithms in both performance and efficiency. 
Additionally, it requires limited statistical information, which is beneficial for privacy protection. Through ablation studies, we show that PNB loss and CBR independently contribute to performance. As FedBB aims to build a global model that accurately classifies all classes, it can serve as a baseline for both generic and personalized FL.
2026-06-08T23:36:29Z
27 pages, 5 figures, 13 tables. Accepted for publication in Neurocomputing (2025). Author Accepted Manuscript
Neurocomputing, Volume 626, 2025, Article 129528
Haengbok Chung, Jae Sung Lee
10.1016/j.neucom.2025.129528

http://arxiv.org/abs/2509.04154v7
Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation
2026-06-08T23:28:32Z
We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.
2025-09-04T12:29:14Z
Peter Racioppo

http://arxiv.org/abs/2606.10246v1
Linguistically Augmented Audio Speech Data (LinguAS)
2026-06-08T23:26:39Z
Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. 
Yet, most detection models are trained to make inference on frame-level audio features alone without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically-chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which is annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs significantly outperformed the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.
2026-06-08T23:26:39Z
Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, Vandana P. Janeja

http://arxiv.org/abs/2606.10244v1
YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale
2026-06-08T23:21:14Z
We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. 
To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.
2026-06-08T23:21:14Z
Project page: https://yubi.airoa.io/
Takehiko Ohkawa, Jumpei Arima, Yuki Noguchi, Masatoshi Tateno, Makoto Sugiura, Takuya Okubo, Kengo Ikeuchi, Yuma Shin, Hiroki Nishizawa, Naoaki Kanazawa, Yuki Wakayama, Daiki Fukunaga, Koshi Makihara, Tomohiro Motoda, Floris Erich, Yukiyasu Domae, Tatsuya Matsushima, Yohishiro Okumatsu, Kei Ota