http://arxiv.org/abs/2601.18026v2 CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data 2026-06-08T23:11:58Z

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

2026-01-25T22:49:30Z 18 pages, 8 tables, 5 figures Pedro Ortiz Suarez Laurie Burchell Catherine Arnett Rafael Mosquera-Gómez Sara Hincapie-Monsalve Thom Vaughan Damian Stewart Malte Ostendorff Idris Abdulmumin Vukosi Marivate Shamsuddeen Hassan Muhammad Atnafu Lambebo Tonja Hend Al-Khalifa Nadia Ghezaiel Hammouda Verrah Otiende Tack Hwa Wong Jakhongir Saydaliev Melika Nobakhtian Muhammad Ravi Shulthan Habibi Chalamalasetti Kranti Carol Muchemi Khang Nguyen Faisal Muhammad Adam Luis Frentzen Salim Reem Alqifari Cynthia Amol Joseph Marvin Imperial Ilker Kesen Ahmad Mustafid Pavel Stepachev Leshem Choshen David Anugraha Hamada Nayel Seid Muhie Yimam Vallerie Alexandra Putra My Chiffon Nguyen Azmine Toushik Wasi Gouthami Vadithya Rob van der Goot Lanwenn ar C'horr Karan Dua Andrew Yates Mithil Bangera Yeshil Bangera Hitesh Laxmichand Patel Shu Okabe Fenal Ashokbhai Ilasariya Dmitry Gaynullin Genta Indra Winata Yiyuan Li Juan Pablo Martínez Amit Agarwal Ikhlasul Akmal Hanif Raia Abu Ahmad Esther Adenuga Filbert Aurelian Tjiaranata Weerayut Buaphet Michael Anugraha Sowmya Vajjala Benjamin Rice Azril Hafizi Amirudin Jesujoba O. Alabi Srikant Panda Yassine Toughrai Bruhan Kyomuhendo Daniel Ruffinelli Akshata A Manuel Goulão Ej Zhou Ingrid Gabriela Franco Ramirez Cristina Aggazzotti Konstantin Dobler Jun Kevin Quentin Pagès Nicholas Andrews Nuhu Ibrahim Mattes Ruckdeschel Amr Keleg Mike Zhang Casper Muziri Saron Samuel Sotaro Takeshita Kun Kerdthaisong Luca Foppiano Rasul Dent Tommaso Green Ahmad Mustapha Wali Kamohelo Makaaka Vicky Feliren Inshirah Idris Hande Celikkanat Abdulhamid Abubakar Jean Maillard Benoît Sagot Thibault Clérice Kenton Murray Sarah Luger http://arxiv.org/abs/2602.17907v3 Improving Topic Modeling by Distilling Soft Labels from Language Models 2026-06-08T23:05:44Z

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we introduce a novel topic model training framework by Distilling Soft Labels (DSL) from Language Models (LMs). To construct the contextually enriched reconstruction signals, we project the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary, and train the topic models to reconstruct the soft labels using the LM hidden states. This produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Extensive experiments demonstrate that DSL achieves substantial improvements in topic coherence and assignment accuracy over existing baselines. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.

2026-02-20T00:12:04Z 22 pages, 5 figures. Camera-ready version for ICML 2026 Raymond Li Amirhossein Abaskohi Chuyuan Li Gabriel Murray Giuseppe Carenini http://arxiv.org/abs/2606.06622v2 UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs 2026-06-08T22:52:18Z

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. KS@N is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
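The KS@N idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`ks_statistic`, `fails_to_reject`, `ks_at_n`), the 0.05 significance level, the asymptotic critical value, and the choice to average over repeated trials of a single problem are all assumptions made for illustration. The abstract states only that the metric is the rate at which model samples of size N are not rejected against ground-truth samples.

```python
import math
import random


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Consume every value equal to v in both samples before comparing CDFs.
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def fails_to_reject(xs, ys, c_alpha=1.358):
    """Asymptotic two-sample KS decision; c_alpha = 1.358 corresponds to alpha of about 0.05.

    The significance level is an assumption, not taken from the abstract.
    """
    n, m = len(xs), len(ys)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(xs, ys) <= critical


def ks_at_n(model_sampler, truth_sampler, n, trials=200, seed=0):
    """Fraction of trials in which n model samples are not rejected against n true samples."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(trials):
        model = [model_sampler(rng) for _ in range(n)]
        truth = [truth_sampler(rng) for _ in range(n)]
        passes += fails_to_reject(model, truth)
    return passes / trials
```

Under these assumptions, a sampler that always returns the same plausible value (the collapse behaviour the abstract describes) is rejected in every trial against a uniform target, while a sampler that draws from the target itself passes in most trials.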

2026-06-04T18:18:02Z Amirhossein Abaskohi Amirhossein Dabiriaghdam Liang Luo Ellie Dingqiao Wen Lele Wang Giuseppe Carenini Peter West http://arxiv.org/abs/2505.23851v2 ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark 2026-06-08T22:21:42Z

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present \textbf{ASyMOB}, a high-resolution dataset of \textit{35,368} validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, \textbf{ASyMOB} systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent \textit{regime shift} in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. \textbf{ASyMOB} serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.

2025-05-28T23:11:14Z Published in ICML2026: https://icml.cc/virtual/2026/poster/63549 Code repository: https://github.com/RamanujanMachine/ASyMOB Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark Michael Shalyt Rotem Elimelech Ido Kaminer http://arxiv.org/abs/2604.15771v2 Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing 2026-06-08T22:08:39Z

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.

2026-04-17T07:25:43Z Kai Wei Raymond Li Xi Zhu Zhaoqian Xue Jiaojiao Han Jingcheng Niu Fan Yang http://arxiv.org/abs/2605.03344v2 RAG over Thinking Traces Can Improve Reasoning Tasks 2026-06-08T22:01:42Z

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.

2026-05-05T04:03:28Z Negar Arabzadeh Wenjie Ma Sewon Min Matei Zaharia http://arxiv.org/abs/2604.14397v2 Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection 2026-06-08T21:41:00Z

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.
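The project-and-filter strategy summarized above can be sketched roughly as follows. This is an illustrative toy, not the ExpandNet code: the function name `project_and_filter`, the data layout (lemma and synset pairs, index-pair alignments, a set of dictionary pairs), and the synset identifiers in the usage note are all hypothetical.

```python
def project_and_filter(source_tokens, target_lemmas, alignment, dictionary):
    """Project synsets from sense-tagged source tokens onto aligned target lemmas.

    source_tokens: list of (source_lemma, synset_id or None), one per source token
    target_lemmas: list of target-language lemmas, one per target token
    alignment:     list of (source_index, target_index) pairs from a word aligner
    dictionary:    set of (source_lemma, target_lemma) translation pairs
    Returns a set of (synset_id, target_lemma) candidate senses.
    """
    senses = set()
    for src_idx, tgt_idx in alignment:
        src_lemma, synset = source_tokens[src_idx]
        if synset is None:
            continue  # the source token carries no sense annotation
        tgt_lemma = target_lemmas[tgt_idx]
        # Filter step: keep a projection only if the bilingual dictionary supports it.
        if (src_lemma, tgt_lemma) in dictionary:
            senses.add((synset, tgt_lemma))
    return senses
```

For example, with source tokens `[("bank", "bank.n.01"), ("open", None), ("river", "river.n.01")]`, target lemmas `["banque", "ouvrir", "maison"]`, a one-to-one alignment, and a dictionary containing `("bank", "banque")` and `("river", "rivière")`, only the pair `("bank.n.01", "banque")` survives: the untagged token is skipped and the unsupported alignment of "river" to "maison" is filtered out.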

2026-04-15T20:27:26Z Paper presented at Canadian AI 2026 David Basil Chirooth Girigowda Bradley Hauer Sahir Momin Ning Shi Grzegorz Kondrak http://arxiv.org/abs/2606.10199v1 A Continuous-Time Markov Chain Framework for Insertion Language Models 2026-06-08T21:39:43Z

Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.

2026-06-08T21:39:43Z Accepted at AISTATS 2026. Code is available at https://github.com/dhruvdcoder/ctmc_dilm Dhruvesh Patel Benjamin Rozonoyer Soumitra Das Tahira Naseem Tim G. J. Rudner Andrew McCallum http://arxiv.org/abs/2604.22565v2 Learning Evidence Highlighting for Frozen LLMs 2026-06-08T21:29:43Z

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.

2026-04-24T13:57:19Z Shaoang Li Yanhang Shi Yufei Li Mingfu Liang Xiaohan Wei Yunchen Pu Fei Tian Chonglin Sun Frank Shyu Luke Simon Sandeep Pandey Xi Liu Jian Li http://arxiv.org/abs/2509.25760v2 TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning 2026-06-08T21:01:17Z

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.
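A ternary reward combined with GRPO-style group normalization can be sketched as below. The reward values (+1 for a correct answer, 0 for an abstention, -1 for a hallucination), the function names, and the use of the population standard deviation are assumptions for illustration; the abstract says only that the reward distinguishes the three outcomes.

```python
import statistics


def ternary_reward(outcome):
    """Map a judged outcome to a scalar reward (values are assumed, not from the paper)."""
    rewards = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}
    return rewards[outcome]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled responses judged by some external verifier.
outcomes = ["correct", "abstain", "hallucination", "hallucination"]
advantages = group_relative_advantages([ternary_reward(o) for o in outcomes])
```

With these values, an abstention sits strictly between a correct answer and a hallucination within each group, so the policy is pushed toward abstaining over guessing wrongly while still preferring a correct answer when one is available.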

2025-09-30T04:25:17Z ICML 2026. Code: https://github.com/facebookresearch/TruthRL Zhepei Wei Xiao Yang Kai Sun Jiaqi Wang Rulin Shao Jingxiang Chen Mohammad Kachuee Teja Gollapudi Yiwei Liao Nicolas Scheffer Rakesh Wanga Anuj Kumar Yu Meng Wen-tau Yih Xin Luna Dong http://arxiv.org/abs/2606.10159v1 Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community 2026-06-08T20:38:06Z

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

2026-06-08T20:38:06Z Lin Li Qi Zhang Xander Davies Jianing Qiu Yarin Gal http://arxiv.org/abs/2606.10156v1 $τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems 2026-06-08T20:35:45Z

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
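The pass^k reliability metric mentioned above can be sketched as follows, assuming it follows the estimator popularized by τ-bench: for a task with c successes out of n independent trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The abstract does not spell out the estimator, so this reading and the function names are assumptions.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the probability that k attempts on one task all succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(success_counts, num_trials, k):
    """Average the per-task estimate over all tasks in the benchmark."""
    return sum(pass_hat_k(num_trials, c, k) for c in success_counts) / len(success_counts)
```

Because the estimate for a task can only shrink as k grows, a benchmark-level score that drops between pass^1 and pass^4 indicates tasks that are solved on some attempts but not consistently.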

2026-06-08T20:35:45Z Bharath Sivaram Narasimhan Karthik R Narasimhan http://arxiv.org/abs/2510.04491v3 Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents 2026-06-08T20:35:15Z

Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $τ$-Bench to $τ$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $τ$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $τ$-Trait across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.
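Scaling and composing trait directions at inference time can be pictured with a toy activation-steering sketch. It is a generic illustration, not the TraitBasis method: how the directions are learned and where they are applied in the model are not described in the abstract, and the function name, trait names, and vectors below are hypothetical.

```python
def steer(hidden, trait_vectors, strengths):
    """Add scaled trait directions to one hidden-state vector.

    hidden:        list of floats (one activation vector)
    trait_vectors: dict mapping trait name to a direction of the same length
    strengths:     dict mapping trait name to a scalar; several traits compose by summation
    """
    steered = list(hidden)
    for name, alpha in strengths.items():
        direction = trait_vectors[name]
        if len(direction) != len(steered):
            raise ValueError(f"direction for {name!r} has the wrong dimension")
        steered = [h + alpha * d for h, d in zip(steered, direction)]
    return steered


# Hypothetical two-dimensional directions for two user traits.
traits = {"impatience": [1.0, 0.0], "skepticism": [0.0, 1.0]}
shifted = steer([0.0, 1.0], traits, {"impatience": 2.0, "skepticism": -0.5})
```

In this sketch the scalar controls how strongly a trait is expressed, a negative scalar pushes away from it, and passing several traits at once composes them, which mirrors the "controlled, scaled, composed" properties the abstract lists.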

2025-10-06T05:03:57Z ACL 2026 [Oral] Muyu He Anand Kumar Tsach Mackey Meghana Rajeev James Zou Nazneen Rajani http://arxiv.org/abs/2606.10147v1 From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs 2026-06-08T20:26:09Z

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B), leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2026-06-08T20:26:09Z 40 pages, 29 figures Wish Suharitdamrong Muhammad Awais Xiatian Zhu Sara Atito http://arxiv.org/abs/2510.04514v3 ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering 2026-06-08T20:02:58Z

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent is (a) effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

2025-10-06T06:05:36Z Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/) Rachneet Kaur Nishan Srishankar Zhen Zeng Sumitra Ganesh Manuela Veloso