https://arxiv.org/api/obMfi0N4zpkBkjggUgryQIw4ONU
Updated: 2026-06-10T07:53:21Z

http://arxiv.org/abs/2601.18026v2
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Updated: 2026-06-08T23:11:58Z
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain.
We make CommonLID and the code used to create it available under an open, permissive license.
Published: 2026-01-25T22:49:30Z
Comment: 18 pages, 8 tables, 5 figures
Authors: Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate, Shamsuddeen Hassan Muhammad, Atnafu Lambebo Tonja, Hend Al-Khalifa, Nadia Ghezaiel Hammouda, Verrah Otiende, Tack Hwa Wong, Jakhongir Saydaliev, Melika Nobakhtian, Muhammad Ravi Shulthan Habibi, Chalamalasetti Kranti, Carol Muchemi, Khang Nguyen, Faisal Muhammad Adam, Luis Frentzen Salim, Reem Alqifari, Cynthia Amol, Joseph Marvin Imperial, Ilker Kesen, Ahmad Mustafid, Pavel Stepachev, Leshem Choshen, David Anugraha, Hamada Nayel, Seid Muhie Yimam, Vallerie Alexandra Putra, My Chiffon Nguyen, Azmine Toushik Wasi, Gouthami Vadithya, Rob van der Goot, Lanwenn ar C'horr, Karan Dua, Andrew Yates, Mithil Bangera, Yeshil Bangera, Hitesh Laxmichand Patel, Shu Okabe, Fenal Ashokbhai Ilasariya, Dmitry Gaynullin, Genta Indra Winata, Yiyuan Li, Juan Pablo Martínez, Amit Agarwal, Ikhlasul Akmal Hanif, Raia Abu Ahmad, Esther Adenuga, Filbert Aurelian Tjiaranata, Weerayut Buaphet, Michael Anugraha, Sowmya Vajjala, Benjamin Rice, Azril Hafizi Amirudin, Jesujoba O. Alabi, Srikant Panda, Yassine Toughrai, Bruhan Kyomuhendo, Daniel Ruffinelli, Akshata A, Manuel Goulão, Ej Zhou, Ingrid Gabriela Franco Ramirez, Cristina Aggazzotti, Konstantin Dobler, Jun Kevin, Quentin Pagès, Nicholas Andrews, Nuhu Ibrahim, Mattes Ruckdeschel, Amr Keleg, Mike Zhang, Casper Muziri, Saron Samuel, Sotaro Takeshita, Kun Kerdthaisong, Luca Foppiano, Rasul Dent, Tommaso Green, Ahmad Mustapha Wali, Kamohelo Makaaka, Vicky Feliren, Inshirah Idris, Hande Celikkanat, Abdulhamid Abubakar, Jean Maillard, Benoît Sagot, Thibault Clérice, Kenton Murray, Sarah Luger

http://arxiv.org/abs/2602.17907v3
Improving Topic Modeling by Distilling Soft Labels from Language Models
Updated: 2026-06-08T23:05:44Z
Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity.
In this work, we introduce a novel topic model training framework by Distilling Soft Labels (DSL) from Language Models (LMs). To construct the contextually enriched reconstruction signals, we project the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary, and train the topic models to reconstruct the soft labels using the LM hidden states. This produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Extensive experiments demonstrate that DSL achieves substantial improvements in topic coherence and assignment accuracy over existing baselines. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
Published: 2026-02-20T00:12:04Z
Comment: 22 pages, 5 figures. Camera-ready version for ICML 2026
Authors: Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

http://arxiv.org/abs/2606.06622v2
UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
Updated: 2026-06-08T22:52:18Z
We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs.
UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
Published: 2026-06-04T18:18:02Z
Authors: Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

http://arxiv.org/abs/2505.23851v2
ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Updated: 2026-06-08T22:21:42Z
Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics.
Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.
Published: 2025-05-28T23:11:14Z
Comment: Published in ICML2026: https://icml.cc/virtual/2026/poster/63549 Code repository: https://github.com/RamanujanMachine/ASyMOB Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark
Authors: Michael Shalyt, Rotem Elimelech, Ido Kaminer

http://arxiv.org/abs/2604.15771v2
Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
Updated: 2026-06-08T22:08:39Z
Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router.
The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.
Published: 2026-04-17T07:25:43Z
Authors: Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han, Jingcheng Niu, Fan Yang

http://arxiv.org/abs/2605.03344v2
RAG over Thinking Traces Can Improve Reasoning Tasks
Updated: 2026-06-08T22:01:42Z
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025--2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora.
For instance, on AIME 2025--2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
Published: 2026-05-05T04:03:28Z
Authors: Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

http://arxiv.org/abs/2604.14397v2
Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection
Updated: 2026-06-08T21:41:00Z
We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient.
We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.
Published: 2026-04-15T20:27:26Z
Comment: Paper presented at Canadian AI 2026
Authors: David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

http://arxiv.org/abs/2606.10199v1
A Continuous-Time Markov Chain Framework for Insertion Language Models
Updated: 2026-06-08T21:39:43Z
Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.
Published: 2026-06-08T21:39:43Z
Comment: Accepted at AISTATS 2026. Code is available at https://github.com/dhruvdcoder/ctmc_dilm
Authors: Dhruvesh Patel, Benjamin Rozonoyer, Soumitra Das, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

http://arxiv.org/abs/2604.22565v2
Learning Evidence Highlighting for Frozen LLMs
Updated: 2026-06-08T21:29:43Z
Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers.
HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
Published: 2026-04-24T13:57:19Z
Authors: Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

http://arxiv.org/abs/2509.25760v2
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Updated: 2026-06-08T21:01:17Z
While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness.
In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from the enhanced capability of LLMs to recognize their knowledge boundary, which avoids the over-conservatism seen in the baselines.
Published: 2025-09-30T04:25:17Z
Comment: ICML 2026. Code: https://github.com/facebookresearch/TruthRL
Authors: Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

http://arxiv.org/abs/2606.10159v1
Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community
Updated: 2026-06-08T20:38:06Z
AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes.
We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.
Published: 2026-06-08T20:38:06Z
Authors: Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

http://arxiv.org/abs/2606.10156v1
$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems
Updated: 2026-06-08T20:35:45Z
As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue.
By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
Published: 2026-06-08T20:35:45Z
Authors: Bharath Sivaram Narasimhan, Karthik R Narasimhan

http://arxiv.org/abs/2510.04491v3
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
Updated: 2026-06-08T20:35:15Z
Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $τ$-Bench to $τ$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $τ$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior.
Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $τ$-Trait across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.
Published: 2025-10-06T05:03:57Z
Comment: ACL 2026 [Oral]
Authors: Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

http://arxiv.org/abs/2606.10147v1
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
Updated: 2026-06-08T20:26:09Z
Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams.
Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.
Published: 2026-06-08T20:26:09Z
Comment: 40 pages, 29 figures
Authors: Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

http://arxiv.org/abs/2510.04514v3
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Updated: 2026-06-08T20:02:58Z
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension.
ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
Published: 2025-10-06T06:05:36Z
Comment: Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)
Authors: Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso