https://arxiv.org/api/CH9CPMg65nc1srPlWpN4ItUBUu02026-06-22T17:22:19Z11257961515http://arxiv.org/abs/2601.22777v2RASST: Retrieval-Augmented Simultaneous Speech Translation2026-06-13T01:58:28ZSimultaneous speech translation produces target text incrementally from partial speech input. Recent speech large language models have markedly improved SST quality but still struggle with rare and domain-specific terminology. Retrieval augmentation has helped in automatic speech recognition and neural machine translation, but extending it to SST is non-trivial: retrieval must be fast and accurate under partial speech, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which addresses both challenges. For accurate cross-modal retrieval under partial input, RASST trains a lightweight speech-text retriever that produces chunkwise terminology hints for the Speech LLM via multi-scale retrieval. To use these hints correctly, we synthesize training data that teaches the Speech LLM to decide whether and when to apply each retrieved term. Experiments on ACL 60/60 dev set and the ESO test set show that RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points, with negligible computational overhead.2026-01-30T09:59:24ZUnder ReviewJiaxuan LuoSiqi OuyangJiaxing XuLei Lihttp://arxiv.org/abs/2602.05060v2StagePilot: Stage-Level Planning for Long-Horizon Dialogue Simulation in Cybergrooming2026-06-13T01:48:05ZCybergrooming is an evolving threat to youth, requiring proactive educational interventions. We address this by modeling dialogue progression as a structured planning problem over stage-wise interactions. 
We propose StagePilot, a dialogue framework that separates stage-level planning from response generation, in which the model selects the next stage under constrained transitions and generates responses conditioned on it, enabling coherent and realistic progression. Reinforcement learning is used to learn stage-level policies from offline data, optimizing for both emotional alignment and goal-consistent progression. Our empirical experiments show that StagePilot generates more structured, coherent dialogue trajectories and reduces conversational stagnation compared to baselines; notably, the IQL+AWAC variant reaches the final stage more often while maintaining over 70% positive or neutral responses, yielding a 43% relative improvement.2026-02-04T21:22:45ZAccepted at the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2026)Heajun AnQi ZhangMinqian LiuXinyi ZhangSang Won LeeLifu HuangPamela J. WisniewskiJin-Hee Chohttp://arxiv.org/abs/2606.15044v1Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models2026-06-13T01:10:42ZMultilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. 
Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.2026-06-13T01:10:42ZKieron Seven Jun Wei LeeMuhammad Reza QoribAndrew Ivan SoegengHwee Tou Nghttp://arxiv.org/abs/2606.15037v1ReportQA: QA-Based Radiology Report Evaluation2026-06-13T00:43:03ZRadiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Due to heavy reliance on manual annotations, it is difficult for CE metrics to extend clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinical-related and flexible radiology report evaluation framework, supporting detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. 
Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.2026-06-13T00:43:03ZYiming ShiShaoshuai YangXi ChenHaolin LiHengyu ZhangChe JiangKaiwen WangXun ZhuDong XieFei WangDejing DouMiao LiJi Wuhttp://arxiv.org/abs/2606.15033v1Cloze: An Open Research Platform for Studying Human-AI Conversations in Mental Health Contexts2026-06-13T00:24:36ZCloze is an open-source web platform for conducting controlled, monitored studies of human-AI conversation in mental health research contexts. Consumer large language model (LLM) products such as ChatGPT, Claude, and Gemini are built for individual productivity, and offer researchers little experimental control, inconsistent data export, and no shared safety scaffolding that holds across providers. Cloze gives research teams a single environment in which they configure which models participants converse with, how the AI is instructed, how conversations are scheduled over time, and which safety constraints apply unconditionally, while every message is captured with full provenance (model version, prompt configuration, timing). 
The platform currently supports OpenAI, Anthropic, Google, and locally hosted open-weight models served through Ollama behind a unified interface, and runs in the cloud or fully on premises so that participant data need never leave an institution. Cloze is research infrastructure for building an evidence base on human-AI interaction in mental health contexts. It is not a therapeutic product.2026-06-13T00:24:36Z7 pages, 2 figures. Cloze is released under AGPL-3.0Matthew FlathersFrancesco CiprianiJohn Toroushttp://arxiv.org/abs/2606.15026v1Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals2026-06-12T23:50:33ZPhysiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). 
These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.2026-06-12T23:50:33ZAccepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: https://doi.org/10.1145/3807503.3819363Desta Haileselassie HagosSaurav Keshari AryalPatrick Ymele-LekiAnietie AndyLegand L. Burge10.1145/3807503.3819363http://arxiv.org/abs/2606.15017v1Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents2026-06-12T23:30:14ZOnline web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.2026-06-12T23:30:14ZSina HajimiriMasih AminbeidokhtiJose DolzIsmail Ben AyedIssam H. 
LaradjiSpandana GellaNicolas Gontierhttp://arxiv.org/abs/2512.21577v3A Unified Definition of Hallucination: It's The World Model, Stupid!2026-06-12T23:25:06ZDespite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.2025-12-25T08:42:18ZICML 2026. HalluWorld benchmark at https://github.com/DegenAI-Labs/HalluWorldEmmy LiuVarun GangalChelsea ZouMichael YuXiaoqi HuangAlex ChangZhuofu TaoKaran SinghSachin KumarSteven Y. Fenghttp://arxiv.org/abs/2605.17106v2HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools2026-06-12T23:23:22ZProduction LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. 
We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.2026-05-16T18:19:30Zpreprint v2Aashna GargSiddharth Singha RoyJinu JangFederico BrancasiShengyu Fuhttp://arxiv.org/abs/2606.15007v1Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning2026-06-12T22:56:12ZWe introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. 
We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.2026-06-12T22:56:12ZNVIDIA: Aaron BlakemanAaron ThomasAastha JhunjhunwalaAbhibha GuptaAbhinav KhattarAdam RajferAdi RenduchintalaAdil AsifAditya VavreAdriana Flores MirandaAhmad BilalAileen ZamanAjay HotchandaniAkanksha ShuklaAkhiad BercovichAleksander FicekAlex GronskiyAlex KondratenkoAlex SteinerAlex YeAlexander BukharinAlexandre MilesiAli TaghibakhshiAlice GattiAlisa LiuAlok KumarAmar PhanishayeeAmeya Sunil MahabaleshwarkarAmir KleinAmit ZukerAmnon GeifmanAnahita BhiwandiwallaAnanth SubramaniamAndrea SantilliAndrew FulksAndrew McHargAndrew TaoAndrii SkliarAnjulie AgrusaAnkur SrivastavaAnkur VermaAnna ShorsAnna WarnoAntoni-Joan Solergibert I LlaquetArham MehtaArkadiusz NowaczynskiArti JainAshwath AithalAshwin PoojaryAsif AhamedAsit MishraAsma Kuriparambil ThekkumpateAtefeh SohrabizadehAvinash KaurAvinash VemAyush 
DattaguptaBarath Subramaniam AnandanBardiya SadeghiBen LanirBenedikt SchiffererBesmira NushiBilal KartalBill ThiedeBita Darvish RouhaniBo DengBob SchatzBoris GinsburgBoxin WangBrad NemireBrandon NorickBrian DangBrian WestphalBrian YuBrucek KhailanyBryan CatanzaroCarlo del MundoCaryln AarishChankyu LeeChantal HwangCharbel SakrCharles WangCharlie TruongChen CuiCheng ChengCheng-Ping HsiehChenghao ZhangChenhui DengChintan PatelChris AlexiukChristian CosgroveChristian MunleyChristine HarveyChristopher ParisienChunyang ShenCoco LiCollin NealeCynthia GaoCyril MeurillonDan GilDan SuDan ZhaoDane CorneilDaniel AfrimiDaniel EgertDaniel KorzekwaDaniel LoDaniel MachlabDaniel SerebrenikDaniil SorokinDaria GitmanDaria LevyDarko StosicDavid MosallanezhadDavid YuDavit KaramyanDeena DoniaDeep DebroyDeepak NarayananDevin O'KellyDheeraj PeriDhruv NathawaniDi WuDima RekeshDivyanshu KakwaniDonald PlummerDong AnhDongfeng YuDongfu JiangDonnie KimDorrin PoorkayDuncan RiachDusan StosicDustin VanSteeEavan MengEdgar MinasyanEdward LinEileen Margaret Peters LongElad SarafinElad SegalElena LantzEllie EvansElliott NingEric ChungEric HarperEric Pham-HungEric TramelEric YangErick GalinkinErik PoundsErika Goncalves GoncalvesEvan BrionesEvan WuEvelina BakhturinaEvgeny TsykunovEwa DobrowolskaFaisal LadhakFarzan MemarianFay WangFei JiaFelipe SoaresFelipe Vieira FrujeriFeng ChenFengguang LinFerenc GalkoFrank SunFrankie SiinoFrida HouGal Hubara AgamGal KaplunGantavya BhattGargi PrasadGarvit KulshreshthaGeorge ArmstrongGerald ShenGiulio BorghesiGordana NeskovicGorkem BatmazGrace LamGreg MasonGreg PauloskiGrigor NalbandyanGrzegorz ChlebusGrzegorz KarchGuan-Ting LiuGuoming 
ZhangGuyue HuangHaggai MaronHaifeng QianHaim ElishaHaoxing RenHaran Kumar Shiv KumarHaribhau HudHarris NoverHarrison Saturley HallHayate IsoHelen NgoHerbert HumHerman SahotaHexin WangHimanshu SoniHovhannes TamoyanHua LiHuanhuan ChenHui LiHui WangHuy NguyenIan ChilesIdo GalilIdo ShahafIgor GitmanIgor ShovkunIlya LoshchilovIngo GuehringItamar SchenItay LevyItay NeemanIvan MoshkovIzik GolanIzzy PuttermanJaemin ChoiJakub SlowikowskiJan KautzJane Polak ScowcroftJared CasperJatin MitraJeffrey GlickJenny ChenJesse OliverJiacheng XuJiafan ZhuJialin SongJian ZhangJiantao JiaoJiaqi ZengJie LouJim KingJimmy ZhangJingquan WangJinhang ChoiJinju ChuJoey ConwayJoey GumanJohan JatkoJohannes RauschJohn KamaluJohn RobertsJohnny GrecoJohnny MenselJonah AlbenJonas YangJonathan CohenJonathan RaimanJoseph JenningsJoshua MabryJoshua PierceJoyjit DawJulien Veron VialardJunkeun YiJupinder ParmarKajal JainKan ZhuKari BriskiKatherine CheungKatherine LunaKeith WillowhawkKeith WyssKeshav SanthanamKevin ShihKezhi KongKhanh NguyenKhushi BhardwajKirthi Shankar SivamaniKonstantinos KrommydasKrishna C. 
PuvvadaKrzysztof PawelecKumar AnikKyle KepriosKylie DayLawrence McAfeeLeo DuLeon DerczynskiLi DingLinda LiuLingjie WuLior KadochLizzie WeiLuis VegaLuke RobisonLun SuMaarten Van SegbroeckMaciej Jakub MikulskiMaer Rodrigues de MeloMagda SypulaMahan FathiMakesh Narsimhan SreedharMakesh Tarun ChandranManoj KilaruMaor AshkenaziMarc CuevasMarc RomeijnMarcin ChochowskiMark CaiMark MozolewskiMarkus KlieglMarta Stepniewska-DziubinskaMartyna PatelkaMattei MachczynskiMatvei NovikovMauricio FerratoMaximilian GolubMehrzad SamadiMelissa CorpuzMengru WangMengxi WuMeredith PriceMeriem BoubdirMicah SchafferMichael AnderschMichael BooneMichael GschwindMichael LightstoneMichael LohMichal BienMichal ZawalskiMichelle GillMiguel MartinezMikail KhonaMike ChrzanowskiMike HoustonMingyuan MaMinseok LeeMohamed FawzyMohammad DabbahMohammad ShoeybiMostofa PatwaryNabin MulepatiNajeeb NabwaniNamit DhamejaNarimane HennouniNatalie HerethNathaniel PinckneyNave AlgariciNave AssafNetanel HaberNicholas KnightNick ReamaroonNickson QuakNidhi BhatiaNikhil DesaiNikolai LudwigNima TajbakhshNing XuNir AilonNirmal JuluruNitin NitinOfri MasadOleg RybakovOleksii HrinchukOleksii KuchaievOlivia ViessmannOlivier DelalleauOluwatobi OlabiyiOmer Ullman ArgovOmri PunyOren TroppPablo RibaltaPallab BhattacharyaPanos LampropoulosParth MannanPasha ShamisPatrick LegresleyPaul GibbonsPavlo MolchanovPawel MorkiszPeter DykasPeter JinPierre-Yves AquilantiPinky XuPiotr JanuszewskiPiotr LaskiewiczPooya JannatyPrakash GurumurthyPranav Prashant ThombrePrasoon VarshneyPritam GundechaPrzemek TredakPuhui MengQiyu WanRabeeh Karimi MahabadiRachel ObermanRachit GargRadha Sri-TharanRahul KanduRakshit SanadhyaRan El-YanivRan ZilbersteinRasoul ShafipourRay MacalisangRayen TianReka KovacsRenjie PiRick IzzoRima ShahbazyanRishabh GargRishi PuriRita Fernandes NevesRitchie ZhaoRitika BorkarRitu GalaRiyad IslamRobert ClarkRobert HesseRobert KirbyRoger WaleffeRohit WatveRoi KorenRon BannerRuoxi ZhangRussell J. 
HewettRyan PrengerRyan StewartRyota EgashiraSadegh MahdaviSaee PaliwalSagar SinghSahil ModiSalika DaveSamantha ShinagawaSamuel KrimanSandip BhaskarSangkug LymSanjay KariyappaSanjeev SatheeshSaran Vikas MurariSatish PasumarthiSaurabh MishraSaurav MuralidharanScott HaraSean NarentharenSelvaraj AnandarajSeonjin NaSeonmeyong BakSeonmyeong BakSepehr SameniSeph MardSerge PanevSeth HennemanSeth PoulosShahar MorShantanu AcharyaShaona GhoshSharath Turuvekere SreenivasSharon MendelsonShaun KotekShawn WangShay AharonShaya GharghabiSheng-Chieh LinShi ChenShiqing FanShirish BaskaranShreya GopaShrimai PrabhumoyeShubham PachoriShubham ToshniwalShuoyang DingShwetha KrishnamurthySiddharth SinghSimeng SunSirshak DasSivakumar Arayandi ThottakaraSmita IthapeSomshubra MajumdarSoumye SinghalSri Harsha SingudasuSridhar BhuvanapalliSrimukh VecchamStas SergienkoStefania AlborghettiStephen GeSu RongSugam Dipak DevareSukrit RaoSumeet Kumar BaruaSungsoo HaSunny GaiSuriya GunasekarSuseella PanguluriSuyog GuptaSviataslau HinzburhSweta PriyadarshiSyeda Nahida AkterTalor AbramovichTan BuiTanay VarshneyTatevik Ter-HovhannisyanTeodor-Dumitru EneTerry KongThanh DoTianhe ZhangTiffany MooreTijmen BlankevoortTim MoonTiyasa MitraTom BaloughTomasz GrzegorzekTomasz HliwiakTomer AsidaTomer Bar NatanTomer KerenTomer RonenTony SalimTony WangTraian RebedeaTugrul KonukTwinkle VashishthUdi KarpasUshnish DeVahid NooroziVenkat SrinivasanVenmugil ElangoVibhor AgrawalVictor CuiVijay KorthikantiVikas MehtaVinay RaoVirginia WuVitaly KurinVitaly LavrukhinVladimir AnisimovVu PhamWanli JiangWasi Uddin AhmadWataru IshiharaWei DuWei PingWeiheng ChaiWenliang DaiWesley HelmholzWill JenningsWill ZhuWojciech PrazuchXiaowei RenXiwen YuYan BreekYang ChenYang YuYangyi ChenYaniv GalronYashaswi KarnatiYejin ChoiYev MeyerYi-Fu WuYian ZhangYing LinYonatan GeifmanYonggan FuYoungeun KwonYu YaoYugi GuvvlaYuki HuangYunsheng LiuZach MosheZachary NewellZhilin WangZhiyu LiZhongbo ZhuZhuolin YangZihan LiuZijie YanZsolt-Alon 
Wertheimerhttp://arxiv.org/abs/2408.05568v2Metacognitive Myopia in Large Language Models2026-06-12T22:06:49ZLarge Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally embedded stereotypes, influence moral judgments, or amplify positive evaluations of majority groups. We propose metacognitive myopia as a cognitive-ecological framework accounting for a conglomerate of established and emerging LLM biases. Our theoretical framework posits that biased samples in the information environment cause five symptoms of metacognitive myopia in LLMs: integration of invalid embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. Moreover, it posits that the two main components of metacognition, monitoring and control, could account for these five symptoms. Accordingly, we further outline how monitoring and control could be approximated technically, for instance, through hidden parallel reasoning histories that allow interactive LLMs to evaluate risks of myopic inference before generating overt responses. Our theoretical framework provides a novel perspective on flawed human-machine interactions and agentic AI and raises significant ethical concerns regarding the implementation of LLMs in organizational structures and high-stakes decisions.2024-08-10T14:43:57ZFlorian ScholtenTobias R. RebholzMandy Hütterhttp://arxiv.org/abs/2510.06445v3A Survey on Agentic Security: Applications, Threats and Defenses2026-06-12T21:57:35ZLLM-based agents are now used throughout cybersecurity. While these agents facilitate powerful and autonomous security applications, their autonomy opens up new attack surfaces, and the security community is actively building defenses to secure them. Yet the literature on this subject has grown quickly and unevenly. 
Existing surveys treat applications, threats, and defenses in isolation, leaving no unified account of how an agent's capabilities, vulnerabilities, and countermeasures interconnect. In this work we present the first holistic survey of the agentic security landscape, structuring the field around the fundamental pillars of Applications, Threats and Defenses. We provide a comprehensive taxonomy of over 260 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. In addition, we provide detailed pillar-specific and cross-cutting analyses that show the security-lifecycle coverage of agentic applications, comparison between red-teaming and blue-teaming agents, and the adversarial use of red-teaming applications. On the threat side, we analyze the entry points and agent-loop stages that attacks target, their specificity to the agentic setting, and the threat models they assume. On the defense side, we analyze the prevailing defense strategies, their cost and security trade-offs, and where in the agent lifecycle they are deployed. We further map which defenses cover which attack classes and chart trends in agent architecture, backbone model usage, data modality coverage, and the growth of attack and defense research over time. Taken together, these findings indicate that agentic systems are structurally fragile by default and that securing them will require defenses that span the full agent lifecycle rather than single-layer fixes.2025-10-07T20:32:20ZAsif ShahriarMd Nafiu RahmanSadif AhmedFarig SadequeMd Rizwan Parvezhttp://arxiv.org/abs/2505.09655v5DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning2026-06-12T21:29:13ZPost-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. 
However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment. The code is available at https://github.com/xiwenc1/DRA-GRPO.2025-05-14T02:02:32ZACL2026Xiwen ChenWenhui ZhuPeijie QiuXuanzhao DongHao WangHaiyu WuHuayu LiAristeidis SotirasYalin WangAbolfazl Razihttp://arxiv.org/abs/2606.14961v1CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning2026-06-12T21:10:33ZChain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence--rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. 
We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence--rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.2026-06-12T21:10:33ZJuming XiongWeixin LiuKevin GuoCongning NiJunchao ZhuChongyu QuChao YanKatherine BrownAvinash BaidyaXiang GaoBradley MalinZhijun Yinhttp://arxiv.org/abs/2603.08999v3Learning When to Sample: Confidence-Aware Selective Sampling for Efficient Chain-of-Thought Reasoning2026-06-12T21:09:05ZLarge language models (LLMs) can achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet they often generate unnecessarily long reasoning paths that incur high inference cost. Self-consistency-based approaches push accuracy higher still, but they require sampling and aggregating multiple reasoning trajectories, leading to substantial computational overhead. In this paper, we introduce a confidence-aware selective sampling framework that, at inference time, analyzes a single reasoning trajectory to adaptively determine whether to rely on that trajectory alone or trigger multi-path sampling. The framework uses trajectory-level numeric features and sentence-level linguistic features extracted from reasoning states to guide selective multi-path reasoning. We train it on MedQA and evaluate it in-domain on MedQA and under calibration-only transfer on MathQA, MedMCQA, and MMLU, without further fine-tuning. 
Experimental results show that the proposed framework maintains comparable performance to full and efficient multi-path reasoning baselines, with accuracy changes of $-0.41 \pm 0.58$ and $-0.31 \pm 0.58$ percentage points, respectively, while reducing token usage by $71.7 \pm 5.0\%$ and $36.6 \pm 9.1\%$. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.2026-03-09T22:34:06ZJuming XiongKevin GuoCongning NiWexin LiuChao YanKatherine BrownAvinash BaidyaXiang GaoBradley MalinZhijun Yin
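
The selective-sampling idea in the last abstract above ("Learning When to Sample") can be illustrated with a small sketch. This is not the authors' implementation: the `Trajectory` type, the `confidence` score (mean token probability minus a penalty for hedging words), the `HEDGE_WORDS` list, the threshold of 0.8, and the majority-vote fallback are all assumptions made for illustration. The abstract says only that trajectory-level numeric features and sentence-level linguistic features decide whether to trust a single trajectory or trigger multi-path sampling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical hedging cues standing in for sentence-level linguistic features.
HEDGE_WORDS = ("maybe", "perhaps", "not sure", "possibly")


@dataclass
class Trajectory:
    text: str                 # the generated reasoning text
    answer: str               # the committed final answer
    token_probs: List[float]  # per-token probabilities (assumed to be available)


def confidence(traj: Trajectory) -> float:
    """Toy score: mean token probability minus 0.1 per hedging cue, floored at 0."""
    mean_prob = sum(traj.token_probs) / len(traj.token_probs)
    lowered = traj.text.lower()
    hedges = sum(lowered.count(word) for word in HEDGE_WORDS)
    return max(0.0, mean_prob - 0.1 * hedges)


def selective_answer(
    sample: Callable[[], Trajectory],
    threshold: float = 0.8,
    extra_paths: int = 4,
) -> str:
    """Keep the first trajectory if it looks confident; otherwise sample more and vote."""
    first = sample()
    if confidence(first) >= threshold:
        return first.answer  # single-path result, no extra tokens spent
    # Low confidence: fall back to self-consistency style majority voting.
    answers = [first.answer] + [sample().answer for _ in range(extra_paths)]
    return Counter(answers).most_common(1)[0][0]
```

In this sketch the extra sampling cost is paid only when the first trajectory scores below the threshold, which is the mechanism the abstract credits for the token savings. A learned classifier over richer features would replace the hand-written `confidence` function in any realistic setting.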