http://arxiv.org/abs/2601.22777v2 RASST: Retrieval-Augmented Simultaneous Speech Translation 2026-06-13T01:58:28Z

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have markedly improved SST quality but still struggle with rare and domain-specific terminology. Retrieval augmentation has helped in automatic speech recognition and neural machine translation, but extending it to SST is non-trivial: retrieval must be fast and accurate under partial speech, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which addresses both challenges. For accurate cross-modal retrieval under partial input, RASST trains a lightweight speech-text retriever that produces chunkwise terminology hints for the Speech LLM via multi-scale retrieval. To use these hints correctly, we synthesize training data that teaches the Speech LLM to decide whether and when to apply each retrieved term. Experiments on the ACL 60/60 dev set and the ESO test set show that RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points, with negligible computational overhead.

2026-01-30T09:59:24Z Under Review Jiaxuan Luo Siqi Ouyang Jiaxing Xu Lei Li http://arxiv.org/abs/2602.05060v2 StagePilot: Stage-Level Planning for Long-Horizon Dialogue Simulation in Cybergrooming 2026-06-13T01:48:05Z

Cybergrooming is an evolving threat to youth, requiring proactive educational interventions. We address this by modeling dialogue progression as a structured planning problem over stage-wise interactions. We propose StagePilot, a dialogue framework that separates stage-level planning from response generation, in which the model selects the next stage under constrained transitions and generates responses conditioned on it, enabling coherent and realistic progression. Reinforcement learning is used to learn stage-level policies from offline data, optimizing for both emotional alignment and goal-consistent progression. Our empirical experiments show that StagePilot generates more structured, coherent dialogue trajectories and reduces conversational stagnation compared to baselines; notably, the IQL+AWAC variant reaches the final stage more often while maintaining over 70% positive or neutral responses, yielding a 43% relative improvement.
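
Selecting the next stage under constrained transitions can be illustrated with a minimal sketch. The stage labels, the transition table, and the scores below are hypothetical placeholders and are not taken from the paper; the sketch only shows how a policy's scores might be restricted to the stages reachable from the current one before the best is chosen.

```python
# Hypothetical stage graph (illustrative only, not the paper's stage inventory).
ALLOWED_TRANSITIONS = {
    "s0": ["s0", "s1"],
    "s1": ["s1", "s2"],
    "s2": ["s2"],
}


def select_next_stage(current, scores):
    """Return the highest-scoring stage among those reachable from `current`.

    `scores` maps stage name -> policy score; stages that are not reachable
    from `current` are ignored even if they score higher.
    """
    candidates = ALLOWED_TRANSITIONS[current]
    return max(candidates, key=lambda stage: scores.get(stage, float("-inf")))


# "s2" scores highest overall but is not reachable from "s0".
next_stage = select_next_stage("s0", {"s0": 0.1, "s1": 0.5, "s2": 0.9})
```

A response generator would then be conditioned on the selected stage; that step is omitted here.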

2026-02-04T21:22:45Z Accepted at the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2026) Heajun An Qi Zhang Minqian Liu Xinyi Zhang Sang Won Lee Lifu Huang Pamela J. Wisniewski Jin-Hee Cho http://arxiv.org/abs/2606.15044v1 Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models 2026-06-13T01:10:42Z

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.

2026-06-13T01:10:42Z Kieron Seven Jun Wei Lee Muhammad Reza Qorib Andrew Ivan Soegeng Hwee Tou Ng http://arxiv.org/abs/2606.15037v1 ReportQA: QA-Based Radiology Report Evaluation 2026-06-13T00:43:03Z

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Because CE metrics rely heavily on manual annotations, they are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework that supports detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.

2026-06-13T00:43:03Z Yiming Shi Shaoshuai Yang Xi Chen Haolin Li Hengyu Zhang Che Jiang Kaiwen Wang Xun Zhu Dong Xie Fei Wang Dejing Dou Miao Li Ji Wu http://arxiv.org/abs/2606.15033v1 Cloze: An Open Research Platform for Studying Human-AI Conversations in Mental Health Contexts 2026-06-13T00:24:36Z

Cloze is an open-source web platform for conducting controlled, monitored studies of human-AI conversation in mental health research contexts. Consumer large language model (LLM) products such as ChatGPT, Claude, and Gemini are built for individual productivity, and offer researchers little experimental control, inconsistent data export, and no shared safety scaffolding that holds across providers. Cloze gives research teams a single environment in which they configure which models participants converse with, how the AI is instructed, how conversations are scheduled over time, and which safety constraints apply unconditionally, while every message is captured with full provenance (model version, prompt configuration, timing). The platform currently supports OpenAI, Anthropic, Google, and locally hosted open-weight models served through Ollama behind a unified interface, and runs in the cloud or fully on premises so that participant data need never leave an institution. Cloze is research infrastructure for building an evidence base on human-AI interaction in mental health contexts. It is not a therapeutic product.

2026-06-13T00:24:36Z 7 pages, 2 figures. Cloze is released under AGPL-3.0 Matthew Flathers Francesco Cipriani John Torous http://arxiv.org/abs/2606.15026v1 Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals 2026-06-12T23:50:33Z

Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
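
As a rough illustration of the late-fusion ensemble, the sketch below averages per-class probabilities from three models and takes the argmax. Plain averaging is an assumption made for illustration (the abstract does not state the combination rule), and the probability vectors are made-up numbers rather than WESAD results.

```python
def late_fusion(prob_lists):
    """Average per-class probabilities across models.

    `prob_lists` holds one probability vector per model, all of equal length.
    Returns (index of the winning class, averaged probability vector).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    averaged = [
        sum(probs[c] for probs in prob_lists) / n_models
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda c: averaged[c])
    return best_class, averaged


# Hypothetical outputs for one window from LSTM, TCN, and Transformer models.
lstm_probs = [0.6, 0.3, 0.1]
tcn_probs = [0.2, 0.5, 0.3]
transformer_probs = [0.1, 0.7, 0.2]

predicted, fused = late_fusion([lstm_probs, tcn_probs, transformer_probs])
```

In this example the LSTM alone would pick class 0, but the fused prediction follows the two models that agree on class 1.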

2026-06-12T23:50:33Z Accepted for publication in the 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2026). DOI: https://doi.org/10.1145/3807503.3819363 Desta Haileselassie Hagos Saurav Keshari Aryal Patrick Ymele-Leki Anietie Andy Legand L. Burge 10.1145/3807503.3819363 http://arxiv.org/abs/2606.15017v1 Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents 2026-06-12T23:30:14Z

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.
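
The token-matched comparison can be made concrete with a small arithmetic sketch. It assumes, purely for illustration, a constant token cost per actor step; the function name and all numbers are hypothetical and do not come from the paper.

```python
def matched_actor_steps(base_steps, module_tokens, tokens_per_step):
    """Steps a vanilla actor can afford under the augmented agent's total budget.

    The augmented agent spends `base_steps * tokens_per_step` on the actor plus
    `module_tokens` on memory/skill modules. The vanilla baseline receives the
    same total and spends all of it on actor steps.
    """
    total_budget = base_steps * tokens_per_step + module_tokens
    return total_budget // tokens_per_step


# Hypothetical: 20 actor steps at 4,000 tokens each, plus 30,000 module tokens.
steps = matched_actor_steps(base_steps=20, module_tokens=30_000, tokens_per_step=4_000)
```

Under these made-up numbers the vanilla actor gets 27 steps instead of 20, which is the kind of extra headroom the budget-matched baseline is meant to capture.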

2026-06-12T23:30:14Z Sina Hajimiri Masih Aminbeidokhti Jose Dolz Ismail Ben Ayed Issam H. Laradji Spandana Gella Nicolas Gontier http://arxiv.org/abs/2512.21577v3 A Unified Definition of Hallucination: It's The World Model, Stupid! 2026-06-12T23:25:06Z

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition that subsumes them. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. Examples include stating a fact that contradicts a knowledge base or producing a summary that contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

2025-12-25T08:42:18Z ICML 2026. HalluWorld benchmark at https://github.com/DegenAI-Labs/HalluWorld Emmy Liu Varun Gangal Chelsea Zou Michael Yu Xiaoqi Huang Alex Chang Zhuofu Tao Karan Singh Sachin Kumar Steven Y. Feng http://arxiv.org/abs/2605.17106v2 HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools 2026-06-12T23:23:22Z

Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.
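
The routing step described above (pick the cheapest model whose capabilities meet the predicted requirements) might look roughly like the following sketch. The shortfall definition (sum of positive gaps across the four dimensions), the fallback rule, and the model profiles are assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass

DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass
class ModelProfile:
    name: str
    cost: float
    capabilities: dict  # dimension name -> score in [0, 1]


def shortfall(required, capabilities):
    """Total amount by which a model falls short of the requirements (assumed definition)."""
    return sum(max(0.0, required[d] - capabilities[d]) for d in DIMENSIONS)


def route(required, pool, threshold=0.0):
    """Cheapest model whose shortfall is within `threshold`.

    If no model qualifies, fall back to the smallest shortfall (ties broken by cost).
    """
    eligible = [m for m in pool if shortfall(required, m.capabilities) <= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return min(pool, key=lambda m: (shortfall(required, m.capabilities), m.cost))


def uniform(score):
    return {d: score for d in DIMENSIONS}


# Hypothetical pool; names, costs, and capability scores are placeholders.
POOL = [
    ModelProfile("small", 1.0, uniform(0.4)),
    ModelProfile("medium", 5.0, uniform(0.7)),
    ModelProfile("large", 20.0, uniform(0.9)),
]

chosen = route(uniform(0.6), POOL)
```

Because the profiles live in plain configuration, adding or removing a model only changes `POOL`; raising `threshold` trades quality for cost by tolerating a larger shortfall.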

2026-05-16T18:19:30Z preprint v2 Aashna Garg Siddharth Singha Roy Jinu Jang Federico Brancasi Shengyu Fu http://arxiv.org/abs/2606.15007v1 Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning 2026-06-12T22:56:12Z

We introduce Nemotron 3 Ultra, a Mixture-of-Experts Hybrid Mamba-Attention language model with 550 billion total and 55 billion active parameters. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies: LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput than state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

2026-06-12T22:56:12Z NVIDIA : Aaron Blakeman Aaron Thomas Aastha Jhunjhunwala Abhibha Gupta Abhinav Khattar Adam Rajfer Adi Renduchintala Adil Asif Aditya Vavre Adriana Flores Miranda Ahmad Bilal Aileen Zaman Ajay Hotchandani Akanksha Shukla Akhiad Bercovich Aleksander Ficek Alex Gronskiy Alex Kondratenko Alex Steiner Alex Ye Alexander Bukharin Alexandre Milesi Ali Taghibakhshi Alice Gatti Alisa Liu Alok Kumar Amar Phanishayee Ameya Sunil Mahabaleshwarkar Amir Klein Amit Zuker Amnon Geifman Anahita Bhiwandiwalla Ananth Subramaniam Andrea Santilli Andrew Fulks Andrew McHarg Andrew Tao Andrii Skliar Anjulie Agrusa Ankur Srivastava Ankur Verma Anna Shors Anna Warno Antoni-Joan Solergibert I Llaquet Arham Mehta Arkadiusz Nowaczynski Arti Jain Ashwath Aithal Ashwin Poojary Asif Ahamed Asit Mishra Asma Kuriparambil Thekkumpate Atefeh Sohrabizadeh Avinash Kaur Avinash Vem Ayush Dattagupta Barath Subramaniam Anandan Bardiya Sadeghi Ben Lanir Benedikt Schifferer Besmira Nushi Bilal Kartal Bill Thiede Bita Darvish Rouhani Bo Deng Bob Schatz Boris Ginsburg Boxin Wang Brad Nemire Brandon Norick Brian Dang Brian Westphal Brian Yu Brucek Khailany Bryan Catanzaro Carlo del Mundo Caryln Aarish Chankyu Lee Chantal Hwang Charbel Sakr Charles Wang Charlie Truong Chen Cui Cheng Cheng Cheng-Ping Hsieh Chenghao Zhang Chenhui Deng Chintan Patel Chris Alexiuk Christian Cosgrove Christian Munley Christine Harvey Christopher Parisien
Chunyang Shen Coco Li Collin Neale Cynthia Gao Cyril Meurillon Dan Gil Dan Su Dan Zhao Dane Corneil Daniel Afrimi Daniel Egert Daniel Korzekwa Daniel Lo Daniel Machlab Daniel Serebrenik Daniil Sorokin Daria Gitman Daria Levy Darko Stosic David Mosallanezhad David Yu Davit Karamyan Deena Donia Deep Debroy Deepak Narayanan Devin O'Kelly Dheeraj Peri Dhruv Nathawani Di Wu Dima Rekesh Divyanshu Kakwani Donald Plummer Dong Anh Dongfeng Yu Dongfu Jiang Donnie Kim Dorrin Poorkay Duncan Riach Dusan Stosic Dustin VanStee Eavan Meng Edgar Minasyan Edward Lin Eileen Margaret Peters Long Elad Sarafin Elad Segal Elena Lantz Ellie Evans Elliott Ning Eric Chung Eric Harper Eric Pham-Hung Eric Tramel Eric Yang Erick Galinkin Erik Pounds Erika Goncalves Goncalves Evan Briones Evan Wu Evelina Bakhturina Evgeny Tsykunov Ewa Dobrowolska Faisal Ladhak Farzan Memarian Fay Wang Fei Jia Felipe Soares Felipe Vieira Frujeri Feng Chen Fengguang Lin Ferenc Galko Frank Sun Frankie Siino Frida Hou Gal Hubara Agam Gal Kaplun Gantavya Bhatt Gargi Prasad Garvit Kulshreshtha George Armstrong Gerald Shen Giulio Borghesi Gordana Neskovic Gorkem Batmaz Grace Lam Greg Mason Greg Pauloski Grigor Nalbandyan Grzegorz Chlebus Grzegorz Karch Guan-Ting Liu Guoming Zhang Guyue Huang Haggai Maron Haifeng Qian Haim Elisha Haoxing Ren Haran Kumar Shiv Kumar Haribhau Hud Harris Nover Harrison Saturley Hall Hayate Iso Helen Ngo Herbert Hum Herman Sahota Hexin Wang Himanshu Soni Hovhannes Tamoyan Hua Li Huanhuan Chen Hui Li Hui Wang Huy Nguyen Ian Chiles Ido Galil Ido Shahaf Igor Gitman Igor Shovkun Ilya Loshchilov Ingo Guehring Itamar Schen Itay Levy Itay Neeman Ivan Moshkov Izik Golan Izzy Putterman Jaemin Choi Jakub Slowikowski Jan Kautz Jane Polak Scowcroft Jared Casper Jatin Mitra Jeffrey Glick Jenny Chen Jesse Oliver Jiacheng Xu
Jiafan Zhu Jialin Song Jian Zhang Jiantao Jiao Jiaqi Zeng Jie Lou Jim King Jimmy Zhang Jingquan Wang Jinhang Choi Jinju Chu Joey Conway Joey Guman Johan Jatko Johannes Rausch John Kamalu John Roberts Johnny Greco Johnny Mensel Jonah Alben Jonas Yang Jonathan Cohen Jonathan Raiman Joseph Jennings Joshua Mabry Joshua Pierce Joyjit Daw Julien Veron Vialard Junkeun Yi Jupinder Parmar Kajal Jain Kan Zhu Kari Briski Katherine Cheung Katherine Luna Keith Willowhawk Keith Wyss Keshav Santhanam Kevin Shih Kezhi Kong Khanh Nguyen Khushi Bhardwaj Kirthi Shankar Sivamani Konstantinos Krommydas Krishna C. Puvvada Krzysztof Pawelec Kumar Anik Kyle Keprios Kylie Day Lawrence McAfee Leo Du Leon Derczynski Li Ding Linda Liu Lingjie Wu Lior Kadoch Lizzie Wei Luis Vega Luke Robison Lun Su Maarten Van Segbroeck Maciej Jakub Mikulski Maer Rodrigues de Melo Magda Sypula Mahan Fathi Makesh Narsimhan Sreedhar Makesh Tarun Chandran Manoj Kilaru Maor Ashkenazi Marc Cuevas Marc Romeijn Marcin Chochowski Mark Cai Mark Mozolewski Markus Kliegl Marta Stepniewska-Dziubinska Martyna Patelka Mattei Machczynski Matvei Novikov Mauricio Ferrato Maximilian Golub Mehrzad Samadi Melissa Corpuz Mengru Wang Mengxi Wu Meredith Price Meriem Boubdir Micah Schaffer Michael Andersch Michael Boone Michael Gschwind Michael Lightstone Michael Loh Michal Bien Michal Zawalski Michelle Gill Miguel Martinez Mikail Khona Mike Chrzanowski Mike Houston Mingyuan Ma Minseok Lee Mohamed Fawzy Mohammad Dabbah Mohammad Shoeybi Mostofa Patwary Nabin Mulepati Najeeb Nabwani Namit Dhameja Narimane Hennouni Natalie Hereth Nathaniel Pinckney Nave Algarici Nave Assaf Netanel Haber Nicholas Knight Nick Reamaroon Nickson Quak Nidhi Bhatia Nikhil Desai Nikolai Ludwig Nima Tajbakhsh Ning Xu Nir Ailon Nirmal Juluru Nitin Nitin Ofri Masad Oleg Rybakov Oleksii Hrinchuk Oleksii Kuchaiev Olivia Viessmann Olivier Delalleau Oluwatobi Olabiyi Omer Ullman Argov Omri Puny Oren Tropp Pablo Ribalta Pallab Bhattacharya Panos Lampropoulos Parth 
Mannan Pasha Shamis Patrick Legresley Paul Gibbons Pavlo Molchanov Pawel Morkisz Peter Dykas Peter Jin Pierre-Yves Aquilanti Pinky Xu Piotr Januszewski Piotr Laskiewicz Pooya Jannaty Prakash Gurumurthy Pranav Prashant Thombre Prasoon Varshney Pritam Gundecha Przemek Tredak Puhui Meng Qiyu Wan Rabeeh Karimi Mahabadi Rachel Oberman Rachit Garg Radha Sri-Tharan Rahul Kandu Rakshit Sanadhya Ran El-Yaniv Ran Zilberstein Rasoul Shafipour Ray Macalisang Rayen Tian Reka Kovacs Renjie Pi Rick Izzo Rima Shahbazyan Rishabh Garg Rishi Puri Rita Fernandes Neves Ritchie Zhao Ritika Borkar Ritu Gala Riyad Islam Robert Clark Robert Hesse Robert Kirby Roger Waleffe Rohit Watve Roi Koren Ron Banner Ruoxi Zhang Russell J. Hewett Ryan Prenger Ryan Stewart Ryota Egashira Sadegh Mahdavi Saee Paliwal Sagar Singh Sahil Modi Salika Dave Samantha Shinagawa Samuel Kriman Sandip Bhaskar Sangkug Lym Sanjay Kariyappa Sanjeev Satheesh Saran Vikas Murari Satish Pasumarthi Saurabh Mishra Saurav Muralidharan Scott Hara Sean Narentharen Selvaraj Anandaraj Seonjin Na Seonmeyong Bak Seonmyeong Bak Sepehr Sameni Seph Mard Serge Panev Seth Henneman Seth Poulos Shahar Mor Shantanu Acharya Shaona Ghosh Sharath Turuvekere Sreenivas Sharon Mendelson Shaun Kotek Shawn Wang Shay Aharon Shaya Gharghabi Sheng-Chieh Lin Shi Chen Shiqing Fan Shirish Baskaran Shreya Gopa Shrimai Prabhumoye Shubham Pachori Shubham Toshniwal Shuoyang Ding Shwetha Krishnamurthy Siddharth Singh Simeng Sun Sirshak Das Sivakumar Arayandi Thottakara Smita Ithape Somshubra Majumdar Soumye Singhal Sri Harsha Singudasu Sridhar Bhuvanapalli Srimukh Veccham Stas Sergienko Stefania Alborghetti Stephen Ge Su Rong Sugam Dipak Devare Sukrit Rao Sumeet Kumar Barua Sungsoo Ha Sunny Gai Suriya Gunasekar Suseella Panguluri Suyog Gupta Sviataslau Hinzburh Sweta Priyadarshi Syeda Nahida Akter Talor Abramovich Tan Bui Tanay Varshney Tatevik Ter-Hovhannisyan Teodor-Dumitru Ene Terry Kong Thanh Do Tianhe Zhang Tiffany Moore Tijmen Blankevoort Tim Moon 
Tiyasa Mitra Tom Balough Tomasz Grzegorzek Tomasz Hliwiak Tomer Asida Tomer Bar Natan Tomer Keren Tomer Ronen Tony Salim Tony Wang Traian Rebedea Tugrul Konuk Twinkle Vashishth Udi Karpas Ushnish De Vahid Noorozi Venkat Srinivasan Venmugil Elango Vibhor Agrawal Victor Cui Vijay Korthikanti Vikas Mehta Vinay Rao Virginia Wu Vitaly Kurin Vitaly Lavrukhin Vladimir Anisimov Vu Pham Wanli Jiang Wasi Uddin Ahmad Wataru Ishihara Wei Du Wei Ping Weiheng Chai Wenliang Dai Wesley Helmholz Will Jennings Will Zhu Wojciech Prazuch Xiaowei Ren Xiwen Yu Yan Breek Yang Chen Yang Yu Yangyi Chen Yaniv Galron Yashaswi Karnati Yejin Choi Yev Meyer Yi-Fu Wu Yian Zhang Ying Lin Yonatan Geifman Yonggan Fu Youngeun Kwon Yu Yao Yugi Guvvla Yuki Huang Yunsheng Liu Zach Moshe Zachary Newell Zhilin Wang Zhiyu Li Zhongbo Zhu Zhuolin Yang Zihan Liu Zijie Yan Zsolt-Alon Wertheimer http://arxiv.org/abs/2408.05568v2 Metacognitive Myopia in Large Language Models 2026-06-12T22:06:49Z

Large Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally embedded stereotypes, influence moral judgments, or amplify positive evaluations of majority groups. We propose metacognitive myopia as a cognitive-ecological framework accounting for a conglomerate of established and emerging LLM biases. Our theoretical framework posits that biased samples in the information environment cause five symptoms of metacognitive myopia in LLMs: integration of invalid embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. Moreover, it posits that the two main components of metacognition, monitoring and control, could account for these five symptoms. Accordingly, we further outline how monitoring and control could be approximated technically, for instance, through hidden parallel reasoning histories that allow interactive LLMs to evaluate risks of myopic inference before generating overt responses. Our theoretical framework provides a novel perspective on flawed human-machine interactions and agentic AI and raises significant ethical concerns regarding the implementation of LLMs in organizational structures and high-stakes decisions.

2024-08-10T14:43:57Z Florian Scholten Tobias R. Rebholz Mandy Hütter http://arxiv.org/abs/2510.06445v3 A Survey on Agentic Security: Applications, Threats and Defenses 2026-06-12T21:57:35Z

LLM-based agents are now used throughout cybersecurity. While these agents facilitate powerful and autonomous security applications, their autonomy opens up new attack surfaces, and the security community is actively building defenses to secure them. Yet the literature on this subject has grown quickly and unevenly. Existing surveys treat applications, threats, and defenses in isolation, leaving no unified account of how an agent's capabilities, vulnerabilities, and countermeasures interconnect. In this work we present the first holistic survey of the agentic security landscape, structuring the field around the fundamental pillars of Applications, Threats and Defenses. We provide a comprehensive taxonomy of over 260 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. In addition, we provide detailed pillar-specific and cross-cutting analyses that show the security-lifecycle coverage of agentic applications, comparison between red-teaming and blue-teaming agents, and the adversarial use of red-teaming applications. On the threat side, we analyze the entry points and agent-loop stages that attacks target, their specificity to the agentic setting, and the threat models they assume. On the defense side, we analyze the prevailing defense strategies, their cost and security trade-offs, and where in the agent lifecycle they are deployed. We further map which defenses cover which attack classes and chart trends in agent architecture, backbone model usage, data modality coverage, and the growth of attack and defense research over time. Taken together, these findings indicate that agentic systems are structurally fragile by default and that securing them will require defenses that span the full agent lifecycle rather than single-layer fixes.

2025-10-07T20:32:20Z Asif Shahriar Md Nafiu Rahman Sadif Ahmed Farig Sadeque Md Rizwan Parvez http://arxiv.org/abs/2505.09655v5 DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning 2026-06-12T21:29:13Z

Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment. The code is available at https://github.com/xiwenc1/DRA-GRPO.
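
One simple way to picture diversity-aware reward adjustment is to down-weight each completion's reward by how semantically crowded its neighbourhood is within the sampled group. The sketch below assumes a density given by the sum of clipped cosine similarities to the other completions and a `reward / (1 + density)` adjustment; both are illustrative assumptions, and the paper's SMI-based formulation may differ. The embeddings are toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjust_rewards(rewards, embeddings):
    """Scale each reward by 1 / (1 + semantic density within the group).

    Density is the sum of non-negative cosine similarities to the other
    completions, so redundant completions share credit while a semantically
    distinct completion keeps its full reward.
    """
    adjusted = []
    for i, reward in enumerate(rewards):
        density = sum(
            max(0.0, cosine(embeddings[i], other))
            for j, other in enumerate(embeddings)
            if j != i
        )
        adjusted.append(reward / (1.0 + density))
    return adjusted


# Three correct completions: two near-duplicates and one distinct strategy.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_rewards = adjust_rewards([1.0, 1.0, 1.0], toy_embeddings)
```

The adjusted rewards would then feed the usual group-relative advantage computation in GRPO, which is why such an adjustment can be applied without changing the optimizer.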

2025-05-14T02:02:32Z ACL2026 Xiwen Chen Wenhui Zhu Peijie Qiu Xuanzhao Dong Hao Wang Haiyu Wu Huayu Li Aristeidis Sotiras Yalin Wang Abolfazl Razi http://arxiv.org/abs/2606.14961v1 CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning 2026-06-12T21:10:33Z

Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence--rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence--rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.
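
A joint reward over the three signals can be sketched as a weighted sum followed by the group-relative normalization that GRPO applies to sampled completions. The weights and the additive form are assumptions for illustration only; the abstract does not give the actual reward definition.

```python
import statistics


def combined_reward(correct, answer_prob, rubric_score, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of correctness, committed-answer probability, and rubric support.

    `answer_prob` and `rubric_score` are assumed to lie in [0, 1].
    The weights are hypothetical.
    """
    w_correct, w_prob, w_rubric = weights
    return w_correct * float(correct) + w_prob * answer_prob + w_rubric * rubric_score


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of three sampled completions for one question.
group = [
    combined_reward(True, 0.8, 0.6),
    combined_reward(True, 0.9, 0.2),
    combined_reward(False, 0.7, 0.3),
]
advantages = group_advantages(group)
```

In this toy group the first completion earns the largest advantage: it is correct and its rationale scores well on the rubric, whereas the second is correct and more confident but poorly supported.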

2026-06-12T21:10:33Z Juming Xiong Weixin Liu Kevin Guo Congning Ni Junchao Zhu Chongyu Qu Chao Yan Katherine Brown Avinash Baidya Xiang Gao Bradley Malin Zhijun Yin http://arxiv.org/abs/2603.08999v3 Learning When to Sample: Confidence-Aware Selective Sampling for Efficient Chain-of-Thought Reasoning 2026-06-12T21:09:05Z

Large language models (LLMs) can achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet they often generate unnecessarily long reasoning paths that incur high inference cost. Self-consistency-based approaches push accuracy higher still, but they require sampling and aggregating multiple reasoning trajectories, leading to substantial computational overhead. In this paper, we introduce a confidence-aware selective sampling framework that, at inference time, analyzes a single reasoning trajectory to adaptively determine whether to rely on that trajectory alone or trigger multi-path sampling. The framework uses trajectory-level numeric features and sentence-level linguistic features extracted from reasoning states to guide selective multi-path reasoning. We train it on MedQA and evaluate it in-domain on MedQA and under calibration-only transfer on MathQA, MedMCQA, and MMLU, without further fine-tuning. Experimental results show that the proposed framework maintains comparable performance to full and efficient multi-path reasoning baselines, with accuracy changes of $-0.41 \pm 0.58$ and $-0.31 \pm 0.58$ percentage points, respectively, while reducing token usage by $71.7 \pm 5.0\%$ and $36.6 \pm 9.1\%$. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
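
The gating logic can be sketched as follows: keep the single trajectory when its estimated confidence is high, and otherwise fall back to multi-path sampling with a majority vote. In the paper the confidence comes from a model over trajectory features; here it is simply passed in as a number, and the threshold, sample count, and `sample_fn` callback are hypothetical.

```python
from collections import Counter


def answer_with_selective_sampling(first_answer, confidence, sample_fn,
                                   threshold=0.8, n_samples=5):
    """Return (final answer, number of trajectories used).

    If `confidence` in the first trajectory reaches `threshold`, its answer is
    kept as is. Otherwise `sample_fn()` is called to draw additional answers
    and the majority answer over all trajectories is returned.
    """
    if confidence >= threshold:
        return first_answer, 1
    answers = [first_answer] + [sample_fn() for _ in range(n_samples - 1)]
    majority, _count = Counter(answers).most_common(1)[0]
    return majority, len(answers)


# Confident case: no extra sampling is triggered.
confident_result = answer_with_selective_sampling("A", 0.93, sample_fn=lambda: "B")

# Uncertain case: four extra samples are drawn and the majority wins.
uncertain_result = answer_with_selective_sampling("A", 0.35, sample_fn=lambda: "B")
```

The token savings reported in the abstract come from the first branch: whenever the gate trusts the initial trajectory, the cost of the additional samples is avoided.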

2026-03-09T22:34:06Z Juming Xiong Kevin Guo Congning Ni Wexin Liu Chao Yan Katherine Brown Avinash Baidya Xiang Gao Bradley Malin Zhijun Yin