http://arxiv.org/abs/2512.19011v3 Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale 2026-06-13T05:46:54Z

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches impose significant per-query latency and infrastructure cost. Very little research has asked whether CPU-based classifiers, such as support vector machines and gradient-boosted trees trained on TF-IDF features, can match their accuracy across the conditions that production deployments encounter. We evaluate five CPU classifier families, Mamba-130M as an SSM-based GPU classifier, and transformer-based GPU models (DeBERTa-v3 and Gemma-2B with LoRA) across nine jailbreak sources and three regimes: in-distribution (D1), out-of-distribution (D2), and adversarially obfuscated (D3). On D1, the best CPU classifier matches the best transformer GPU model at roughly one-fifth the deployment cost. On D2, CPU classifiers fail via confident miscalibration, producing high-confidence false negatives that bypass escalation entirely. On D3, CPU classifiers outperform transformer GPU models by more than 26 percentage points in F1. Based on these complementary failure modes, we design GuardChain, a three-stage safety pipeline (Regex -> CPU -> GPU) that routes each prompt to the cheapest stage capable of a confident decision. The CPU stage alone resolves 80% of in-distribution prompts at near-peak accuracy, and the GPU stage recovers the out-of-distribution failures. For practitioners deploying LLM safety at scale, this work provides evidence that GPU-class infrastructure is unnecessary for the majority of traffic.
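
The routing idea in this abstract (send each prompt to the cheapest stage that can decide confidently) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the GuardChain implementation: the pattern list, the `low`/`high` thresholds, the 0.5 cut-off at the last stage, and the names `regex_stage` and `cascade` are hypothetical, and the two scorers are passed in as plain callables that return P(unsafe).

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical block list; the abstract does not say which patterns are used.
BLOCK_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]


def regex_stage(prompt: str) -> Optional[str]:
    """Return 'unsafe' on a pattern hit, otherwise None (no decision)."""
    for pattern in BLOCK_PATTERNS:
        if pattern.search(prompt):
            return "unsafe"
    return None


def cascade(
    prompt: str,
    cpu_score: Callable[[str], float],
    gpu_score: Callable[[str], float],
    low: float = 0.1,   # assumed threshold: below this the CPU stage says 'safe'
    high: float = 0.9,  # assumed threshold: above this the CPU stage says 'unsafe'
) -> Tuple[str, str]:
    """Return (label, stage), where stage names the stage that decided."""
    verdict = regex_stage(prompt)
    if verdict is not None:
        return verdict, "regex"

    p_unsafe = cpu_score(prompt)
    if p_unsafe >= high:
        return "unsafe", "cpu"
    if p_unsafe <= low:
        return "safe", "cpu"

    # The CPU score is not confident either way: escalate to the costlier model.
    label = "unsafe" if gpu_score(prompt) >= 0.5 else "safe"
    return label, "gpu"
```

For example, `cascade("hello", lambda p: 0.5, lambda p: 0.8)` escalates to the last stage because 0.5 lies between the two thresholds. A threshold-only cascade like this one never escalates a confidently wrong CPU score, which is the out-of-distribution failure mode the abstract describes.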

2025-12-22T04:00:35Z Under Review. 25 pages, 5 figures, 38 tables Vasudev Majhi Dhruv Gupta Advait Singh Matthew Barker Dhruv Kumar http://arxiv.org/abs/2606.13003v2 The Illusion of Multi-Agent Advantage 2026-06-13T05:32:43Z

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

2026-06-11T07:39:24Z Prathyusha Jwalapuram Hehai Lin Chuyuan Li Fangkai Jiao Sudong Wang Yifei Ming Zixuan Ke Chengwei Qin Giuseppe Carenini Shafiq Joty http://arxiv.org/abs/2606.08867v2 Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework 2026-06-13T05:21:25Z

The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.

2026-06-07T22:44:00Z 12 pages. Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining) Aman Gupta Kevin Rossell Edesio Alcobaça Jose Chrystian Lima Pacheco Carolina Baptista de Lima Shao Tang Luiz Paulo Rabachini Luis Moneda Herbert Fei Daniel Silva Rohan Ramanath 10.1145/3770855.3818332 http://arxiv.org/abs/2509.22808v2 ArFake: A Robust Framework for Multi-Dialect Arabic Speech Spoofing Detection Benchmark 2026-06-13T04:58:50Z

With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To decide whether to build the final dataset by merging audio from multiple models or by selecting the best-performing model, we evaluate the difficulty of the synthesized audio from each TTS model and determine which produces the most challenging samples. Our evaluation pipeline trains classifiers using modern embedding-based methods combined with classifier heads, classical machine learning algorithms applied to MFCC features, and the RawNet2 architecture. The pipeline further computes the Mean Opinion Score from human ratings and processes both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.

2025-09-26T18:11:20Z Mohamed Elsetohy Alhassan Ehab Ali Mekky Besher Hassan Shady Shehata http://arxiv.org/abs/2606.05014v2 Depth-Attention: Cross-Layer Value Mixing for Language Models 2026-06-13T04:31:25Z

Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
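
The depth-wise selection step described here can be sketched for a single token position in plain Python. This is a sketch under stated assumptions rather than the authors' implementation: it assumes standard scaled dot-product scoring with a softmax over layers, it assumes the current layer's own key and value are among the candidates (the abstract does not say either way), and the names `softmax` and `depth_mix` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_mix(
    query: Vector,
    layer_keys: List[Vector],
    layer_values: List[Vector],
) -> Tuple[Vector, List[float]]:
    """Mix per-layer values at ONE token position.

    query        : the current layer's query vector at this position.
    layer_keys   : one key vector per candidate layer at the same position.
    layer_values : one value vector per candidate layer at the same position.
    Returns the depth-mixed value and the per-layer weights.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in layer_keys
    ]
    weights = softmax(scores)
    value_dim = len(layer_values[0])
    mixed = [
        sum(w * value[i] for w, value in zip(weights, layer_values))
        for i in range(value_dim)
    ]
    return mixed, weights
```

In the scheme the abstract describes, the mixed vector would be stored in place of the layer's original value before ordinary self-attention over the sequence runs, which is why no state beyond the key-value cache is needed.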

2026-06-03T15:33:45Z 21 pages, 4 figures, 9 tables Boyi Zeng Yiqin Hao Zitong Wang Shixiang Song He Li Feichen Song Yifan Liu Ziwei He Xinbing Wang Zhouhan Lin http://arxiv.org/abs/2606.04547v2 Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization 2026-06-13T04:11:45Z

Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.

2026-06-03T07:32:18Z 16 pages, 6 figures Heng Cao Fan Zhang Jian Yao Yujie Zheng Changlin Zhao Lu Hao Yuxuan Wei Wangze Ni Huaiyu Fu Yuqian Sun Xuyan Mo http://arxiv.org/abs/2510.06198v3 The Answer Lies Within: Self-Derived Rewards Enable Explainable Relation Extraction 2026-06-13T04:01:29Z

Despite the remarkable reasoning capabilities of large language models, they still struggle with one-shot relation extraction without predefined relation labels. We identify two pitfalls: models are often misled by irrelevant tokens instead of relation-conveying semantics, and they often fail to align with the abstraction level human annotators expect. We introduce a novel framework that closes this gap with two components: (1) COGRE, a cognitively-inspired reasoning framework that structures RE into a series of processes mimicking human text-processing; and (2) HIT@DICT, a reinforcement learning intermediate reward strategy that encourages reasoning to align with relational labels by rewarding relation-relevant phrases in reasoning. The reward is derived from a credit dictionary automatically extracted from correct predictions. Our experiments show that our framework improves both accuracy and explanation quality by addressing these two pitfalls. For example, COGRE with Qwen2.5-14B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using HIT@DICT further improves performance by 23.46 percentage points. Finally, human evaluation shows that our best model generates relational phrases closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
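
A dictionary-based intermediate reward of the kind described above might look like the following Python sketch. The abstract does not specify how phrases are extracted or how hits are scored, so several things here are assumptions: phrases are taken as already extracted, a hit is a case-insensitive substring match, the reward is binary, and the names `build_credit_dict` and `hit_reward` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_credit_dict(
    examples: Iterable[Tuple[str, str, List[str]]],
) -> Dict[str, Set[str]]:
    """Collect phrases per relation label from correct predictions only.

    Each example is (gold_label, predicted_label, phrases_from_reasoning).
    """
    credit: Dict[str, Set[str]] = defaultdict(set)
    for gold, predicted, phrases in examples:
        if gold == predicted:  # incorrect predictions contribute nothing
            credit[gold].update(phrase.lower() for phrase in phrases)
    return credit


def hit_reward(reasoning: str, gold_label: str, credit: Dict[str, Set[str]]) -> float:
    """Return 1.0 if the reasoning mentions any credited phrase for the label."""
    text = reasoning.lower()
    phrases = credit.get(gold_label, ())
    return 1.0 if any(phrase in text for phrase in phrases) else 0.0
```

A graded variant (for instance, the fraction of credited phrases that appear) would be an equally plausible reading; the binary form is used here only because it is the simplest.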

2025-10-07T17:53:55Z Work in progress Xinyu Guo Zhengliang Shi Minglai Yang Mihai Surdeanu http://arxiv.org/abs/2606.15080v1 AdaMame: A Training Recipe for Adaptive Multilingual Reasoning 2026-06-13T03:22:35Z

While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-offs in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.
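
One way to picture an alignment factor that grows during training is the Python sketch below. It combines the usual GRPO group-relative normalization (reward minus group mean, divided by group standard deviation) with a weight on language fidelity that ramps up over training steps. The linear schedule, the additive reward form, and the names `alignment_factor` and `group_advantages` are assumptions for illustration; the abstract does not give the actual formula, and the query-conditioned part of the factor is not modeled here.

```python
import statistics
from typing import List


def alignment_factor(step: int, total_steps: int, max_factor: float = 1.0) -> float:
    """Assumed schedule: linear ramp from 0 to max_factor over training."""
    return max_factor * min(1.0, step / total_steps)


def group_advantages(
    accuracy: List[float],   # per-sample accuracy reward within one group
    fidelity: List[float],   # per-sample language-fidelity reward (1.0 = query language)
    step: int,
    total_steps: int,
    eps: float = 1e-8,
) -> List[float]:
    """Group-relative advantages with a fidelity weight that grows over time."""
    weight = alignment_factor(step, total_steps)
    rewards = [a + weight * f for a, f in zip(accuracy, fidelity)]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Early in training the weight is near zero, so two equally accurate samples in different languages receive the same advantage (exploration). Late in training the sample that reasons in the query language is preferred (exploitation).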

2026-06-13T03:22:35Z 20 pages, 5 figures Dayeon Ki Kevin Duh Marine Carpuat http://arxiv.org/abs/2606.15079v1 Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale 2026-06-13T03:21:49Z

Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

2026-06-13T03:21:49Z Ang Li Ben Liu Bin Han Bin Hu Bin Jing Binbin Hu Bing Li Cai Chen Caizhi Tang Changxin Tian Chao Huang Chao Zhang Chen Liang Chen Qian Chengfu Tang Chengyao Wen Chilin Fu Chunwei Wu Cong Zhang Cunyin Peng Daixin Wang Dalong Zhang Deng Zhao Dingnan Jin Dingyuan Zhu Donghao Zhang Fan Yuan Fangzheng Zhao Fanzhuang Meng Feifan Wu Feng Xu Fengbin Fang Gangshan Wang Guodong Yang Hailin Zhao Haitao Wang Haitao Zhang Hanxiao Zhang Hanzi Wang Hao Dai Hao Liu Hao Qian Hao Wu Haoxiong Liu Haoyu Xu Heng Zhang Hong Liu Hongliang Zhang Hongrui Liu Hongxun Li Hongzhi Ruan Huaidong Xiong Huihuang Zheng Huikang Tang Jia Guo Jia Li Jia Liu Jiameng Wang Jiaming Liu Jiannan Shi Jianping Wei Jiaolong Yang Jiapeng Wang Jie Gao Jie Wang Jiewei Wu Jin Yang Jinjin Li Jinjing Huang Jinquan Sun Jinyao Chen Juanhui Tu Jun Liu Jun Mei Jun Xu Jun Zhou Junjie Ou Junnan Sipan Junpeng Fang Kaihong Zhang Kaiqin Hu Ke Shi Kuan Xu Kun Tang Kunlong Chen Lanyin Mei Lei Chen Lei Liang Lei Xu Li Tang Liang Jiang Liangcheng Fu Lihui Zhang Linfeng Shi Lintao Ma Liyuan Liu Longfei Li Longfei Zheng Lu Liu Lu Yu Man Li Meiqi Zhu Meng Li Mengjie Gao Mengshu Sun Mingming Yin Mingyang Zhang Mingyuan Fan Nuo Xu Pan Tang Peijie Jiang Peilong Zhao Peng Lin Pingping Liu Qi Zuo Qian Zhao Qiang Cheng Qianggang Cao Qiaoben Bao Qing Cui Qingyuan Yang Qitao Shi Qiyin Huang Qizheng Zhou Quan Wan Runyuan Zhao Shaomian Zheng Shaowei Wei Shengnan Zhang Shuaicheng Li Shujie Li Shuo Zhang Sikang Bian Tianchu Yao Tiange Xu Tianshu Wang Ting Guo Tinghao Wang Tingwei Huang Tong Zhao Tongkai Yang Wang Hong Wanli Gu Wei Lu Weichang Wu Weiguang Han Weiquan Li Wenbo Shen Wenjing Fang Wenzhi Tang Xiang Shu Xiao Shi Xiaodong Yan Xiaolu Zhang Xiaopei Wan Xiaqing Sun Xin Zhao Xingyu Lu Xinxing Yang Xinyao Tang Xinyu Kong Xinyu Liu Xiong Xu Xuan Sun Xudong Han Xudong Wang Xujie Shen Yalin Zhang Yangyang Hou Yankun Ren Yao Zhao Ye Chen Yeyang Chen Yibo Cao Yifan Zuo Yijie Chen Ying Li Yingjie Song Yingxue Li Yiqi Wang 
Yixuan Sun Yizhu Xiao Yongfei Xu Yu Liu Yuchen Fang Yue Gao Yue Yu Yue Zhang Yuqi Zhang Yuxiao He Yuxiao Lu Yuxin Tian Yuxuan Li Yuzhuo Fu Zhankai Xu Zhaoxin Huan Zhenduo Zhang Zhengke Gui Zhengyu Huang Zhenjun Ma Zhenxuan Pan Zheping Qu Zhibo Zhu Zhidong Fan Zhigang Huangfu Zhihao Wang Zhiqiang Zhang Zhizhen Liu Zhuyan Zhou Zibin Lin Zihang Zeng Zihao Wang Zilong Wang Ziqi Liu Zitao Xuan Zixuan Cheng Zujie Wen Zuoli Tang http://arxiv.org/abs/2507.17588v3 Dual-branch Prompting for Multimodal Machine Translation 2026-06-13T03:21:06Z

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. Our code is publicly available at https://github.com/MentaY/DDP.

2025-07-23T15:22:51Z This manuscript has been fully accepted and published by ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM) Jie Wang Zhendong Yang Liansong Zong Xiaobo Zhang Dexian Wang Ji Zhang http://arxiv.org/abs/2606.15077v1 Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation 2026-06-13T03:15:53Z

We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.

2026-06-13T03:15:53Z Accepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026 Kyle Gao Joel Cumming Jonathan Li Linlin Xu David A. Clausi http://arxiv.org/abs/2606.17092v1 Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization 2026-06-13T03:15:30Z

Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security-oriented framework for risk identification, evaluation, and mitigation in a multi-agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state-machine-based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red-teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi-turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.

2026-06-13T03:15:30Z Kyle Gao and Pranavi Kotta contributed equally to this work Kyle Gao Pranavi Kotta Linlin Xu Jonathan Li David A. Clausi http://arxiv.org/abs/2606.15070v1 Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models 2026-06-13T02:58:29Z

By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.

2026-06-13T02:58:29Z ICML 2026 Spotlight Jiakai Li Ke Qin Rongzheng Wang Yizhuo Ma Qizhi Chen Muquan Li Shuang Liang http://arxiv.org/abs/2606.15069v1 CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction 2026-06-13T02:58:13Z

Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models often fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can flip the label. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns and syntax of the original instances by altering the word-level and sentence-level contexts; and (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* data sets. Our code is released at https://github.com/Quinnok/CoCoGEC

2026-06-13T02:58:13Z Qianyu Wang Xiaoman Wang Yuanyuan Liang Xinyuan Li Yunshi Lan http://arxiv.org/abs/2606.15059v1 A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation 2026-06-13T02:20:49Z

Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.

2026-06-13T02:20:49Z Accepted to IWSLT 2026 Scientific Track Yulin Xue Siqi Ouyang Lei Li