http://arxiv.org/abs/2512.19011v3Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale2026-06-13T05:46:54ZSafety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches impose significant per-query latency and infrastructure cost. Very little research has asked whether CPU-based classifiers, such as support vector machines and gradient-boosted trees trained on TF-IDF features, can match their accuracy across the conditions that production deployments encounter.
We evaluate five CPU classifier families, Mamba-130M as an SSM-based GPU classifier, and transformer-based GPU models (DeBERTa-v3 and Gemma-2B with LoRA) across nine jailbreak sources and three regimes: in-distribution (D1), out-of-distribution (D2), and adversarially obfuscated (D3). On D1, the best CPU classifier matches the best transformer GPU model at roughly one-fifth the deployment cost. On D2, CPU classifiers fail via confident miscalibration, producing high-confidence false negatives that bypass escalation entirely. On D3, CPU classifiers outperform transformer GPU models by more than 26 percentage points in F1.
Based on these complementary failure modes, we design GuardChain, a three-stage safety pipeline (Regex -> CPU -> GPU) that routes each prompt to the cheapest stage capable of a confident decision. The CPU stage alone resolves 80% of in-distribution prompts at near-peak accuracy, and the GPU stage recovers the out-of-distribution failures. For practitioners deploying LLM safety at scale, this work provides evidence that GPU-class infrastructure is unnecessary for the majority of traffic.2025-12-22T04:00:35ZUnder Review. 25 pages, 5 figures, 38 tablesVasudev MajhiDhruv GuptaAdvait SinghMatthew BarkerDhruv Kumarhttp://arxiv.org/abs/2606.13003v2The Illusion of Multi-Agent Advantage2026-06-13T05:32:43ZPrevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential.
We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.2026-06-11T07:39:24ZPrathyusha JwalapuramHehai LinChuyuan LiFangkai JiaoSudong WangYifei MingZixuan KeChengwei QinGiuseppe CareniniShafiq Jotyhttp://arxiv.org/abs/2606.08867v2Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework2026-06-13T05:21:25ZThe rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment.
In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation.
A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.2026-06-07T22:44:00Z12 pages. Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)Aman GuptaKevin RossellEdesio AlcobaçaJose Chrystian Lima PachecoCarolina Baptista de LimaShao TangLuiz Paulo RabachiniLuis MonedaHerbert FeiDaniel SilvaRohan Ramanath10.1145/3770855.3818332http://arxiv.org/abs/2509.22808v2ArFake: A Robust Framework for Multi-Dialect Arabic Speech Spoofing Detection Benchmark2026-06-13T04:58:50ZWith the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset.
To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, and thereby guide the construction of our final dataset (either by merging audio from multiple models or by selecting the best-performing model), we built an evaluation pipeline that included training classifiers using two approaches (modern embedding-based methods combined with classifier heads, and classical machine learning algorithms applied to MFCC features) as well as the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.2025-09-26T18:11:20ZMohamed ElsetohyAlhassan EhabAli MekkyBesher HassanShady Shehatahttp://arxiv.org/abs/2606.05014v2Depth-Attention: Cross-Layer Value Mixing for Language Models2026-06-13T04:31:25ZSelf-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention.
We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.2026-06-03T15:33:45Z21 pages, 4 figures, 9 tablesBoyi ZengYiqin HaoZitong WangShixiang SongHe LiFeichen SongYifan LiuZiwei HeXinbing WangZhouhan Linhttp://arxiv.org/abs/2606.04547v2Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization2026-06-13T04:11:45ZPersonalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. 
To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.2026-06-03T07:32:18Z16 pages, 6 figuresHeng CaoFan ZhangJian YaoYujie ZhengChanglin ZhaoLu HaoYuxuan WeiWangze NiHuaiyu FuYuqian SunXuyan Mohttp://arxiv.org/abs/2510.06198v3The Answer Lies Within: Self-Derived Rewards Enable Explainable Relation Extraction2026-06-13T04:01:29ZDespite the remarkable reasoning capabilities of large language models, they still struggle with one-shot relation extraction without predefined relation labels. We identify two pitfalls: models are often misled by irrelevant tokens instead of relation-conveying semantics, and they often fail to align with the abstraction level human annotators expect. We introduce a novel framework that closes this gap with two components: (1) COGRE, a cognitively-inspired reasoning framework that structures RE into a series of processes mimicking human text-processing; and (2) HIT@DICT, a reinforcement learning intermediate reward strategy that encourages reasoning to align with relational labels by rewarding relation-relevant phrases in reasoning. 
The reward is derived from a credit dictionary automatically extracted from correct predictions. Our experiments show that our framework improves both accuracy and explanation quality by addressing these two pitfalls. For example, COGRE with Qwen2.5-14B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using HIT@DICT further improves performance by 23.46 percentage points. Finally, human evaluation shows that our best model generates relational phrases closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).2025-10-07T17:53:55ZWorking in processXinyu GuoZhengliang ShiMinglai YangMihai Surdeanuhttp://arxiv.org/abs/2606.15080v1AdaMame: A Training Recipe for Adaptive Multilingual Reasoning2026-06-13T03:22:35ZWhile Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-offs in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language.
Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.2026-06-13T03:22:35Z20 pages, 5 figuresDayeon KiKevin DuhMarine Carpuathttp://arxiv.org/abs/2606.15079v1Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale2026-06-13T03:21:49ZEfficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. 
KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.2026-06-13T03:21:49ZAng LiBen LiuBin HanBin HuBin JingBinbin HuBing LiCai ChenCaizhi TangChangxin TianChao HuangChao ZhangChen LiangChen QianChengfu TangChengyao WenChilin FuChunwei WuCong ZhangCunyin PengDaixin WangDalong ZhangDeng ZhaoDingnan JinDingyuan ZhuDonghao ZhangFan YuanFangzheng ZhaoFanzhuang MengFeifan WuFeng XuFengbin FangGangshan WangGuodong YangHailin ZhaoHaitao WangHaitao ZhangHanxiao ZhangHanzi WangHao DaiHao LiuHao QianHao WuHaoxiong LiuHaoyu XuHeng ZhangHong LiuHongliang ZhangHongrui LiuHongxun LiHongzhi RuanHuaidong XiongHuihuang ZhengHuikang TangJia GuoJia LiJia LiuJiameng WangJiaming LiuJiannan ShiJianping WeiJiaolong YangJiapeng WangJie GaoJie WangJiewei WuJin YangJinjin LiJinjing HuangJinquan SunJinyao ChenJuanhui TuJun LiuJun MeiJun XuJun ZhouJunjie OuJunnan SipanJunpeng FangKaihong ZhangKaiqin HuKe ShiKuan XuKun TangKunlong ChenLanyin MeiLei ChenLei LiangLei XuLi TangLiang JiangLiangcheng FuLihui ZhangLinfeng ShiLintao MaLiyuan LiuLongfei LiLongfei ZhengLu LiuLu YuMan LiMeiqi ZhuMeng LiMengjie GaoMengshu SunMingming YinMingyang ZhangMingyuan FanNuo XuPan TangPeijie JiangPeilong ZhaoPeng LinPingping LiuQi ZuoQian ZhaoQiang ChengQianggang CaoQiaoben BaoQing CuiQingyuan YangQitao ShiQiyin HuangQizheng ZhouQuan WanRunyuan ZhaoShaomian ZhengShaowei WeiShengnan ZhangShuaicheng LiShujie LiShuo ZhangSikang BianTianchu YaoTiange XuTianshu WangTing GuoTinghao WangTingwei HuangTong ZhaoTongkai YangWang HongWanli GuWei LuWeichang WuWeiguang HanWeiquan LiWenbo ShenWenjing FangWenzhi TangXiang ShuXiao ShiXiaodong YanXiaolu 
ZhangXiaopei WanXiaqing SunXin ZhaoXingyu LuXinxing YangXinyao TangXinyu KongXinyu LiuXiong XuXuan SunXudong HanXudong WangXujie ShenYalin ZhangYangyang HouYankun RenYao ZhaoYe ChenYeyang ChenYibo CaoYifan ZuoYijie ChenYing LiYingjie SongYingxue LiYiqi WangYixuan SunYizhu XiaoYongfei XuYu LiuYuchen FangYue GaoYue YuYue ZhangYuqi ZhangYuxiao HeYuxiao LuYuxin TianYuxuan LiYuzhuo FuZhankai XuZhaoxin HuanZhenduo ZhangZhengke GuiZhengyu HuangZhenjun MaZhenxuan PanZheping QuZhibo ZhuZhidong FanZhigang HuangfuZhihao WangZhiqiang ZhangZhizhen LiuZhuyan ZhouZibin LinZihang ZengZihao WangZilong WangZiqi LiuZitao XuanZixuan ChengZujie WenZuoli Tanghttp://arxiv.org/abs/2507.17588v3Dual-branch Prompting for Multimodal Machine Translation2026-06-13T03:21:06ZMultimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. 
Our code is publicly available at https://github.com/MentaY/DDP.2025-07-23T15:22:51ZThis manuscript has been fully accepted and published by ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM)Jie WangZhendong YangLiansong ZongXiaobo ZhangDexian WangJi Zhanghttp://arxiv.org/abs/2606.15077v1Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation2026-06-13T03:15:53ZWe present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.2026-06-13T03:15:53ZAccepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026Kyle GaoJoel CummingJonathan LiLinlin XuDavid A. 
Clausihttp://arxiv.org/abs/2606.17092v1Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization2026-06-13T03:15:30ZAgentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security-oriented framework for risk identification, evaluation, and mitigation in a multi-agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state-machine-based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red-teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi-turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.2026-06-13T03:15:30ZKyle Gao and Pranavi Kotta contributed equally to this workKyle GaoPranavi KottaLinlin XuJonathan LiDavid A. Clausihttp://arxiv.org/abs/2606.15070v1Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models2026-06-13T02:58:29ZBy incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. 
In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.2026-06-13T02:58:29ZICML 2026 SpotlightJiakai LiKe QinRongzheng WangYizhuo MaQizhi ChenMuquan LiShuang Lianghttp://arxiv.org/abs/2606.15069v1CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction2026-06-13T02:58:13ZGrammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models often fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can flip the label. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as the syntax of the original instances by altering the word-level and sentence-level contexts; (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient.
Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* datasets. Our code is released at https://github.com/Quinnok/CoCoGEC2026-06-13T02:58:13ZQianyu WangXiaoman WangYuanyuan LiangXinyuan LiYunshi Lanhttp://arxiv.org/abs/2606.15059v1A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation2026-06-13T02:20:49ZSimultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.2026-06-13T02:20:49ZAccepted to IWSLT 2026 Scientific TrackYulin XueSiqi OuyangLei Li