https://arxiv.org/api/Siyrt/2J8sS6km4XkR/8R6r01M4
2026-06-10T00:33:29Z
209316015

http://arxiv.org/abs/2606.07259v1
Assessing True Generalisability of Audio-Visual Speech Recognisers
2026-06-05T13:35:10Z
Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.
2026-06-05T13:35:10Z
Accepted to Interspeech 2026 Long paper track. 9 pages, 4 figures
Zhaofeng Lin, Stavros Petridis, Maja Pantic, Naomi Harte

http://arxiv.org/abs/2606.01802v3
MOSS-Audio Technical Report
2026-06-05T13:33:35Z
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs.
Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
2026-06-01T07:19:22Z
Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Donghua Yu, Jun Zhan, Kang Yu, Kexin Huang, Liwei Fan, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Xingjian Zhao, Yang Gao, Yitian Gong, Yiyang Zhang, Zhe Xu, Xipeng Qiu

http://arxiv.org/abs/2606.07240v1
KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026
2026-06-05T13:09:21Z
Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary.
We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
2026-06-05T13:09:21Z
Seymanur Akti, Alexander Waibel

http://arxiv.org/abs/2606.07229v1
MMAE: A Massive Multitask Audio Editing Benchmark
2026-06-05T12:52:41Z
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework.
By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
2026-06-05T12:52:41Z
Open-Source at https://github.com/ddlBoJack/MMAE
Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

http://arxiv.org/abs/2606.07210v1
A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization
2026-06-05T12:21:25Z
Speech anonymization is commonly evaluated using average-case metrics such as the equal error rate, which can hide large disparities in re-identification risks across individuals. In this paper, we conduct a large-scale per-speaker privacy analysis using a linkability-based metric under a worst-case scenario. Nearly 5,000 speakers are evaluated across multiple anonymization systems, attacker architectures, and conversation lengths. While linkability scores are highly polarized at the speaker level, the sets of easy to re-identify and hard to re-identify speakers vary substantially across configurations.
We show that no single factor explains speaker vulnerability. Instead, the re-identification risk emerges from the interaction between the attacker, the anonymizer, and the amount of available speech. These results challenge the notion of intrinsic speaker-level privacy risks and emphasize the need for evaluation protocols that are explicitly conditioned on the attacker and anonymizer.
2026-06-05T12:21:25Z
Accepted to Interspeech
Orane Dufour, Paul Magron, Mickael Rouvier, Emmanuel Vincent

http://arxiv.org/abs/2606.07207v1
Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development
2026-06-05T12:19:00Z
Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, the opposite of mode collapse. This works because in supervised diffusion the gradient direction is locked to ground truth, so confidence only scales the step size, and because temporal entropy downweights flat samples while preserving high-contrast ones.
The result is an online, self-referential data curriculum that emerges purely from the forward pass, with analyzed noise-level dynamics and testable predictions.
2026-06-05T12:19:00Z
Zixi Li, Youzhen Li

http://arxiv.org/abs/2603.26394v2
CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding
2026-06-05T12:17:40Z
A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aims to identify the attended speech stream in a multi-speaker scenario from neural recordings. Entrainment-based AAD approaches typically assume access to clean speech sources and electroencephalography (EEG) signals to exploit low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks in sequence processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by employing separate causal and anticausal convolutions respectively, with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrated that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration distributions. Beyond accuracy, the model demonstrated spatial robustness across different conditions, as the EEG spatial filters exhibited stable patterns across datasets.
Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while considering practical benefits for online processing scenarios. These findings contribute to advancing the state of AAD and its applicability in real-world systems.
2026-03-27T13:21:28Z
10+2(refs) pages, 5 figures, 4 Tables, IEEE transactions preprint
Iñigo García-Ugarte, Rubén Eguinoa, Ricardo San Martín, Daniel Paternain, Carmen Vidaurre

http://arxiv.org/abs/2512.00883v3
Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents
2026-06-05T10:27:51Z
World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration.
Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.
2025-11-30T13:11:56Z
Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng

http://arxiv.org/abs/2606.07080v1
dots.tts Technical Report
2026-06-05T09:19:24Z
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively.
To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
2026-06-05T09:19:24Z
Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu

http://arxiv.org/abs/2606.07030v1
Phonetic Error Analysis of Raw Waveform Acoustic Models
2026-06-05T08:19:51Z
We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
2026-06-05T08:19:51Z
INTERSPEECH 2026
Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

http://arxiv.org/abs/2606.07015v1
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation
2026-06-05T07:59:17Z
While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC.
Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that the framework achieves state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.
2026-06-05T07:59:17Z
Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

http://arxiv.org/abs/2606.06975v1
MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds
2026-06-05T07:07:22Z
Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present MyGardenBird, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83–59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level.
Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92–96%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.
2026-06-05T07:07:22Z
17 pages, 9 figures
Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

http://arxiv.org/abs/2606.06940v1
Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models
2026-06-05T06:11:38Z
While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM (https://github.com/zxzhao0/CogAudio-LLM), a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a "lexically-identical, multi-emotion" dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process.
Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.
2026-06-05T06:11:38Z
Accepted by Interspeech 2026
Zhixian Zhao, Shuiyuan Wang, Wenjie Tian, Jingbin Hu, Ziyu Zhang, Lei Xie

http://arxiv.org/abs/2606.05763v2
M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition
2026-06-05T06:11:23Z
Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes.
The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
2026-06-04T06:44:54Z
submitted to IEEE Transactions on Audio, Speech, and Language Processing
Fei Su, Cancan Li, Ming Li, Juan Liu

http://arxiv.org/abs/2606.05739v2
Do speech foundation models perceive speaker similarity as humans do?
2026-06-05T05:57:01Z
This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
2026-06-04T06:04:18Z
Accepted by INTERSPEECH 2026
Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito
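
The comparison described in the last abstract above, model-derived embedding distances set against human similarity ratings, can be illustrated with a small sketch. The abstract does not say which distance or agreement measure the authors use; cosine similarity and Spearman rank correlation are assumptions chosen here because they are common for this kind of analysis. The toy embeddings, the ratings, and every function name are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: one embedding pair per rated voice pair.
embedding_pairs = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
human_scores = [4.5, 3.0, 1.0]  # hypothetical listener ratings, higher = more similar

model_scores = [cosine_similarity(a, b) for a, b in embedding_pairs]
print("rank agreement:", round(spearman(model_scores, human_scores), 3))
```

A rank correlation only asks whether the model orders voice pairs the same way listeners do, so it does not depend on the scale of either score. With SciPy installed, `scipy.stats.spearmanr` could replace the hand-written helpers.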