https://arxiv.org/api/Siyrt/2J8sS6km4XkR/8R6r01M4 2026-06-10T00:33:29Z 20931 60 15 http://arxiv.org/abs/2606.07259v1 Assessing True Generalisability of Audio-Visual Speech Recognisers 2026-06-05T13:35:10Z

Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.

2026-06-05T13:35:10Z Accepted to Interspeech 2026 Long paper track. 9 pages, 4 figures Zhaofeng Lin Stavros Petridis Maja Pantic Naomi Harte http://arxiv.org/abs/2606.01802v3 MOSS-Audio Technical Report 2026-06-05T13:33:35Z

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.

2026-06-01T07:19:22Z Chen Yang Chufan Yu Hanfu Chen Jie Zhu Jingqi Chen Ke Chen Wenxuan Wang Yang Wang Yaozhou Jiang Yi Jiang Zhengyuan Lin Ziqi Chen Zhaoye Fei Chenghao Liu Donghua Yu Jun Zhan Kang Yu Kexin Huang Liwei Fan Mingshu Chen Qinyuan Cheng Ruixiao Li Shimin Li Songlin Wang Xingjian Zhao Yang Gao Yitian Gong Yiyang Zhang Zhe Xu Xipeng Qiu http://arxiv.org/abs/2606.07240v1 KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026 2026-06-05T13:09:21Z

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

2026-06-05T13:09:21Z Seymanur Akti Alexander Waibel http://arxiv.org/abs/2606.07229v1 MMAE: A Massive Multitask Audio Editing Benchmark 2026-06-05T12:52:41Z

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

2026-06-05T12:52:41Z Open-Source at https://github.com/ddlBoJack/MMAE Ziyang Ma Ruiqi Yan Ruiyang Xu Jie Fang Zhikang Niu Yi-Wen Chao Wenming Tu Tianrui Wang Auden Qi Chen Wenxi Chen Jiaying Chi Yanru Huo Zixuan Jiang Xiquan Li Yalin Li Junxi Liu Minghao Liu Binghao Qiang Yijia Shan Zheshu Song Tian Tan Zixiang Wang Zeyu Xie Zhifei Xie Xiaoyu Xing Qixiang Xu Chen Yang Guanrou Yang Shan Yang Yifan Yang Steve Yves Haotian Zhang Haina Zhu Kai Yu Liefeng Bo Eng-Siong Chng Xie Chen http://arxiv.org/abs/2606.07210v1 A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization 2026-06-05T12:21:25Z

Speech anonymization is commonly evaluated using averagecase metrics such as the equal error rate, which can hide large disparities in re-identification risks across individuals. In this paper, we conduct a large-scale per-speaker privacy analysis using a linkability-based metric under a worst-case scenario. Nearly 5,000 speakers are evaluated across multiple anonymization systems, attacker architectures, and conversation lengths. While linkability scores are highly polarized at the speaker level, the sets of easy to re-identify and hard to re-identify speakers vary substantially across configurations. We show that no single factor explains speaker vulnerability. Instead, the re-identification risk emerges from the interaction between the attacker, the anonymizer, and the amount of available speech. These results challenge the notion of intrinsic speaker-level privacy risks and emphasize the need for evaluation protocols that are explicitly conditioned on the attacker and anonymizer.

2026-06-05T12:21:25Z Accepted to Interspeech Orane Dufour Paul Magron Mickael Rouvier Emmanuel Vincent http://arxiv.org/abs/2606.07207v1 Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development 2026-06-05T12:19:00Z

Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, the opposite of mode collapse. This works because in supervised diffusion the gradient direction is locked to ground truth, so confidence only scales the step size, and because temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass, with analyzed noise-level dynamics and testable predictions.

2026-06-05T12:19:00Z Zixi Li Youzhen Li http://arxiv.org/abs/2603.26394v2 CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding 2026-06-05T12:17:40Z

A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aim to identify the attended speech stream in a multiple speaker scenario from neural recordings. Entrainment-based AAD approaches, typically assume access to clean speech sources and electroencephalography (EEG) signals to exploit low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks in sequence processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by employing separate causal and anticausal convolutions respectively, with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrated that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration distributions. Beyond accuracy, the model demonstrated spatial robustness across different conditions, as the EEG spatial filters exhibited stable patterns across datasets. Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while considering practical benefits for online processing scenarios. These findings contribute to advancing the state of AAD and its applicability in real-world systems.

2026-03-27T13:21:28Z 10+2(refs) pages, 5 figures, 4 Tables, IEEE transactions preprint Iñigo García-Ugarte Rubén Eguinoa Ricardo San Martín Daniel Paternain Carmen Vidaurre http://arxiv.org/abs/2512.00883v3 Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents 2026-06-05T10:27:51Z

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.

2025-11-30T13:11:56Z Jiahua Wang Leqi Zheng Jialong Wu Yaoxin Mao Shijie Cheng http://arxiv.org/abs/2606.07080v1 dots.tts Technical Report 2026-06-05T09:19:24Z

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.

2026-06-05T09:19:24Z Shi Lian Changtao Li Bohan Li Hankun Wang Da Zheng Junfeng Tian Yufeng Ma Colin Zhang Kai Yu http://arxiv.org/abs/2606.07030v1 Phonetic Error Analysis of Raw Waveform Acoustic Models 2026-06-05T08:19:51Z

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.

2026-06-05T08:19:51Z INTERSPEECH2026 Erfan Loweimi Zhengjun Yue Andrea Carmantini Zoran Cvetkovic Steve Renals Peter Bell http://arxiv.org/abs/2606.07015v1 Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation 2026-06-05T07:59:17Z

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed isolated: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.

2026-06-05T07:59:17Z Ziyu Zhang Chunyu Qiang Xiaopeng Wang Yuxin Guo Kang Yin Wenjie Tian Jingbin Hu Tianlun Zuo Zhao Guo Teng Ma Yuzhe Liang Chen Zhang Lei Xie http://arxiv.org/abs/2606.06975v1 MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds 2026-06-05T07:07:22Z

Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present \textbf{MyGardenBird}, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83--59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92--96\%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.

2026-06-05T07:07:22Z 17 pages, 9 figures Muhammad Mun'im Ahmad Zabidi Mohd Yamani Idna Idris Norisma Idris http://arxiv.org/abs/2606.06940v1 Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models 2026-06-05T06:11:38Z

While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM\footnote{ \urlstyle{same} https://github.com/zxzhao0/CogAudio-LLM, a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a ``lexically-identical, multi-emotion'' dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.

2026-06-05T06:11:38Z Accepted by Interspeech2026 Zhixian Zhao Shuiyuan Wang Wenjie Tian Jingbin Hu Ziyu Zhang Lei Xie http://arxiv.org/abs/2606.05763v2 M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition 2026-06-05T06:11:23Z

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.

2026-06-04T06:44:54Z submitted to IEEE Transactions on Audio, Speech, and Language Processing Fei Su Cancan Li Ming Li Juan Liu http://arxiv.org/abs/2606.05739v2 Do speech foundation models perceive speaker similarity as humans do? 2026-06-05T05:57:01Z

This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representation. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.

2026-06-04T06:04:18Z Accepted by INTERSPEECH 2026 Minoru Kishi Hayato Yagi Shinnosuke Takamichi Yuki Saito