http://arxiv.org/abs/2510.04593v3 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models 2026-06-08T05:49:30Z

Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method matches or exceeds current single-task models on both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
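
The mask switching described in this abstract can be sketched as follows. This is a minimal illustration, assuming a boolean mask in which True means a query position may attend to a key position. The function name `build_attention_mask` and the mode strings are hypothetical, and the abstract does not say how UniVoice arranges text-prefix and speech positions within the sequence.

```python
def build_attention_mask(seq_len, mode):
    """Return a seq_len x seq_len boolean mask (True = attention allowed).

    mode="causal": position i sees positions 0..i (autoregressive recognition).
    mode="bidirectional": every position sees every position (synthesis).
    """
    if mode == "causal":
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    if mode == "bidirectional":
        return [[True] * seq_len for _ in range(seq_len)]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical usage: one shared model, two masks chosen per task.
asr_mask = build_attention_mask(3, "causal")
tts_mask = build_attention_mask(3, "bidirectional")
```

The point of the sketch is only that the same attention layers can serve both tasks when the mask is selected per task; it says nothing about how the released code implements this.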

2025-10-06T08:47:38Z Accepted at Interspeech 2026 Wenhao Guan Zhikang Niu Ziyue Jiang Kaidi Wang Peijie Chen Qingyang Hong Lin Li Xie Chen http://arxiv.org/abs/2606.09050v1 MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion 2026-06-08T05:39:23Z

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.

2026-06-08T05:39:23Z Accepted by Interspeech 2026 Guobin Ma Yuxuan Xia Yuepeng Jiang Dake Guo Hanke Xie Jingbin Hu Yanbo Wang Lei Xie Pengcheng Zhu http://arxiv.org/abs/2606.09048v1 BareWave: Waveform-Native Flow-Matching Text-to-Speech 2026-06-08T05:36:42Z

Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.

2026-06-08T05:36:42Z Under Review Wei Fan Chao-Hong Tan Qian Chen Wen Wang Xiangang Li Kejiang Chen Weiming Zhang Nenghai Yu http://arxiv.org/abs/2510.05478v3 AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning 2026-06-08T05:16:04Z

Large Audio Language Models (LALMs) exhibit strong capabilities in general audio understanding but remain static after deployment, limiting their adaptability to real-world data. Since supervised fine-tuning is costly, we propose AQA-TTRL, a novel framework for audio understanding that enables on-the-fly evolution via test-time reinforcement learning using only unlabeled test data. It generates pseudo-labels via majority voting and optimizes the model through reinforcement learning. To address the noise in self-generated labels, we introduce confidence weighting to adjust training signals. Furthermore, multiple-attempt sampling mitigates advantage collapse and stabilizes training. Across MMAU, MMAR, and MMSU, AQA-TTRL achieves significant average improvements of 4.42% for Qwen2.5-Omni 7B and 11.04% for the 3B model. Notably, the adapted 3B model outperforms direct inference of the unadapted 7B model, highlighting the effectiveness of test-time adaptation in audio understanding.
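
The pseudo-labeling step can be illustrated with a short sketch. It assumes that the confidence of a pseudo-label is simply its vote share and that each sampled answer receives a binary agreement reward scaled by that confidence. Both choices are assumptions made for illustration, since the abstract does not give the exact weighting, and the function names are hypothetical.

```python
from collections import Counter


def pseudo_label(answers):
    """Majority-vote label plus its vote share, used here as a confidence score."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def weighted_rewards(answers):
    """Reward 1 for samples that agree with the pseudo-label, scaled by confidence."""
    label, confidence = pseudo_label(answers)
    return [confidence * (1.0 if a == label else 0.0) for a in answers]


# Hypothetical answers sampled from the model for one unlabeled test question.
sampled = ["B", "B", "A", "B"]
label, confidence = pseudo_label(sampled)
rewards = weighted_rewards(sampled)
```

With this weighting, a narrowly won vote produces a weaker training signal than a unanimous one, which is one plausible way to dampen noisy pseudo-labels. The reinforcement learning update itself is not shown.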

2025-10-07T00:39:14Z Accepted to INTERSPEECH 2026 Haoyu Zhang Jiaxian Guo Dong Yang Yusuke Iwasawa Yutaka Matsuo http://arxiv.org/abs/2606.08898v1 Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training 2026-06-08T00:50:39Z

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase, without considering the possibility of a decrease. In practice, however, the number of classes can both increase and decrease. In this paper, we investigate the problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes increases or decreases. We propose an FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure changes dynamically as the set of classes changes. In addition, we design a pseudo class-variable training strategy to enhance the model's adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: https://github.com/cgq2971-afk/FCIAC.

2026-06-08T00:50:39Z This paper has been accepted for publication in Interspeech 2026. 4 Tables and 4 Figures Yanxiong Li Guoqing Chen Qianqian Li Sen Huang http://arxiv.org/abs/2606.08663v1 Probing Token Spaces under Generator Shift in AI-Generated Music Detection 2026-06-07T15:08:19Z

AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.

2026-06-07T15:08:19Z Accepted to ICML 2026 ML4Audio workshop Joonyong Park Jungwoo Kim Junyoung Koh Yuki Saito http://arxiv.org/abs/2502.16584v2 Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound 2026-06-07T13:34:59Z

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.

2025-02-23T14:24:15Z Liumeng Xue Ziya Zhou Jiahao Pan Zixuan Li Shuai Fan Yinghao Ma Sitong Cheng Dongchao Yang Haohan Guo Yujia Xiao Xinsheng Wang Zixuan Shen Chuanbo Zhu Xinshen Zhang Tianchi Liu Ruibin Yuan Zeyue Tian Haohe Liu Xingjian Du Emmanouil Benetos Ge Zhang Yike Guo Wei Xue http://arxiv.org/abs/2606.08580v1 G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching 2026-06-07T11:28:32Z

Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.

2026-06-07T11:28:32Z Accepted to Interspeech 2026 Yike Zhu Ziqian Wang Zikai Liu Xingchen Li Zhuangqi Chen Xianjun Xia Chuanzeng Huang Lei Xie http://arxiv.org/abs/2601.09239v6 DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion 2026-06-07T10:32:48Z

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/

2026-01-14T07:22:24Z Submitted to ACL ARR (May 2026) Hanlin Zhang Daxin Tan Dehua Tao Xiao Chen Haochen Tan Yunhe Li Yuchen Cao Linqi Song http://arxiv.org/abs/2603.12046v2 Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition 2026-06-07T09:33:32Z

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
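
To make the notion of a modality contribution concrete, here is a minimal sketch of exact Shapley values for a two-player game in which the players are the audio and visual streams. It is a deliberate simplification: the abstract does not specify the value function or the granularity at which Dr. SHAP-AV attributes (it may well operate on individual features or tokens), and the coalition scores below are hypothetical numbers, not results from the paper.

```python
def two_modality_shapley(v_none, v_audio, v_visual, v_both):
    """Exact Shapley values for a two-player game (audio, visual).

    Each value averages a player's marginal contribution over both join orders.
    """
    phi_audio = 0.5 * (v_audio - v_none) + 0.5 * (v_both - v_visual)
    phi_visual = 0.5 * (v_visual - v_none) + 0.5 * (v_both - v_audio)
    return phi_audio, phi_visual


# Hypothetical scores for each input coalition (higher is better).
phi_a, phi_v = two_modality_shapley(v_none=0.0, v_audio=0.6, v_visual=0.3, v_both=0.9)

# Relative audio contribution, normalized by total absolute attribution.
audio_share = abs(phi_a) / (abs(phi_a) + abs(phi_v))
```

By the efficiency property, the two values always sum to the gain of the full input over the empty input, which is what allows a relative share per modality to be reported.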

2026-03-12T15:22:27Z Accepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AV Umberto Cappellazzo Stavros Petridis Maja Pantic http://arxiv.org/abs/2606.08524v1 Acoustic disguising: a unified framework for cloaking and holography 2026-06-07T09:03:40Z

Cloaking and holography -- usually treated as distinct problems -- are two limits of a single operation that we call acoustic disguising, realized here using immersive boundary conditions on a closed surface. Driving the boundary with homogeneous Green's functions suppresses any incident field inside the enclosed volume and cloaks unknown objects broadband; driving it with scattering Green's functions synthesizes a holographic scatterer indistinguishable from a target for arbitrary illuminations. Combining the two, using heterogeneous Green's functions, replaces the scattering signature of one object with that of another, transforming its acoustic identity. We demonstrate the framework in three-dimensional FDTD simulations driven by impulsive Green's functions, complemented by data-driven Green's-function retrieval, establishing a direct route to real-time 3D acoustic cloaking, holography, cloning, and disguising.

2026-06-07T09:03:40Z 8 pages, 5 figures; Supplemental Material included (24 pages, 21 figures). Supplementary videos: https://jmullerresearch.ch/acoustic-disguising.html ; source code: https://github.com/Nano560/acoustic-disguising ; data and code archived at Zenodo: https://doi.org/10.5281/zenodo.20433701 Jonas Müller Dirk-Jan van Manen http://arxiv.org/abs/2606.08505v1 Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines 2026-06-07T08:10:14Z

Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe, combining a coarser segmentation stride with per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.
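
The rule mcs = round(f * n) and the "about 89%" figure can both be checked with a few lines of arithmetic. In this sketch the function names and the embedding counts are hypothetical, while the DER values are those quoted in the abstract.

```python
def relative_min_cluster_size(n_embeddings, f=0.01):
    """mcs = round(f * n), as stated in the abstract.

    Note: Python's round() sends exact halves to the nearest even integer;
    the abstract does not say which tie rule the author uses.
    """
    return round(f * n_embeddings)


def recovered_fraction(baseline, degraded, fixed):
    """Share of the lost DER that the fix wins back."""
    return (degraded - fixed) / (degraded - baseline)


mcs_long = relative_min_cluster_size(3000)   # hypothetical long recording
mcs_short = relative_min_cluster_size(400)   # hypothetical short recording
recovery = recovered_fraction(baseline=0.075, degraded=0.113, fixed=0.079)
```

Because the threshold scales with n, a recording that yields fewer embeddings after stride acceleration also gets a smaller minimum cluster size, so small but genuine speaker clusters are less likely to be discarded. The recovery works out to 0.034 / 0.038, which is roughly 0.89 and consistent with the stated figure.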

2026-06-07T08:10:14Z Fumiaki Yamaguchi http://arxiv.org/abs/2602.07977v2 Detect, Attend and Extract: Keyword Guided Target Speaker Extraction 2026-06-07T08:01:33Z

Target speaker extraction (TSE) aims to extract the speech of a target speaker from mixtures containing multiple competing speakers. Conventional TSE systems predominantly rely on speaker cues, such as pre-enrolled speech, to identify and isolate the target speaker. However, in many practical scenarios, clean enrollment utterances are unavailable, limiting the applicability of existing approaches. In this work, we propose DAE-TSE, a keyword-guided TSE framework that specifies the target speaker through distinct keywords they utter. By leveraging keywords (i.e., partial transcriptions) as cues, our approach provides a flexible and practical alternative to enrollment-based TSE. DAE-TSE follows the Detect-Attend-Extract (DAE) paradigm: it first detects the presence of the given keywords, then attends to the corresponding speaker based on the keyword content, and finally extracts the target speech. Experimental results demonstrate that DAE-TSE outperforms standard TSE systems that rely on clean enrollment speech. To the best of our knowledge, this is the first study to utilize partial transcription as a cue for specifying the target speaker in TSE, offering a flexible and practical solution for real-world scenarios. Our code and demo page are now publicly available.

2026-02-08T14:06:11Z 4 figures, 4 tables. Accepted by IJCAI-ECAI 2026 Haoyu Li Yu Xi Yidi Jiang Shuai Wang Kate Knill Mark Gales Haizhou Li Kai Yu http://arxiv.org/abs/2603.08977v2 Universal Speech Content Factorization 2026-06-07T05:11:25Z

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.

2026-03-09T22:11:40Z Accepted to Interspeech 2026 Henry Li Xinyuan Zexin Cai Lin Zhang Leibny Paola García-Perera Berrak Sisman Sanjeev Khudanpur Nicholas Andrews Matthew Wiesner http://arxiv.org/abs/2606.08435v1 Sound Field Interpolation Using Physics-Informed Extreme Learning Machine with Pre-Training 2026-06-07T03:21:17Z

Numerous machine learning-based sound field interpolation methods have been proposed. In particular, physics-informed neural networks (PINNs) can accurately interpolate sound fields from a small number of microphones. However, their high computational cost and long training time pose practical challenges for applications requiring real-time processing or online learning. To address this, we propose a hybrid framework that combines PINN-based pre-training with a physics-informed extreme learning machine (PIELM) tailored for acoustic fields. By replacing iterative PINN fine-tuning for each target sound field with closed-form output-layer adaptation using hidden-layer weights pre-trained by PINN, the proposed method efficiently interpolates unknown sound fields from limited observations. Simulation results under simplified one-dimensional free-field conditions demonstrate that, given a pre-trained model, the proposed method achieves interpolation accuracy comparable to that of PINN-based fine-tuning while reducing the adaptation time by more than three orders of magnitude.
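
For orientation, a generic physics-informed ELM adaptation step can be written in closed form. The following is a sketch under stated assumptions, not the authors' exact formulation: it assumes a real-valued, frequency-domain field governed by the one-dimensional Helmholtz equation with wavenumber $k$, a single hidden layer with activation $\phi$ whose weights $w_j, b_j$ are frozen after pre-training, observation points stacked in $\mathbf{H}_{\mathrm{obs}}$, collocation points stacked in $\mathbf{H}_{\mathrm{col}}$ (with $\mathbf{H}''_{\mathrm{col}}$ the second spatial derivative of the hidden features), and a hypothetical penalty weight $\lambda$.

```latex
\hat{p}(x) = \sum_{j=1}^{J} \beta_j \, \phi(w_j x + b_j),
\qquad
\frac{d^2 \hat{p}}{dx^2} + k^2 \hat{p} = 0

\boldsymbol{\beta}^{\star}
= \arg\min_{\boldsymbol{\beta}}
  \left\| \mathbf{H}_{\mathrm{obs}} \boldsymbol{\beta} - \mathbf{p}_{\mathrm{obs}} \right\|_2^2
  + \lambda \left\| \mathbf{G} \boldsymbol{\beta} \right\|_2^2
= \left( \mathbf{H}_{\mathrm{obs}}^{\top} \mathbf{H}_{\mathrm{obs}}
  + \lambda \, \mathbf{G}^{\top} \mathbf{G} \right)^{-1}
  \mathbf{H}_{\mathrm{obs}}^{\top} \mathbf{p}_{\mathrm{obs}},
\qquad
\mathbf{G} = \mathbf{H}''_{\mathrm{col}} + k^2 \mathbf{H}_{\mathrm{col}}
```

Only the output weights $\boldsymbol{\beta}$ are solved for, so adapting to a new sound field costs one linear solve instead of many gradient steps, which is the kind of saving the abstract reports. The closed form assumes the bracketed matrix is invertible.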

2026-06-07T03:21:17Z This work has been submitted to the IEEE for possible publication Hayato Komaba Gen Sato Ken Kurata Yusuke Ikeda