https://arxiv.org/api/BwHOlwMuZ93O2ibgcckc2hA7mNg 2026-06-18T07:33:32Z 21755 180 15 http://arxiv.org/abs/2606.09345v1 A study on the impact of region specific data on the performance of Indic ASR 2026-06-08T11:12:28Z

Automatic Speech Recognition (ASR) systems are widely deployed across linguistically diverse regions, yet their ability to generalize across fine-grained geographic variation remains underexplored. We present a systematic study of cross-district ASR generalization for Indian languages, analyzing the impact of regional variation on performance. Using finetuning as a controlled probe, we train models on speech from a single district and evaluate them on other districts within the same language. We examine trends across multiple train-test district pairs and quantify performance differences. To assess geographic effects, we analyze the correlation between WER and inter-district distance using two distance measures. Our results show consistent correlations between geographic distance and WER, highlighting the challenges of regional generalization and the need for geographically diverse speech data in ASR development and evaluation in India.
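
The correlation analysis described above can be illustrated with a minimal sketch. The abstract does not name its two distance measures or its correlation statistic, so the great-circle (haversine) distance and the Spearman rank correlation used here are assumptions, and the numeric values are hypothetical placeholders rather than results from the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical placeholder values for four train-test district pairs.
distances_km = [40.0, 120.0, 310.0, 560.0]
wers = [0.18, 0.21, 0.27, 0.26]
rho = spearman(distances_km, wers)
```

A rank correlation is used in this sketch because it only assumes that WER changes monotonically with distance, not linearly.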

2026-06-08T11:12:28Z Agneedh Basu Pavan Kumar J Pranav Bhat Sujith Pulikodan Visruth Sanka Nihar Desai Prasanta Kumar Ghosh http://arxiv.org/abs/2606.09342v1 Parameter-Efficient Continual Learning for Automatic Speech Recognition 2026-06-08T11:08:24Z

Speech foundation models enable strong general-purpose ASR and are attractive for downstream adaptation. However, their size and the catastrophic forgetting induced by sequential fine-tuning demand parameter-efficient and regularized training methods, motivating parameter-efficient continual learning (PECL). While PECL has been widely studied in NLP and vision, it has received less attention in ASR. In this paper, we propose a simple yet effective PECL method based on recent advances in parameter-efficient fine-tuning for ASR. We partition pretrained weight matrices into head and tail subspaces according to singular values and restrict adaptation to approximate rotations within the low-energy tail subspace, preserving dominant components and reducing forgetting. For subsequent tasks, rotations are combined via weight averaging to further improve retention. Experiments on two benchmarks demonstrate reduced forgetting and superior overall performance compared to recent PECL baselines.

2026-06-08T11:08:24Z Accepted at Interspeech 2026 Steven Vander Eeckt Hugo Van hamme http://arxiv.org/abs/2606.09335v1 Factors affecting ASR performance: A study using state of the art ASR models in Indic Languages 2026-06-08T11:03:25Z

ASR performance varies across languages, speakers, and recording conditions, yet systematic analysis for Indic languages remains limited. We present a large-scale study of decoded outputs from multiple open-source ASR models evaluated on diverse Indian speech datasets in zero-shot settings. We analyze linguistic, speaker-level, and acoustic factors across Hindi, Bengali, Kannada, Telugu, and Marathi. We examine correlations between WER and speaker traits such as average word length, speaking rate, and utterance duration across multiple model-dataset pairs. For Hindi, we further analyze audio factors including telephone codecs, bit depth, resampling, and background noise. Results reveal both cross-lingual patterns and language-specific sensitivities, showing how speaker behavior and signal processing choices affect ASR robustness in real-world Indic scenarios.
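
Since the analysis above is built on WER, a short sketch of the standard word error rate may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and whitespace tokenization are assumptions for illustration; the paper's text normalization is not described in the abstract.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            )
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.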

2026-06-08T11:03:25Z Agneedh Basu Pavan Kumar J Pranav Bhat Sujith Pulikodan Visruth Sanka Nihar Desai Prasanta Kumar Ghosh http://arxiv.org/abs/2606.09317v1 A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification 2026-06-08T10:27:33Z

Spoken language identification (LID) for Indian languages is a challenging problem due to the large number of languages, significant phonetic overlap among related varieties, and the scarcity of labeled data for many low-resource languages. In this work, we present a systematic comparative study of two pre-trained speech encoders -- Whisper and FastConformer -- combined with a linear classifier for large-scale Indic LID spanning 42 languages across four linguistic families. We evaluate both encoders in frozen (linear probing) and fine-tuned settings, and compare three training objectives: cross-entropy (CE), supervised contrastive loss with cross-entropy (CE+SupCon), and hierarchical softmax (HSM). Models are trained on the Vaani dataset and evaluated in a cross-corpus setting on Vaani-Test (held-out), FLEURS, and Kathbath, providing insights into domain generalization. The frozen FastConformer encoder achieves over 90% macro accuracy on FLEURS and Kathbath without any task-specific adaptation, substantially outperforming Whisper on out-of-domain benchmarks, while fine-tuned Whisper yields stronger in-domain performance. HSM consistently outperforms CE and CE+SupCon for both encoders across all benchmarks, with the largest gains on out-of-domain test sets. CE+SupCon degrades FastConformer's cross-corpus generalization, suggesting that the contrastive objective over-specializes representations to in-domain conditions. Per-family analysis shows that Central Indo-Aryan varieties are the hardest to discriminate, with Hindi--Urdu and the Sadri--Chhattisgarhi--Surgujia cluster being the dominant confusion pairs.
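
To make the hierarchical softmax objective concrete, here is a minimal sketch assuming a two-level hierarchy in which a language's probability is the product of its family's probability and its probability within that family. The abstract does not specify the hierarchy, so this structure, the function names, and the three-language example are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def hierarchical_probs(family_logits, language_logits):
    """
    family_logits:   {family: logit}
    language_logits: {family: {language: logit}}
    Returns {language: p(family) * p(language | family)}.
    """
    families = list(family_logits)
    p_family = dict(zip(families, softmax([family_logits[f] for f in families])))
    probs = {}
    for f in families:
        langs = list(language_logits[f])
        p_within = softmax([language_logits[f][lang] for lang in langs])
        for lang, p in zip(langs, p_within):
            probs[lang] = p_family[f] * p
    return probs

# Hypothetical logits for a toy three-language label set.
example = hierarchical_probs(
    {"Indo-Aryan": 0.0, "Dravidian": 0.0},
    {"Indo-Aryan": {"Hindi": 0.0, "Urdu": 0.0}, "Dravidian": {"Kannada": 0.0}},
)
```

Under this factorization the negative log-likelihood of the correct language splits into a family-level term plus a within-family term, so related varieties compete only with each other at the second level.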

2026-06-08T10:27:33Z Agneedh Basu Pavan Kumar J Sujith P Visruth Sanka Nihar Desai Prasanta Kumar Ghosh http://arxiv.org/abs/2606.06037v2 SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech 2026-06-08T08:49:38Z

Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

2026-06-04T11:31:38Z Virginia Ceccatelli Yejin Jeon David Ifeoluwa Adelani http://arxiv.org/abs/2603.11669v2 SEMamba++: A General Speech Restoration Framework Leveraging Global, Local, and Periodic Spectral Patterns 2026-06-08T07:16:00Z

General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models like SEMamba have advanced the state-of-the-art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose the Global, Local, and Periodic (GLP) module, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. Then, we design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.

2026-03-12T08:37:28Z Accepted to Interspeech 2026 Long paper track. Project page: https://sites.google.com/view/semambapp Yongjoon Lee Jung-Woo Choi http://arxiv.org/abs/2510.04593v3 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models 2026-06-08T05:49:30Z

Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
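
The dual attention mechanism mentioned above switches between two mask types, which can be sketched as follows. This is a simplified illustration, not the UniVoice implementation: the function name, the mode strings, and the convention that `True` means "query may attend to key" are assumptions, and the actual model may mix both patterns within one sequence.

```python
def attention_mask(seq_len, mode):
    """Return a seq_len x seq_len boolean mask; mask[q][k] is True if query q may attend to key k."""
    if mode == "causal":
        # Recognition: each position sees itself and earlier positions only.
        return [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    if mode == "bidirectional":
        # Synthesis: every position sees the whole sequence.
        return [[True] * seq_len for _ in range(seq_len)]
    raise ValueError(f"unknown mode: {mode!r}")
```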

2025-10-06T08:47:38Z Accepted at Interspeech 2026 Wenhao Guan Zhikang Niu Ziyue Jiang Kaidi Wang Peijie Chen Qingyang Hong Lin Li Xie Chen http://arxiv.org/abs/2606.09050v1 MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion 2026-06-08T05:39:23Z

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.

2026-06-08T05:39:23Z Accepted by Interspeech 2026 Guobin Ma Yuxuan Xia Yuepeng Jiang Dake Guo Hanke Xie Jingbin Hu Yanbo Wang Lei Xie Pengcheng Zhu http://arxiv.org/abs/2606.09048v1 BareWave: Waveform-Native Flow-Matching Text-to-Speech 2026-06-08T05:36:42Z

Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.

2026-06-08T05:36:42Z Under Review Wei Fan Chao-Hong Tan Qian Chen Wen Wang Xiangang Li Kejiang Chen Weiming Zhang Nenghai Yu http://arxiv.org/abs/2510.05478v3 AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning 2026-06-08T05:16:04Z

Large Audio Language Models (LALMs) exhibit strong capabilities in general audio understanding but remain static after deployment, limiting their adaptability to real-world data. Since supervised fine-tuning is costly, we propose AQA-TTRL, a novel framework for audio understanding that enables on-the-fly evolution via test-time reinforcement learning using only unlabeled test data. It generates pseudo-labels via majority voting and optimizes the model through reinforcement learning. To address the noise in self-generated labels, we introduce confidence weighting to adjust training signals. Furthermore, multiple-attempt sampling mitigates advantage collapse and stabilizes training. Across MMAU, MMAR, and MMSU, AQA-TTRL achieves significant average improvements of 4.42% for Qwen2.5-Omni 7B and 11.04% for the 3B model. Notably, the adapted 3B model outperforms direct inference of the unadapted 7B model, highlighting the effectiveness of test-time adaptation in audio understanding.
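
The pseudo-labeling step described above can be sketched as follows. The abstract does not give the exact confidence-weighting formula, so using the vote share of the majority answer as the confidence, and scaling an agreement reward by it, is an assumption for illustration; the function names are hypothetical.

```python
from collections import Counter

def pseudo_label(answers):
    """Majority-vote pseudo-label and its vote share over the sampled answers."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def weighted_rewards(answers):
    """Reward each sampled answer for agreeing with the pseudo-label, scaled by confidence."""
    label, confidence = pseudo_label(answers)
    return [confidence * (1.0 if a == label else 0.0) for a in answers]

# Four hypothetical sampled answers to one multiple-choice audio question.
sampled = ["B", "A", "B", "C"]
label, confidence = pseudo_label(sampled)
rewards = weighted_rewards(sampled)
```

Scaling by the vote share means that questions on which the samples disagree contribute a weaker training signal than questions with a clear majority.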

2025-10-07T00:39:14Z Accepted to INTERSPEECH 2026 Haoyu Zhang Jiaxian Guo Dong Yang Yusuke Iwasawa Yutaka Matsuo http://arxiv.org/abs/2502.16584v2 Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound 2026-06-07T13:34:59Z

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.

2025-02-23T14:24:15Z Liumeng Xue Ziya Zhou Jiahao Pan Zixuan Li Shuai Fan Yinghao Ma Sitong Cheng Dongchao Yang Haohan Guo Yujia Xiao Xinsheng Wang Zixuan Shen Chuanbo Zhu Xinshen Zhang Tianchi Liu Ruibin Yuan Zeyue Tian Haohe Liu Xingjian Du Emmanouil Benetos Ge Zhang Yike Guo Wei Xue http://arxiv.org/abs/2606.08580v1 G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching 2026-06-07T11:28:32Z

Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.

2026-06-07T11:28:32Z Accepted to Interspeech 2026 Yike Zhu Ziqian Wang Zikai Liu Xingchen Li Zhuangqi Chen Xianjun Xia Chuanzeng Huang Lei Xie http://arxiv.org/abs/2601.09239v6 DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion 2026-06-07T10:32:48Z

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/

2026-01-14T07:22:24Z Submitted to ACL ARR 2026 May Hanlin Zhang Daxin Tan Dehua Tao Xiao Chen Haochen Tan Yunhe Li Yuchen Cao Linqi Song http://arxiv.org/abs/2603.12046v2 Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition 2026-06-07T09:33:32Z

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
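
As a rough illustration of Shapley attribution over modalities, the sketch below computes exact Shapley values by averaging each player's marginal contribution over all orderings. The value function and its scores are hypothetical placeholders, not numbers from the paper, and the framework described above operates on model inputs and outputs in ways the abstract does not detail.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical recognition scores for each subset of modalities.
scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.2,
    frozenset({"audio", "video"}): 0.9,
}
phi = shapley_values(["audio", "video"], scores.__getitem__)
```

The attributions sum to the score of the full input minus the score of the empty input, which is what makes them readable as a split of the total performance between audio and video. Enumerating all orderings is only practical for a handful of players; larger feature sets need sampling-based approximations.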

2026-03-12T15:22:27Z Accepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AV Umberto Cappellazzo Stavros Petridis Maja Pantic http://arxiv.org/abs/2606.08524v1 Acoustic disguising: a unified framework for cloaking and holography 2026-06-07T09:03:40Z

Cloaking and holography -- usually treated as distinct problems -- are two limits of a single operation that we call acoustic disguising, realized here using immersive boundary conditions on a closed surface. Driving the boundary with homogeneous Green's functions suppresses any incident field inside the enclosed volume and cloaks unknown objects broadband; driving it with scattering Green's functions synthesizes a holographic scatterer indistinguishable from a target for arbitrary illuminations. Combining the two, using heterogeneous Green's functions, replaces the scattering signature of one object with that of another, transforming its acoustic identity. We demonstrate the framework in three-dimensional FDTD simulations driven by impulsive Green's functions, complemented by data-driven Green's-function retrieval, establishing a direct route to real-time 3D acoustic cloaking, holography, cloning, and disguising.

2026-06-07T09:03:40Z 8 pages, 5 figures; Supplemental Material included (24 pages, 21 figures). Supplementary videos: https://jmullerresearch.ch/acoustic-disguising.html ; source code: https://github.com/Nano560/acoustic-disguising ; data and code archived at Zenodo: https://doi.org/10.5281/zenodo.20433701 Jonas Müller Dirk-Jan van Manen