https://arxiv.org/api/BwHOlwMuZ93O2ibgcckc2hA7mNg
2026-06-18T07:33:32Z
2175518015

http://arxiv.org/abs/2606.09345v1
A study on the impact of region specific data on the performance of Indic ASR
2026-06-08T11:12:28Z
Automatic Speech Recognition (ASR) systems are widely deployed across linguistically diverse regions, yet their ability to generalize across fine-grained geographic variation remains underexplored. We present a systematic study of cross-district ASR generalization for Indian languages, analyzing the impact of regional variation on performance. Using finetuning as a controlled probe, we train models on speech from a single district and evaluate them on other districts within the same language. We examine trends across multiple train-test district pairs and quantify performance differences. To assess geographic effects, we analyze the correlation between WER and inter-district distance using two distance measures. Our results show consistent correlations between geographic distance and WER, highlighting the challenges of regional generalization and the need for geographically diverse speech data in ASR development and evaluation in India.
2026-06-08T11:12:28Z
Agneedh Basu, Pavan Kumar J, Pranav Bhat, Sujith Pulikodan, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

http://arxiv.org/abs/2606.09342v1
Parameter-Efficient Continual Learning for Automatic Speech Recognition
2026-06-08T11:08:24Z
Speech foundation models enable strong general-purpose ASR and are attractive for downstream adaptation. However, their size and the catastrophic forgetting induced by sequential fine-tuning demand parameter-efficient and regularized training methods, motivating parameter-efficient continual learning (PECL). While PECL has been widely studied in NLP and vision, it has received less attention in ASR. In this paper, we propose a simple yet effective PECL method based on recent advances in parameter-efficient fine-tuning for ASR.
We partition pretrained weight matrices into head and tail subspaces according to singular values and restrict adaptation to approximate rotations within the low-energy tail subspace, preserving dominant components and reducing forgetting. For subsequent tasks, rotations are combined via weight averaging to further improve retention. Experiments on two benchmarks demonstrate reduced forgetting and superior overall performance compared to recent PECL baselines.
2026-06-08T11:08:24Z
Accepted at Interspeech 2026
Steven Vander Eeckt, Hugo Van hamme

http://arxiv.org/abs/2606.09335v1
Factors affecting ASR performance: A study using state of the art ASR models in Indic Languages
2026-06-08T11:03:25Z
ASR performance varies across languages, speakers, and recording conditions, yet systematic analysis for Indic languages remains limited. We present a large-scale study of decoded outputs from multiple open-source ASR models evaluated on diverse Indian speech datasets in zero-shot settings. We analyze linguistic, speaker-level, and acoustic factors across Hindi, Bengali, Kannada, Telugu, and Marathi. We examine correlations between WER and speaker traits such as average word length, speaking rate, and utterance duration across multiple model-dataset pairs. For Hindi, we further analyze audio factors including telephone codecs, bit depth, resampling, and background noise.
Results reveal both cross-lingual patterns and language-specific sensitivities, showing how speaker behavior and signal processing choices affect ASR robustness in real-world Indic scenarios.
2026-06-08T11:03:25Z
Agneedh Basu, Pavan Kumar J, Pranav Bhat, Sujith Pulikodan, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

http://arxiv.org/abs/2606.09317v1
A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification
2026-06-08T10:27:33Z
Spoken language identification (LID) for Indian languages is a challenging problem due to the large number of languages, significant phonetic overlap among related varieties, and the scarcity of labeled data for many low-resource languages. In this work, we present a systematic comparative study of two pre-trained speech encoders -- Whisper and FastConformer -- combined with a linear classifier for large-scale Indic LID spanning 42 languages across four linguistic families. We evaluate both encoders in frozen (linear probing) and fine-tuned settings, and compare three training objectives: cross-entropy (CE), supervised contrastive loss with cross-entropy (CE+SupCon), and hierarchical softmax (HSM). Models are trained on the Vaani dataset and evaluated in a cross-corpus setting on Vaani-Test (held-out), FLEURS, and Kathbath, providing insights into domain generalization. The frozen FastConformer encoder achieves over 90% macro accuracy on FLEURS and Kathbath without any task-specific adaptation, substantially outperforming Whisper on out-of-domain benchmarks, while fine-tuned Whisper yields stronger in-domain performance. HSM consistently outperforms CE and CE+SupCon for both encoders across all benchmarks, with the largest gains on out-of-domain test sets. CE+SupCon degrades FastConformer's cross-corpus generalization, suggesting that the contrastive objective over-specializes representations to in-domain conditions.
Per-family analysis shows that Central Indo-Aryan varieties are the hardest to discriminate, with Hindi--Urdu and the Sadri--Chhattisgarhi--Surgujia cluster being the dominant confusion pairs.
2026-06-08T10:27:33Z
Agneedh Basu, Pavan Kumar J, Sujith P, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

http://arxiv.org/abs/2606.06037v2
SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech
2026-06-08T08:49:38Z
Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.
2026-06-04T11:31:38Z
Virginia Ceccatelli, Yejin Jeon, David Ifeoluwa Adelani

http://arxiv.org/abs/2603.11669v2
SEMamba++: A General Speech Restoration Framework Leveraging Global, Local, and Periodic Spectral Patterns
2026-06-08T07:16:00Z
General speech restoration demands techniques that can interpret complex speech structures under various distortions.
While State-Space Models like SEMamba have advanced the state-of-the-art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose the Global, Local, and Periodic (GLP) module, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. Then, we design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.
2026-03-12T08:37:28Z
Accepted to Interspeech 2026 Long paper track. Project page: https://sites.google.com/view/semambapp
Yongjoon Lee, Jung-Woo Choi

http://arxiv.org/abs/2510.04593v3
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
2026-06-08T05:49:30Z
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation.
To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
2025-10-06T08:47:38Z
Accepted at Interspeech 2026
Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

http://arxiv.org/abs/2606.09050v1
MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion
2026-06-08T05:39:23Z
Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity.
Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
2026-06-08T05:39:23Z
Accepted by Interspeech 2026
Guobin Ma, Yuxuan Xia, Yuepeng Jiang, Dake Guo, Hanke Xie, Jingbin Hu, Yanbo Wang, Lei Xie, Pengcheng Zhu

http://arxiv.org/abs/2606.09048v1
BareWave: Waveform-Native Flow-Matching Text-to-Speech
2026-06-08T05:36:42Z
Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction.
Project page with audio demos is available at https://barewave.github.io/.
2026-06-08T05:36:42Z
Under Review
Wei Fan, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li, Kejiang Chen, Weiming Zhang, Nenghai Yu

http://arxiv.org/abs/2510.05478v3
AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning
2026-06-08T05:16:04Z
Large Audio Language Models (LALMs) exhibit strong capabilities in general audio understanding but remain static after deployment, limiting their adaptability to real-world data. Since supervised fine-tuning is costly, we propose AQA-TTRL, a novel framework for audio understanding that enables on-the-fly evolution via test-time reinforcement learning using only unlabeled test data. It generates pseudo-labels via majority voting and optimizes the model through reinforcement learning. To address the noise in self-generated labels, we introduce confidence weighting to adjust training signals. Furthermore, multiple-attempt sampling mitigates advantage collapse and stabilizes training. Across MMAU, MMAR, and MMSU, AQA-TTRL achieves significant average improvements of 4.42% for Qwen2.5-Omni 7B and 11.04% for the 3B model. Notably, the adapted 3B model outperforms direct inference of the unadapted 7B model, highlighting the effectiveness of test-time adaptation in audio understanding.
2025-10-07T00:39:14Z
Accepted to INTERSPEECH 2026
Haoyu Zhang, Jiaxian Guo, Dong Yang, Yusuke Iwasawa, Yutaka Matsuo

http://arxiv.org/abs/2502.16584v2
Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound
2026-06-07T13:34:59Z
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models.
While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.
2025-02-23T14:24:15Z
Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Xingjian Du, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

http://arxiv.org/abs/2606.08580v1
G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching
2026-06-07T11:28:32Z
Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module.
Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.
2026-06-07T11:28:32Z
Accepted to Interspeech 2026
Yike Zhu, Ziqian Wang, Zikai Liu, Xingchen Li, Zhuangqi Chen, Xianjun Xia, Chuanzeng Huang, Lei Xie

http://arxiv.org/abs/2601.09239v6
DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
2026-06-07T10:32:48Z
Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation.
Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/
2026-01-14T07:22:24Z
Submitted to ACL ARR 2026 May
Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

http://arxiv.org/abs/2603.12046v2
Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
2026-06-07T09:33:32Z
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
2026-03-12T15:22:27Z
Accepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AV
Umberto Cappellazzo, Stavros Petridis, Maja Pantic

http://arxiv.org/abs/2606.08524v1
Acoustic disguising: a unified framework for cloaking and holography
2026-06-07T09:03:40Z
Cloaking and holography -- usually treated as distinct problems -- are two limits of a single operation that we call acoustic disguising, realized here using immersive boundary conditions on a closed surface.
Driving the boundary with homogeneous Green's functions suppresses any incident field inside the enclosed volume and cloaks unknown objects broadband; driving it with scattering Green's functions synthesizes a holographic scatterer indistinguishable from a target for arbitrary illuminations. Combining the two, using heterogeneous Green's functions, replaces the scattering signature of one object with that of another, transforming its acoustic identity. We demonstrate the framework in three-dimensional FDTD simulations driven by impulsive Green's functions, complemented by data-driven Green's-function retrieval, establishing a direct route to real-time 3D acoustic cloaking, holography, cloning, and disguising.
2026-06-07T09:03:40Z
8 pages, 5 figures; Supplemental Material included (24 pages, 21 figures). Supplementary videos: https://jmullerresearch.ch/acoustic-disguising.html ; source code: https://github.com/Nano560/acoustic-disguising ; data and code archived at Zenodo: https://doi.org/10.5281/zenodo.20433701
Jonas Müller, Dirk-Jan van Manen