https://arxiv.org/api/N6un1M0qteGj0Zi+2VEAOGVq/d0 2026-06-22T13:51:31Z 21774 405 15 http://arxiv.org/abs/2605.25669v1 Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction 2026-05-25T10:19:16Z

Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding-refinement-reconstruction (CRR) framework that can operate at as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs a highly aggressive 640x compression/decompression encoder-decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage leverages a lightweight velocity-field estimator and CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that supports fewer iterative inference steps for the purpose of reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct waveform from the refined mel-spectrogram. Experiments conducted on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps for 16 kHz and 750 bps for 48 kHz, both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity, while incurring lower computational and model complexity.

2026-05-25T10:19:16Z Published at IEEE/ACM Transactions on Audio, Speech, and Language Processing Hui-Peng Du Yang Ai Xiao-Hang Jiang Yuan Tian Zhen-Hua Ling http://arxiv.org/abs/2605.25605v1 Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets 2026-05-25T08:58:19Z

In the past decade, numerous studies have applied deep neural networks (DNNs) to decode auditory attention (AAD) from Electroencephalogram (EEG) signals via stimulus reconstruction. However, the influence of dataset balance on the decoding performance of stimulus reconstruction-based AAD remains unexplored. In this study, three publicly available EEG-AAD datasets - KUL, DTU, and NJU cEEGrid - are used to construct both balanced and unbalanced experimental conditions. We hypothesize and demonstrate that stimulus reconstruction-based DNN decoders tend to produce overestimated decoding performance on unbalanced datasets. To address this issue, we propose a leave-one-paired-envelope-out (LOPEO) cross-validation protocol. Experimental results confirm that LOPEO effectively prevents inflated decoding accuracy on unbalanced datasets. While balanced datasets are generally preferred in experimental design, LOPEO provides a principled evaluation framework for unbalanced datasets that have already been published, filling an important gap in the field.

2026-05-25T08:58:19Z Yuanming Zhang Yayun Liang Zhibin Lin Jing Lu http://arxiv.org/abs/2605.25512v1 cSTMM: A Unified Complex Spherical Student's $t$ Mixture Model for Directional Statistics in Mask-Based Blind Speech Separation 2026-05-25T07:15:39Z

Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations on the complex unit sphere, without explicitly extracting phase and level difference features based on the plane-wave or spherical-wave assumptions. However, prior studies have mostly compared a small number of separately defined directional statistical mixture models, whereas a broader distribution family would enable a more systematic study of how density profiles affect separation performance. We propose the complex spherical Student's t mixture model (cSTMM), a directional mixture model that connects the complex angular central Gaussian mixture model (cACGMM), complex Bingham mixture model (cBMM), and complex Watson mixture model (cWMM) through the degrees-of-freedom parameter $ν$. We also derive a generalized minorization-maximization (MM) based procedure for parameter estimation. A no-restart evaluation on noise-free LibriSpeech mixtures reverberated with measured room impulse responses shows that a single development-selected value $ν^\ast=1$ achieved higher test-set mean signal-to-distortion ratio improvements (SDRi) than the cACGMM-equivalent setting $ν=M$ in all acoustic conditions, with an average condition-wise gain of 0.25dB. The experiments also numerically verify that the proposed formulation numerically recovers the cACGMM, cBMM, and cWMM cases.

2026-05-25T07:15:39Z Nobutaka Ito http://arxiv.org/abs/2605.25506v1 WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders With Residual Denoising and Sub-Modeling for GAN and Diffusion Models 2026-05-25T07:09:58Z

Most neural vocoders are limited to one type: either GAN or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, diffusion models, despite training faster than GANs, have slow CPU inference. In this paper, we introduce WaveNeXt 2, a unified ConvNeXt-based framework compatible with both GAN and diffusion vocoders. Its core innovation is residual denoising and sub-modeling, where each sub-model progressively refines the waveform. Experimental results in the multi-speaker dataset demonstrate the effectiveness of our approach: (1) GAN-WaveNeXt 2 is much faster than HiFi-GAN and WaveFit, and (2) Diff-WaveNeXt 2 also delivers much faster inference and competitive synthesis quality compared with FastDiff with 4 steps. The Diff-WaveNeXt 2 is very training-efficient, training in only 32 hours, making it ideal for resource-constrained applications.

2026-05-25T07:09:58Z ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Proc. ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 17012-17016, 2026 Wangzixi Zhou Takuma Okamoto Yamato Ohtani Sakriani Sakti Hisashi Kawai 10.1109/ICASSP55912.2026.11461371 http://arxiv.org/abs/2605.25504v1 Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control 2026-05-25T07:08:58Z

While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-quality, fine-grained annotations, which restricts a model's ability to precisely control NV generation. To address this limitation, we propose a novel approach for fine-grained non-verbal expression synthesis. We curate and reprocess female NV utterances from the EARS corpus, develop a new annotation scheme using tags to encode NV types, frequencies, and durations, and build an emotional TTS benchmark to demonstrate its effectiveness. Our evaluation shows that while our NV approach leads to minor trade-offs in perceived naturalness, it significantly improves expressiveness (eMOS 4.20) and emotional recognition accuracy (78.8%). Emotion-specific analysis further reveals that NV cues are highly effective for high-arousal emotions like happy (82.5%) and fear (82.7%), and almost perfectly convey sadness (98.3%).

2026-05-25T07:08:58Z 2025 28th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) Proc. 2025 28th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pp. 1-6, 2025 Wangzixi Zhou Bagus Tris Atmaja Sakriani Sakti 10.1109/O-COCOSDA68185.2025.11385230 http://arxiv.org/abs/2605.25498v1 Subspace Track-before-Detect for Passive Multi-Target Tracking with Unknown Emitted Signals 2026-05-25T06:57:38Z

Passive multi-target tracking (MTT) aims to infer the kinematic states of multiple targets from noisy sensor data in which contributions from unknown target-emitted signals are superposed. Track-before-detect (TBD) methods improve robustness to noise by operating directly on raw sensor data without relying on a preceding detection stage. However, many existing TBD methods assume that each target's contribution to the sensor data is determined solely by its kinematic state. This assumption limits their applicability to passive MTT, where each target's contribution depends on both its kinematic state and the unknown emitted signal. We propose subspace TBD, a passive multi-target TBD method based on a likelihood derived from the complex Bingham distribution that does not require explicit modeling or estimation of the unknown emitted signals. In a particle filter (PF) framework, each multi-target hypothesis is mapped to a low-dimensional subspace spanned by the steering vectors corresponding to the hypothesized target states. The likelihood is then used to evaluate the alignment of the normalized multichannel sensor data with this subspace. Preliminary experiments with simulated acoustic measurements and a given target activity pattern show that the proposed method can track two moving targets emitting unknown signals at a signal-to-noise ratio (SNR) of -10dB, whereas a conventional TBD baseline yields substantially larger tracking errors.

2026-05-25T06:57:38Z Nobutaka Ito Yoshiaki Bando http://arxiv.org/abs/2605.25404v1 Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems 2026-05-25T03:57:38Z

Cascaded Automatic Speech Recognition -- Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.

2026-05-25T03:57:38Z Yizhou Peng Ziyang Ma Changsong Liu Yi-Wen Chao Xie Chen Eng Siong Chng http://arxiv.org/abs/2605.20414v2 PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding 2026-05-25T03:05:12Z

Long-form audio understanding poses significant challenges for large audio language models (LALMs) due to the extreme length of audio sequences and the need to reason over heterogeneous acoustic cues distributed over time, such as speech content, speaker identity, emotion, and sound events. To address these challenges, we propose \textbf{PlanRAG-Audio}, a planning-based retrieval-augmented generation framework for scalable long-form audio understanding. Rather than having audio LALMs process entire recordings directly, PlanRAG-Audio explicitly plans which modalities and temporal spans are required for a given query, and retrieves only query-relevant information from a structured text and audio database. This retrieval planning enables effective reasoning over complex, cross-domain audio queries while substantially reducing the input length passed to the large language models. Experiments across a wide range of speech/audio retrieval demonstrate that PlanRAG-Audio improves reasoning accuracy and stabilizes performance as audio duration increases by decoupling inference cost from raw audio length.

2026-05-19T19:10:30Z Accepted to Findings of ACL 2026 Masao Someki Chien-yu Huang Siddhant Arora Samuele Cornell Markus Müller Nathan Susanj Rupak V Swaminathan Grant P Strimel Jing Liu Shinji Watanabe http://arxiv.org/abs/2605.09568v3 RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations 2026-05-25T00:50:21Z

RADAR Challenge 2026 is an APSIPA Grand Challenge on Robust Audio Deepfake Recognition under Media Transformations, designed to simulate realistic media conditions in real-world audio distribution pipelines, including compression, resampling, noise, and reverberation. It consists of two phases: an English development phase with labeled data for analysis and paper writing, and a multilingual evaluation phase containing more than 100,000 utterances in English, Singapore English, Mandarin Chinese, Taiwanese Mandarin, Japanese, and Vietnamese. Systems are evaluated using equal error rate (EER) for binary real/fake classification. This paper describes the challenge task, the construction of the data set, the evaluation protocol, and the overall results. During the challenge, 33 teams submitted to the development phase and 22 teams submitted to the final evaluation phase. The reported results highlight the remaining challenges of robust audio deepfake detection under multilingual and media-transformed conditions.

2026-05-10T14:29:35Z Submitted to APSIPA 2026 Hieu-Thi Luong Xuechen Liu Ivan Kukanov Zheng Xin Chai Kong Aik Lee http://arxiv.org/abs/2602.00443v2 RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models 2026-05-24T18:26:14Z

Modern voice cloning, also known as zero-shot text-to-speech (TTS), can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practice, these systems often face noisy reference audio, imperfect text prompts, multilingual and long-form generation, post-processing, and adversarial perturbations, all of which can weaken robustness. Despite rapid progress in codec-token language models and diffusion-based TTS, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive dataset and benchmark for evaluating robustness in voice cloning. RVCBench provides task-aligned tests covering controlled text-audio pairing, multilingual and long-form scenarios, expressive prompts, post-processing conditions, and passive or proactive audio perturbations. Across 18 robustness evaluations, 225 speakers, and 14,370 utterances, RVCBench supports unified evaluation of input sensitivity, generation stability, output resilience, perturbation robustness, speaker similarity, and deepfake detectability. We evaluate 18 representative open-source voice cloning models and reveal systematic vulnerabilities in content consistency, speaker similarity, long-form stability, post-processing resilience, adversarial robustness, and detector-facing separability. We release the code and dataset to support reproducible evaluation and future research on robust voice cloning, speech synthesis, and audio generation. Code: https://github.com/Nanboy-Ronan/RVCBench. Dataset: https://huggingface.co/datasets/Nanboy/RVCBench.

2026-01-31T01:38:55Z 65 pages, 10 figures Ruinan Jin Xinting Liao Hanlin Yu Deval Pandya Xiaoxiao Li http://arxiv.org/abs/2510.02171v3 Go witheFlow: Real-time Emotion Driven Audio Effects Modulation 2026-05-24T17:05:21Z

Music performance is a distinctly human activity, intrinsically linked to the performer's ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.

2025-10-02T16:23:47Z Accepted at NeurIPS Creative AI Track 2025: Humanity Edmund Dervakos Spyridon Kantarelis Vassilis Lyberatos Jason Liartis Giorgos Stamou http://arxiv.org/abs/2604.19151v2 Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India 2026-05-24T13:47:47Z

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard driven evaluation that encourages dataset specific overfitting. In addition, strict single reference WER penalizes natural spelling variation in Indian languages, including non standardized spellings of code-mixed English origin words. To address these limitations, we introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real world Indic ASR systems.

2026-04-21T07:02:01Z 6 pages, 4 figures Kaushal Bhogale Manas Dhir Amritansh Walecha Manmeet Kaur Vanshika Chhabra Aaditya Pareek Hanuman Sidh Mahima Manik Sagar Jain Bhaskar Singh Utkarsh Singh Tahir Javed Shobhit Banga Mitesh M. Khapra http://arxiv.org/abs/2605.24825v1 Time Segmented Beamforming via Dynamic Programming: Theory and Implementation 2026-05-24T02:35:00Z

In dynamic acoustic environments with time-varying interferers, effective beamforming requires identifying stationary regions over time. The Capon beamformer, a whitened matched filter constrained to maintain unity gain in the desired direction, theoretically relies on the instantaneous ensemble covariance matrix. Practical implementations rely on the batch Capon (or Sample Matrix Inversion), which estimates the sample covariance matrix (SCM) by averaging over a block of snapshots. This practical approach implicitly assumes that the data within the batch window is stationary and can be coherently combined. In non-stationary settings, a batch approach that averages over fixed or excessively long windows fails, as moving interferers smear the SCM and degrade the beamformer's nulling capabilities. To address this, this paper introduces a temporally segmented distortionless response beamformer. Inspired by the segmented least squares method, which fits piecewise polynomials to data while penalizing excessive segmentation to prevent overfitting, the framework extends practical Capon beamforming by incorporating data-driven temporal segmentation. This formulation minimizes output power while dynamically adapting the SCM estimation windows to local stationarity, offering a principled approach to tracking time-varying interferers.

2026-05-24T02:35:00Z 16 pages, 17 figures, Beamforming New Approach Regret Bounds Manan Mittal Ryan M. Corey Diego Cuji John R. Buck Andrew C. Singer http://arxiv.org/abs/2605.24806v1 Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models 2026-05-24T01:39:18Z

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

2026-05-24T01:39:18Z 6 pages Muhammad Ashad Kabir Sirajam Munira http://arxiv.org/abs/2605.24618v1 FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations 2026-05-23T15:01:28Z

Recent advances in zero-shot text-to-speech (TTS) have enabled accurate imitation of reference speech in terms of both speaking style and speaker timbre. However, achieving disentangled control over these aspects from separate references remains a challenging task. Several studies have proposed disentangled speech representations that decompose speech into interpretable attributes (e.g., timbre, prosody, and content), providing a promising foundation for TTS with attribute control from separate references. Yet, how to effectively integrate such representations into TTS systems to achieve independent and precise control remains underexplored. In this paper, we present FC-TTS, a zero-shot TTS framework that enables disentangled control of style and timbre by conditioning on two distinct reference utterances. Unlike existing systems that inherit limitations from those pre-trained disentangled representations, FC-TTS introduces key design strategies, including architectural choices, training framework, and auxiliary training objectives, which improve the reliability of attribute separation and dual-reference control. Experiments show that FC-TTS achieves high-fidelity synthesis and competitive zero-shot naturalness, while uniquely supporting consistent and independent manipulation of style and timbre. Audio samples are available at https://qualcomm-ai-research.github.io/fc-tts

2026-05-23T15:01:28Z Accepted to ACL 2026 (Main Conference). 20 pages, 8 figures, 7 tables. Demo page: https://qualcomm-ai-research.github.io/fc-tts Yoonhyung Lee Hyunsin Park Jinhwan Park Jinkyu Lee