http://arxiv.org/abs/2603.14889v2 SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness 2026-05-11T03:35:56Z

The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://github.com/MM-Speech/SDiaReward/.
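As an illustration of the pairwise preference supervision and the pairwise preference accuracy mentioned in this abstract, here is a minimal sketch. It assumes a Bradley-Terry style logistic loss on scalar episode scores; the abstract does not state the exact objective, and the function names are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Assumed Bradley-Terry style loss: -log(sigmoid(preferred - rejected))."""
    margin = score_preferred - score_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs):
    """Fraction of (preferred, rejected) pairs where the preferred score is higher."""
    correct = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return correct / len(score_pairs)
```

The loss shrinks as the preferred episode is scored further above the rejected one, and the accuracy counts how often the reward model ranks a pair in the annotated order.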

2026-03-16T06:39:30Z Accepted to ACL 2026 Main Conference Jingyu Lu Yuhan Wang Fan Zhuo Xize Cheng Changhao Pan Xueyi Pu Yifu Chen Chenyuhao Wen Tianle Liang Zhou Zhao http://arxiv.org/abs/2601.12248v3 AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering 2026-05-10T16:47:20Z

Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.

2026-01-18T03:55:28Z Accepted to ICASSP 2026 (Oral). Project Website: https://github.com/kuan2jiu99/aqua-bench Chun-Yi Kuan Hung-yi Lee http://arxiv.org/abs/2605.09627v1 Single-Microphone Audio Point Source Discriminative Localization From Reverberation Late Tail Estimation 2026-05-10T16:16:21Z

Location information can be a valuable signal for audio segmentation tasks, especially as a complement to methods focusing on the content or qualities of the sources. Though audio source localization is typically performed using observations of the signal captured by multiple microphones in space, information about a source's location is also captured by a single microphone through its arrival time and spectral amplitude, provided the source's emitted signal is known. Since reverberation originates from the audio sources in a room, it accordingly contains some information about the emitted audio signals. The late-tail part of reverberation is relatively invariant to the local source and microphone geometry, depending primarily on the room itself, and thus can provide the necessary reference information about audio signals that depends minimally on their location. In this work, we leverage the robust late-tail estimation of Weighted Prediction Error (WPE) dereverberation within a probabilistic framework to estimate the likelihood that two audio signals collected in the same room originated from the same location. We demonstrate the effectiveness of our approach on the speaker diarization task in both simulated and real environments.

2026-05-10T16:16:21Z Published at IEEE ICASSP 2026 Matthew Maciejewski 10.1109/ICASSP55912.2026.11461520 http://arxiv.org/abs/2405.09570v2 FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-Time 2026-05-10T10:01:55Z

Heart murmurs are abnormal sounds caused by turbulent blood flow in the heart. Several diagnostic methods are available to detect heart murmurs and their severity, including cardiac auscultation, echocardiography, and phonocardiography (PCG). However, these methods have limitations, including the need for extensive training among healthcare providers, the cost and accessibility of echocardiography, and noise interference during PCG data processing. This study proposes an end-to-end real-time heart murmur detection approach using traditional and depthwise separable convolutional networks. We applied a Butterworth filter and Continuous Wavelet Transform (CWT) to eliminate noise and extract meaningful features from the PCG data. The proposed network consists of three parts: a Squeeze net that generates a compressed data representation, a Bottleneck layer that minimizes computational complexity using depthwise-separable convolutions, and an Expansion net that up-samples the data to capture fine details. We evaluated our model on the publicly available CirCor pediatric heart sound dataset. Using only $\sim$5.4k parameters, we achieved an accuracy of 85%, a sensitivity of 85%, and a specificity of 92%, successfully outperforming several larger models. Furthermore, we converted our network into a TinyML format and tested it on two resource-constrained devices, achieving an average real-time inference accuracy of 91% on a Raspberry Pi 4B and 80% on an Android smartphone. The proposed lightweight model offers a robust deep learning framework for accurate, real-time heart murmur detection, showing strong promise for accessible medical diagnostics in limited-resource environments. The code is publicly available at https://github.com/jobayer/FunnelNet.

2024-05-10T03:12:17Z Md Jobayer Md. Mehedi Hasan Shawon Md Zakir Hossain Shreya Ghosh Imre Rudas Tom Gedeon Md Rakibul Hasan http://arxiv.org/abs/2605.09413v1 Evaluating the Expressive Appropriateness of Speech in Rich Contexts 2026-05-10T08:28:58Z

Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.

2026-05-10T08:28:58Z 19 pages, 6 figures Tianrui Wang Ziyang Ma Yizhou Peng Haoyu Wang Zhikang Niu Zikang Huang Yihao Wu Yi-Wen Chao Yu Jiang Yuheng Lu Guanrou Yang Xuanchen Li Hexin Liu Chunyu Qiang Cheng Gong Yifan Yang Tianchi Liu Junyu Wang Nana Hou Meng Ge Fuming You Wei Yang Zhongqian Sun Haifeng Hu Xiaobao Wang Eng Siong Chng Xie Chen Longbiao Wang Jianwu Dang http://arxiv.org/abs/2605.09386v1 Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech 2026-05-10T07:24:55Z

Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject

2026-05-10T07:24:55Z Under Review Dong Yang Yiyi Cai Haoyu Zhang Yuki Saito Hiroshi Saruwatari http://arxiv.org/abs/2605.08961v1 Dolphin-CN-Dialect: Where Chinese Dialects Matter 2026-05-09T13:56:54Z

We present Dolphin-CN-Dialect, a streaming-capable ASR model with a focus on Chinese and dialect-rich scenarios. Compared to the previous version, Dolphin-CN-Dialect introduces substantial improvements in data processing, tokenization, training stability, and data sampling strategies. To address the challenges of highly imbalanced dialect data, we propose a temperature-based sampling strategy that effectively balances standard Mandarin and low-resource dialects, leading to significant gains in dialect recognition performance. In addition, we redesign the tokenizer to better align with linguistic characteristics, adopting character-level modeling for Chinese and subword modeling for English, while introducing extensible dialect tokens. Experimental results show that Dolphin-CN-Dialect achieves improvement in dialect recognition accuracy and CER reduction compared to Dolphin. Furthermore, Dolphin-CN-Dialect reaches competitive performance with recent SOTA open-source ASR models, while maintaining a significantly smaller model size. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling a practical balance between latency and accuracy. It also provides flexible customization through hotword support and efficient deployment optimized for specialized hardware. These improvements make Dolphin-CN-Dialect a strong and practical solution for real-world multi-dialect ASR applications.
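The temperature-based sampling strategy can be illustrated with a common formulation in which each subset is sampled with probability proportional to its size raised to the power 1/T. This is a sketch under that assumption; the abstract does not give the exact formula, and the dataset names and sizes below are made up for illustration.

```python
def temperature_sampling_probs(counts, temperature):
    """Sampling probability per subset, proportional to count ** (1 / temperature).

    temperature = 1 reproduces the natural data proportions;
    larger values flatten the distribution toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {name: n ** (1.0 / temperature) for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical hours of data per subset, for illustration only.
example_counts = {"mandarin": 9000, "dialect": 1000}
natural = temperature_sampling_probs(example_counts, temperature=1.0)
flattened = temperature_sampling_probs(example_counts, temperature=2.0)
```

With these made-up sizes, raising the temperature from 1 to 2 lifts the low-resource subset from 10% to 25% of the sampled batches.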

2026-05-09T13:56:54Z Yangyang Meng Huihang Zhong Guodong Lin Guanbo Wang Hu Du Zhiming Shao Yukai Huang Ke Li Wei-Qiang Zhang http://arxiv.org/abs/2605.05611v2 X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning 2026-05-09T04:00:12Z

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.

2026-05-07T02:57:53Z 16 pages, 4 figures, 9 tables Rixi Xu Qingyu Liu Haitao Li Yushen Chen Zhikang Niu Yunting Yang Jian Zhao Ke Li Berrak Sisman Qinyuan Cheng Xipeng Qiu Kai Yu Xie Chen http://arxiv.org/abs/2605.08608v1 Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation 2026-05-09T02:07:57Z

Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/. The complete source code will be released after the manuscript is accepted.

2026-05-09T02:07:57Z Zheng Wang Xiaobin Rong Hang Su Tianyi Tan Junnan Wu Lichun Fan Zhenbo Luo Jian Luan Jing Lu http://arxiv.org/abs/2605.08431v1 Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces 2026-05-08T19:53:31Z

We introduce Latent Secret Spin (LSS), a blind speech watermarking method based on geometric operations in codec latent space. By applying orthogonal rotations to principal components, LSS induces imperceptible but detectable covariance signatures according to a pseudo-random watermarking schedule. The scheme generalises across datasets and preserves perceptual quality. Unlike some learned neural watermarking schemes, it does not require neural network training, is resistant to common signal manipulations, and is flexible in payload size. Analyses show that structured latent-space watermarking is a promising and interpretable alternative to existing approaches.

2026-05-08T19:53:31Z Emma Coletta Massimiliano Todisco Michele Panariello Antonio Faonio Nicholas Evans http://arxiv.org/abs/2605.08075v1 Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping 2026-05-08T17:56:19Z

Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions. In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings made during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic melodic and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli. In the first stage, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we passed the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses, which were then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We report results from a proof-of-concept implementation in which all evaluations are performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can be made directly applicable to realistic brain-computer interface scenarios.

2026-05-08T17:56:19Z Maryam Maghsoudi Shihab Shamma http://arxiv.org/abs/2605.07694v1 Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation 2026-05-08T13:03:41Z

Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on the recording conditions. In this work, we decompose simulated RIRs into four variants (full, direct-only, no-late, and no-early) using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation. We define four calibration scenarios, from fully calibrated (synchronised capture, known source level) to fully uncalibrated (arbitrary onset, unknown level), and evaluate all combinations on a matched dataset. Results show that without time calibration, mean absolute error (MAE) increases to $1.29$ m and the model extracts reverberation-based cues, with early reflections emerging as the most informative component. Further analysis against DRR, $C_{50}$, and $T_{60}$ confirms that estimation accuracy improves with stronger early energy and degrades in highly reverberant environments. When time calibration is available, the model achieves a MAE of $0.14$ m by extracting the propagation delay alone, regardless of the RIR content.
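The four RIR variants can be sketched as a split of the impulse response at two sample indices. This is an assumption-laden illustration: the abstract only says the mixing time separates early reflections from late reverberation, so the end of the direct-path segment (`direct_end`) and the exact way each variant is assembled are hypothetical choices here.

```python
def decompose_rir(rir, direct_end, mixing_idx):
    """Build full, direct-only, no-late and no-early variants of an RIR.

    Assumed segmentation (sample indices):
      direct path        : [0, direct_end)
      early reflections  : [direct_end, mixing_idx)
      late reverberation : [mixing_idx, len(rir))
    """
    if not 0 <= direct_end <= mixing_idx <= len(rir):
        raise ValueError("expected 0 <= direct_end <= mixing_idx <= len(rir)")

    def keep(start, stop):
        # Zero out every sample outside [start, stop).
        return [x if start <= i < stop else 0.0 for i, x in enumerate(rir)]

    direct = keep(0, direct_end)
    late = keep(mixing_idx, len(rir))
    return {
        "full": list(rir),
        "direct_only": direct,
        "no_late": keep(0, mixing_idx),
        "no_early": [d + l for d, l in zip(direct, late)],
    }


# Toy impulse response, for illustration only.
variants = decompose_rir([1.0, 0.5, 0.4, 0.2, 0.1], direct_end=1, mixing_idx=3)
```

All variants keep the original length, so each can be convolved with the same dry signal and compared on matched conditions.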

2026-05-08T13:03:41Z Submitted to IWAENC 2026 Michael Neri Archontis Politis Tuomas Virtanen http://arxiv.org/abs/2408.07522v2 Optimising MFCC parameters for the automatic detection of respiratory diseases 2026-05-08T12:43:37Z

Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCCs) are widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. With the optimized combination, the SVM model achieves accuracies of 81.1%, 80.6%, and 71.7% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively, improvements of 19.6%, 16.1%, and 14.9% over the worst combination.
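To make the role of frame length and hop length concrete, the sketch below counts how many analysis frames a recording yields for a given setting. It assumes no padding at the signal edges and a 16 kHz sample rate; both are illustrative assumptions, not values taken from the study.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate / 1000.0))


def num_frames(num_samples, frame_ms, hop_ms, sample_rate):
    """Number of full frames that fit in the signal, assuming no padding."""
    frame_len = ms_to_samples(frame_ms, sample_rate)
    hop_len = ms_to_samples(hop_ms, sample_rate)
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop_len


# One second of audio at an assumed 16 kHz sample rate.
short_frames = num_frames(16000, frame_ms=50, hop_ms=25, sample_rate=16000)
long_frames = num_frames(16000, frame_ms=500, hop_ms=250, sample_rate=16000)
```

Longer frames and larger hops trade temporal resolution for frequency resolution and leave fewer feature vectors per recording, which is one way these parameters can change what the downstream classifier sees.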

2024-08-14T12:56:17Z Yuyang Yan Sami O. Simons Loes van Bemmel Lauren Reinders Frits M. E. Franssen Visara Urovi 10.1016/j.apacoust.2024.110299 http://arxiv.org/abs/2603.03096v2 Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features 2026-05-08T09:23:21Z

How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence. We further show that characteristics can be changed by manipulating the corresponding dimensions.
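The analysis described above, PCA on utterance-averaged representations, can be sketched as follows. This is a minimal illustration using NumPy's SVD; the feature arrays are toy stand-ins for SSL features, and the function names are hypothetical.

```python
import numpy as np


def utterance_average(frame_features):
    """Average frame-level features (frames x dims) into one utterance vector."""
    return np.asarray(frame_features, dtype=float).mean(axis=0)


def principal_dimensions(utterance_vectors):
    """Return principal directions (rows) and their explained-variance ratios."""
    X = np.asarray(utterance_vectors, dtype=float)
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return vt, explained


# Toy utterance vectors whose variance lies almost entirely along dimension 0.
toy_vectors = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.2], [6.0, -0.2]]
directions, explained = principal_dimensions(toy_vectors)
projection = (np.asarray(toy_vectors) - np.mean(toy_vectors, axis=0)) @ directions[0]
```

Projecting each utterance onto a single principal direction, as in `projection`, gives one scalar per utterance that can then be correlated with a measured characteristic such as pitch.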

2026-03-03T15:33:39Z 5 pages, 7 figures, submitted to IEEE Signal Processing Letters Kyle Janse van Rensburg Benjamin van Niekerk Herman Kamper http://arxiv.org/abs/2605.07291v1 Evaluating voice anonymisation using similarity rank disclosure 2026-05-08T06:01:28Z

The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Performance estimates dependent on the classifier and operating point provide an incomplete or even misleading characterisation of privacy risk. We investigate the use of similarity rank disclosure (SRD), an information-theoretic metric, which operates on feature representations rather than classifier decisions, providing a threshold-independent assessment of privacy and analysis of both average and worst-case disclosure. We report its application to speaker embeddings, fundamental frequency, and phone embeddings using 2024 VoicePrivacy Challenge systems. The SRD reveals privacy leaks and system-specific weaknesses missed by EER-based evaluation. Findings highlight the merit of representation-level metrics and demonstrate the potential of SRD as a flexible and interpretable tool for the evaluation of voice anonymisation.

2026-05-08T06:01:28Z Shilpa Chandra Matteo Pettenò Nicholas Evans Michele Panariello Massimiliano Todisco Tom Bäckström Dorothea Kolossa Rainer Martin Themos Stafylakis Nicolas Gengembre