https://arxiv.org/api/dd0hO5m4M9zlyAfxeTgM4QfEPX0
Updated: 2026-06-22T22:48:53Z

http://arxiv.org/abs/2603.14889v2
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Updated: 2026-05-11T03:35:56Z
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://github.com/MM-Speech/SDiaReward/.
Published: 2026-03-16T06:39:30Z
Comment: Accepted to ACL 2026 Main Conference
Authors: Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu, Yifu Chen, Chenyuhao Wen, Tianle Liang, Zhou Zhao

http://arxiv.org/abs/2601.12248v3
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
Updated: 2026-05-10T16:47:20Z
Recent advances in audio-aware large language models have shown strong performance on audio question answering.
However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.
Published: 2026-01-18T03:55:28Z
Comment: Accepted to ICASSP 2026 (Oral). Project Website: https://github.com/kuan2jiu99/aqua-bench
Authors: Chun-Yi Kuan, Hung-yi Lee

http://arxiv.org/abs/2605.09627v1
Single-Microphone Audio Point Source Discriminative Localization From Reverberation Late Tail Estimation
Updated: 2026-05-10T16:16:21Z
Location information can be a valuable signal for audio segmentation tasks, especially as a complement to methods focusing on the content or qualities of the sources. Though audio source localization is typically performed using the observations of the signal captured by multiple microphones in space, information about a source's location is captured by a single microphone through its arrival time and spectral amplitude--given the source's emitted signal is known. Since reverberation originates from the audio sources in a room, it accordingly contains some information about the emitted audio signals.
The late-tail part of reverberation is relatively invariant to the local source and microphone geometry, depending primarily on only the room itself, and thus can provide the necessary reference information about audio signals that depends minimally on their location. In this work, we leverage the robust late-tail estimation of Weighted Prediction Error (WPE) dereverberation within a probabilistic framework to estimate the likelihood of two audio signals collected in the same room as having originated from the same location. We demonstrate the effectiveness of our approach on the speaker diarization task in both simulated and real environments.
Published: 2026-05-10T16:16:21Z
Comment: Published at IEEE ICASSP 2026
Authors: Matthew Maciejewski
DOI: 10.1109/ICASSP55912.2026.11461520

http://arxiv.org/abs/2405.09570v2
FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-Time
Updated: 2026-05-10T10:01:55Z
Heart murmurs are abnormal sounds caused by turbulent blood flow in the heart. Several diagnostic methods are available to detect heart murmurs and their severity, including cardiac auscultation, echocardiography, and phonocardiography (PCG). However, these methods have limitations, including the need for extensive training among healthcare providers, the cost and accessibility of echocardiography, and noise interference during PCG data processing. This study proposes an end-to-end real-time heart murmur detection approach using traditional and depthwise separable convolutional networks. We applied a Butterworth filter and Continuous Wavelet Transform (CWT) to eliminate noise and extract meaningful features from the PCG data. The proposed network consists of three parts: a Squeeze net that generates a compressed data representation, a Bottleneck layer that minimizes computational complexity using depthwise-separable convolutions, and an Expansion net that up-samples the data to capture fine details.
We evaluated our model on the publicly available CirCor pediatric heart sound dataset. Using only $\sim$5.4k parameters, we achieved an accuracy of 85%, a sensitivity of 85%, and a specificity of 92%, successfully outperforming several larger models. Furthermore, we converted our network into a TinyML format and tested it on two resource-constrained devices, achieving an average real-time inference accuracy of 91% on a Raspberry Pi 4B and 80% on an Android smartphone. The proposed lightweight model offers a robust deep learning framework for accurate, real-time heart murmur detection, showing strong promise for accessible medical diagnostics in limited-resource environments. The code is publicly available at https://github.com/jobayer/FunnelNet.
Published: 2024-05-10T03:12:17Z
Authors: Md Jobayer, Md. Mehedi Hasan Shawon, Md Zakir Hossain, Shreya Ghosh, Imre Rudas, Tom Gedeon, Md Rakibul Hasan

http://arxiv.org/abs/2605.09413v1
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Updated: 2026-05-10T08:28:58Z
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness.
We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
Published: 2026-05-10T08:28:58Z
Comment: 19 pages, 6 figures
Authors: Tianrui Wang, Ziyang Ma, Yizhou Peng, Haoyu Wang, Zhikang Niu, Zikang Huang, Yihao Wu, Yi-Wen Chao, Yu Jiang, Yuheng Lu, Guanrou Yang, Xuanchen Li, Hexin Liu, Chunyu Qiang, Cheng Gong, Yifan Yang, Tianchi Liu, Junyu Wang, Nana Hou, Meng Ge, Fuming You, Wei Yang, Zhongqian Sun, Haifeng Hu, Xiaobao Wang, Eng Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang

http://arxiv.org/abs/2605.09386v1
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
Updated: 2026-05-10T07:24:55Z
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines.
Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
Published: 2026-05-10T07:24:55Z
Comment: Under Review
Authors: Dong Yang, Yiyi Cai, Haoyu Zhang, Yuki Saito, Hiroshi Saruwatari

http://arxiv.org/abs/2605.08961v1
Dolphin-CN-Dialect: Where Chinese Dialects Matter
Updated: 2026-05-09T13:56:54Z
We present Dolphin-CN-Dialect, a streaming-capable ASR model with a focus on Chinese and dialect-rich scenarios. Compared to the previous version, Dolphin-CN-Dialect introduces substantial improvements in data processing, tokenization, training stability, and data sampling strategies. To address the challenges of highly imbalanced dialect data, we propose a temperature-based sampling strategy that effectively balances standard Mandarin and low-resource dialects, leading to significant gains in dialect recognition performance. In addition, we redesign the tokenizer to better align with linguistic characteristics, adopting character-level modeling for Chinese and subword modeling for English, while introducing extensible dialect tokens. Experimental results show that Dolphin-CN-Dialect achieves improvement in dialect recognition accuracy and CER reduction compared to Dolphin. Furthermore, Dolphin-CN-Dialect reaches competitive performance with recent SOTA open-source ASR models, while maintaining a significantly smaller model size. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling a practical balance between latency and accuracy. It also provides flexible customization through hotword support and efficient deployment optimized for specialized hardware.
These improvements make Dolphin-CN-Dialect a strong and practical solution for real-world multi-dialect ASR applications.
Published: 2026-05-09T13:56:54Z
Authors: Yangyang Meng, Huihang Zhong, Guodong Lin, Guanbo Wang, Hu Du, Zhiming Shao, Yukai Huang, Ke Li, Wei-Qiang Zhang

http://arxiv.org/abs/2605.05611v2
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
Updated: 2026-05-09T04:00:12Z
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS.
To facilitate research transparency and community advancement, we open-source all related resources.
Published: 2026-05-07T02:57:53Z
Comment: 16 pages, 4 figures, 9 tables
Authors: Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen

http://arxiv.org/abs/2605.08608v1
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
Updated: 2026-05-09T02:07:57Z
Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-invariant acoustic-semantic distillation framework for reducing linguistic hallucination in LM-based SE. The proposed method learns a noise-invariant conditioning encoder from noisy speech by jointly distilling two complementary clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency. The resulting noise-invariant acoustic-semantic representations are used to condition a decoder-only autoregressive language model, which predicts clean acoustic tokens that are decoded into enhanced speech. To support high-quality generation, we further employ a high-fidelity codec built on learnable weighted WavLM layer representations as the discrete acoustic interface. By improving the reliability of conditioning under adverse conditions, the proposed framework substantially reduces hallucination and improves content faithfulness. Experiments show that the proposed method consistently outperforms prior LM-based speech enhancement baselines on linguistic consistency metrics, with especially clear gains under low-SNR and reverberant conditions, while maintaining competitive perceptual quality. Audio samples are available at https://max1wz.github.io/L3-SE-Demo-Page/.
The complete source code will be released after the manuscript is accepted.
Published: 2026-05-09T02:07:57Z
Authors: Zheng Wang, Xiaobin Rong, Hang Su, Tianyi Tan, Junnan Wu, Lichun Fan, Zhenbo Luo, Jian Luan, Jing Lu

http://arxiv.org/abs/2605.08431v1
Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces
Updated: 2026-05-08T19:53:31Z
We introduce Latent Secret Spin (LSS), a blind speech watermarking method based on geometric operations in codec latent space. Based upon orthogonal rotations to principal components, LSS induces imperceptible but detectable covariance signatures according to a pseudo-random watermarking schedule. The scheme generalises across datasets, preserves perceptual quality and, unlike some learned, neural watermarking schemes, it does not require neural network training, is resistant to common signal manipulations and is flexible to payload size. Analyses show that structured latent-space watermarking is a promising and interpretable alternative to existing approaches.
Published: 2026-05-08T19:53:31Z
Authors: Emma Coletta, Massimiliano Todisco, Michele Panariello, Antonio Faonio, Nicholas Evans

http://arxiv.org/abs/2605.08075v1
Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
Updated: 2026-05-08T17:56:19Z
Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions. In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic melodic and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli.
First, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we process the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses that are then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We report the results of a proof-of-concept implementation for decoding imagined speech, in which all evaluations are performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can directly be made applicable to realistic brain-computer interface scenarios.
Published: 2026-05-08T17:56:19Z
Authors: Maryam Maghsoudi, Shihab Shamma

http://arxiv.org/abs/2605.07694v1
Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
Updated: 2026-05-08T13:03:41Z
Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on the recording conditions. In this work, we decompose simulated RIRs into four variants (full, direct-only, no-late, and no-early) using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation.
We define four calibration scenarios, from fully calibrated (synchronised capture, known source level) to fully uncalibrated (arbitrary onset, unknown level), and evaluate all combinations on a matched dataset. Results show that without time calibration, mean absolute error (MAE) increases to $1.29$ m and the model extracts reverberation-based cues, with early reflections emerging as the most informative component. Further analysis against DRR, $C_{50}$, and $T_{60}$ confirms that estimation accuracy improves with stronger early energy and degrades in highly reverberant environments. When time calibration is available, the model achieves a MAE of $0.14$ m by extracting the propagation delay alone, regardless of the RIR content.
Published: 2026-05-08T13:03:41Z
Comment: Submitted to IWAENC 2026
Authors: Michael Neri, Archontis Politis, Tuomas Virtanen

http://arxiv.org/abs/2408.07522v2
Optimising MFCC parameters for the automatic detection of respiratory diseases
Updated: 2026-05-08T12:43:37Z
Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCC) is widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy.
Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. Compared to the worst combination, the SVM model achieves an accuracy of 81.1%, 80.6%, and 71.7%, with improvements of 19.6%, 16.10%, and 14.90% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively.
Published: 2024-08-14T12:56:17Z
Authors: Yuyang Yan, Sami O. Simons, Loes van Bemmel, Lauren Reinders, Frits M. E. Franssen, Visara Urovi
DOI: 10.1016/j.apacoust.2024.110299

http://arxiv.org/abs/2603.03096v2
Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
Updated: 2026-05-08T09:23:21Z
How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence.
We further show that characteristics can be changed by manipulating the corresponding dimensions.
Published: 2026-03-03T15:33:39Z
Comment: 5 pages, 7 figures, submitted to IEEE Signal Processing Letters
Authors: Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

http://arxiv.org/abs/2605.07291v1
Evaluating voice anonymisation using similarity rank disclosure
Updated: 2026-05-08T06:01:28Z
The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Performance estimates dependent on the classifier and operating point provide an incomplete or even misleading characterisation of privacy risk. We investigate the use of similarity rank disclosure (SRD), an information-theoretic metric, which operates on feature representations rather than classifier decisions, providing a threshold-independent assessment of privacy and analysis of both average and worst-case disclosure. We report its application to speaker embeddings, fundamental frequency, and phone embeddings using 2024 VoicePrivacy Challenge systems. The SRD reveals privacy leaks and system-specific weaknesses missed by EER-based evaluation. Findings highlight the merit of representation-level metrics and demonstrate the potential of SRD as a flexible and interpretable tool for the evaluation of voice anonymisation.
Published: 2026-05-08T06:01:28Z
Authors: Shilpa Chandra, Matteo Pettenò, Nicholas Evans, Michele Panariello, Massimiliano Todisco, Tom Bäckström, Dorothea Kolossa, Rainer Martin, Themos Stafylakis, Nicolas Gengembre