https://arxiv.org/api/OpeHfZ8tbMtk0g6QeC6JlAqkedo
2026-06-22T21:49:08Z
2177451015

http://arxiv.org/abs/2402.07619v2
Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data
2026-05-12T08:04:51Z
COVID-19 has affected more than 223 countries worldwide and in the Post-COVID Era, there is a pressing need for non-invasive, low-cost, and highly scalable solutions to detect COVID-19. We develop a deep learning model to identify COVID-19 from voice recording data. The novelty of this work is in the development of deep learning models for COVID-19 identification from only voice recordings. We use the Cambridge COVID-19 Sound database which contains 893 speech samples, crowd-sourced from 4352 participants via a COVID-19 Sounds app. Voice features, including Mel-spectrograms, Mel-frequency cepstral coefficients (MFCC), and CNN Encoder features, are extracted. Based on the voice data, we develop deep learning classification models to detect COVID-19 cases. These models include Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Hidden-Unit BERT (HuBERT). We compare their predictive power to baseline machine learning models. HuBERT achieves the highest accuracy of 86% and the highest AUC of 0.93. The results achieved with the proposed models are promising for COVID-19 diagnosis from voice recordings when compared to the results obtained from the state of the art.
2024-02-12T12:52:47Z
arXiv admin note: text overlap with arXiv:2209.03727
Yuyang Yan, Wafaa Aljbawi, Sami O. Simons, Visara Urovi
10.37349/edht.2024.00022

http://arxiv.org/abs/2602.16416v2
Online Single-Channel Audio-Based Sound Speed Estimation for Robust Multi-Channel Audio Control
2026-05-12T06:50:44Z
Robust spatial audio control relies on accurate acoustic propagation models, yet environmental variations, especially changes in the speed of sound, cause systematic mismatches that degrade performance.
Existing methods either assume known sound speed, require multiple microphones, or rely on separate calibration, making them impractical for systems with minimal sensing. We propose an online sound speed estimator that operates during general multichannel audio playback and requires only a single observation microphone. The method exploits the structured effect of sound speed on the reproduced signal and estimates it by minimizing the mismatch between the measured audio and a parametric acoustic model. Simulations show accurate tracking of sound speed for diverse input signals and improved spatial control performance when the estimates are used to compensate propagation errors in a sound zone control framework.
2026-02-18T12:48:55Z
Accepted for publication at EUSIPCO 2026
Andreas Jonas Fuglsig, Mads Græsbøll Christensen, Jesper Rindom Jensen

http://arxiv.org/abs/2605.10815v2
Probing Cross-modal Information Hubs in Audio-Visual LLMs
2026-05-12T02:51:58Z
Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information.
Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
2026-05-11T16:34:18Z
Accepted by ICML 2026
Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim, Joon Son Chung

http://arxiv.org/abs/2605.11422v1
Chunkwise Aligners for Streaming Speech Recognition
2026-05-12T02:16:26Z
We propose the Chunkwise Aligner, a novel architecture for streaming automatic speech recognition (ASR). While the Transducer is the standard model for streaming ASR, its training is costly due to the need to compute all possible audio-label alignments. The recently introduced Aligner reduces this cost by discarding explicit alignments, but this modification makes it unsuitable for streaming. Our approach overcomes this limitation by dividing the audio into chunks and aligning each label to the leftmost frames of its chunk, whereas transitions between chunks are managed by a learned end-of-chunk probability. Experiments show that the Chunkwise Aligner not only matches the Transducer's accuracy in both offline and streaming scenarios, but also offers superior training and decoding efficiencies.
2026-05-12T02:16:26Z
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 18282-18286
Wen Shen Teo, Takafumi Moriya, Masato Mimura
10.1109/ICASSP55912.2026.11463400

http://arxiv.org/abs/2604.12928v3
MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
2026-05-12T00:31:31Z
Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels.
However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
2026-04-14T16:17:52Z
Accepted to ICML 2026
Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez

http://arxiv.org/abs/2605.11286v1
Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming
2026-05-11T22:13:05Z
Reliable adaptive beamforming is critical for large microphone arrays operating in highly dynamic acoustic environments. In scenarios characterized by fast-moving talkers and interferers, the available sample support for estimating the spatial correlation matrix is often snapshot-deficient. This deficiency degrades the White Noise Gain (WNG), leading to severe target signal cancellation. To ensure stable and robust beamforming, we previously proposed an adaptive diagonal loading method that leverages the Kantorovich inequality to guarantee the WNG remains strictly within specified bounds.
However, accurately determining the smallest necessary loading level requires calculating the extreme eigenvalues of the spatial correlation matrix, a computationally expensive $\mathcal{O}(M^3)$ operation for large arrays. In this paper, we introduce a highly efficient $\mathcal{O}(kM^2)$ estimation technique using Lanczos iterations to build a small Krylov subspace. By projecting the correlation matrix onto a tridiagonal matrix of dimension $k \ll M$, we extract Ritz values that rapidly converge to the exact extreme eigenvalues. Our evaluations demonstrate that this Lanczos-accelerated approach achieves performance identical to exact Eigenvalue Decomposition (EVD), ensuring optimal interference suppression and strict WNG adherence at a fraction of the computational cost.
2026-05-11T22:13:05Z
5 pages, 8 figures
Manan Mittal, Ryan M. Corey, John R. Buck, Andrew C. Singer

http://arxiv.org/abs/2507.23511v3
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
2026-05-11T14:54:52Z
While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability.
A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
2025-07-31T12:47:43Z
Accepted to ICML 2026
Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan

http://arxiv.org/abs/2509.08031v3
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
2026-05-11T14:29:23Z
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipelines that bottleneck large-scale studies; (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of a unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs.
AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
2025-09-09T15:30:40Z
Hoang Nguyen, Sidharth Surapaneni, Akshay Kalkunte, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Khyati Mahajan, Jash Shah, Shruthan Radhakrishna, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Sai Rajeswar

http://arxiv.org/abs/2605.10398v1
SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements
2026-05-11T11:40:57Z
Reconstructing a 3D sound field from sparse microphone measurements is a fundamental yet ill-posed problem, which we address through Acoustic Transfer Function (ATF) magnitude estimation. ATF magnitude encapsulates key perceptual and acoustic properties of a physical space with applications in room characterization and correction. Although recent generative paradigms such as Flow Matching (FM) have achieved state-of-the-art performance in speech and music generation, their potential in spatial audio remains underexplored. We propose a novel framework for 3D ATF magnitude reconstruction as a guided generation task, with a 3D U-Net conditioned by a permutation-invariant set encoder. This architecture enables reconstruction from an arbitrary number of sparse inputs while leveraging the stable and efficient training properties of FM. Experimental results demonstrate that SF-Flow achieves accurate reconstruction up to 1 kHz, trains substantially faster than the autoencoder baseline, and improves significantly with dataset size.
2026-05-11T11:40:57Z
Ege Erdem, Shoichi Koyama, Tomohiko Nakamura, Orchisama Das, Zoran Cvetković

http://arxiv.org/abs/2509.24674v2
Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing
2026-05-11T10:47:35Z
We propose a novel zero-shot source tracing framework inspired by speaker verification. We adapt SSL-AASIST for attack classification, enhancing embeddings with AAM loss and RegMixup, and ensure that training attacks are disjoint from those forming fingerprint-trial pairs.
For backend scoring in attack verification, we explore both zero-shot approaches (cosine similarity and Siamese) and few-shot approaches (MLP and Siamese). Experiments on our recently introduced STOPA dataset in an open-set setting show that few-shot learning provides advantages in the in-distribution (ID) scenario, while zero-shot approaches perform better in the out-of-distribution (OOD) scenario. In attack source verification with ID trials, few-shot Siamese and MLP achieve equal error rates (EER) of 17.72% and 13.11%, compared to 29.91% for zero-shot cosine scoring. Conversely, in OOD trials, zero-shot cosine scoring reaches 16.43%, outperforming few-shot Siamese at 23.47% and MLP at 21.57%.
2025-09-29T12:14:58Z
Accepted to Odyssey 2026
Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen

http://arxiv.org/abs/2605.10203v1
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
2026-05-11T08:49:26Z
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while non-target stems are preserved.
For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. Specifically, Polyphonia achieves an increase of 15.5% in target alignment compared to baselines, while maintaining competitive music fidelity and non-target integrity.
2026-05-11T08:49:26Z
Accepted by ICML 2026
Haowen Li, Tianxiang Li, Yi Yang, Boyu Cao, Qi Liu

http://arxiv.org/abs/2605.10199v1
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
2026-05-11T08:46:47Z
Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode.
These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
2026-05-11T08:46:47Z
Hui Lu, Xueyuan Chen, Huimeng Wang, Shuhai Peng, Shiyin Kang, Xixin Wu, Zhiyong Wu

http://arxiv.org/abs/2605.10084v1
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
2026-05-11T07:05:11Z
The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a $2\times$ acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.
2026-05-11T07:05:11Z
9 pages, 3 figures
Alejandro Luebs, Mithilesh Vaidya, Ishaan Kumar, Sumukh Badam, Stephen W. Bailey, Matthew Bendel, Jose Sotelo, Xingzhe He

http://arxiv.org/abs/2602.01861v3
RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses
2026-05-11T05:23:14Z
Room impulse responses (RIRs) are essential for many acoustic signal processing tasks, yet measuring them densely across space is often impractical. In this work, we propose RIR-Former, a grid-free, one-step feed-forward model for RIR reconstruction. By introducing a sinusoidal encoding module into a transformer backbone, our method effectively incorporates microphone position information, enabling interpolation at arbitrary array locations. Furthermore, a segmented multi-branch decoder is designed to separately handle early reflections and late reverberation, improving reconstruction across the entire RIR. Experiments on diverse simulated acoustic environments demonstrate that RIR-Former consistently outperforms state-of-the-art baselines in terms of normalized mean square error (NMSE) and cosine distance (CD), under varying missing rates and array configurations. These results highlight the potential of our approach for practical deployment and motivate future work on scaling from randomly spaced linear arrays to complex array geometries, dynamic acoustic scenes, and real-world environments.
2026-02-02T09:33:54Z
Published in ICASSP 2026. Code: https://github.com/ShaoHenry/RIR-Former . Equal contribution: Shaoheng Xu and Chunyi Sun
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 15312-15316, 2026
Shaoheng Xu, Chunyi Sun, Jihui Zhang, Prasanga N. Samarasinghe, Thushara D. Abhayapala
10.1109/ICASSP55912.2026.11462487

http://arxiv.org/abs/2510.19414v2
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
2026-05-11T03:38:57Z
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft.
While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
2025-10-22T09:34:31Z
ICASSP 2026
Tong Zhang, Yihuan Huang, Yanzhen Ren
10.1109/ICASSP55912.2026.11464778