https://arxiv.org/api/AGl3ySkWtN2FGq38WKNgiX6RxAk 2026-06-22T17:20:28Z 21774 450 15 http://arxiv.org/abs/2605.20403v1 Causal Spatio-Temporal Sound Field Reconstruction 2026-05-19T19:00:36Z In sound field control applications, it is commonly assumed that one has access to an accurate representation of the sound field in the region of interest. This is a problematic assumption since the reconstruction of a sound field from available microphone measurements is especially challenging in real-time applications where only causal measurements are available. Notably, causal time-windowed observations introduce correlation between frequency components, making sound field reconstruction methods that process each frequency band independently sub-optimal. In this work, we formulate a causal finite-window spatio-temporal linear minimum mean-square error estimator for sound field reconstruction. The sound field is modeled as the solution to the wave equation driven by a stationary stochastic spatio-temporal source distribution, which induces a physically interpretable covariance function. It is shown that this covariance function is closely related to the classical diffuse-field coherence model. Since the computational complexity grows rapidly with the number of spatio-temporal observations, we formulate a budget-constrained spatio-temporal sample selection approach to minimize the posterior reconstruction variance. The proposed estimator and sampling strategy are evaluated using both simulated and measured sound fields, demonstrating improved short-window reconstruction compared to frequency domain finite-window baselines. 2026-05-19T19:00:36Z David Sundström Filip Tronarp Johan Lindström Andreas Jakobsson http://arxiv.org/abs/2605.18222v2 Contextual Biasing for Streaming ASR via CTC-based Word Spotting 2026-05-19T17:09:34Z Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. 
While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the challenges of streaming ASR. For example, CTC-based word spotting (CTC-WS) methods have demonstrated strong performance by directly detecting keywords from CTC log-probabilities, but they are limited to offline processing and require access to the full utterance. In this work, we present a streaming extension of CTC-WS for real-time contextual biasing. Our method maintains active keyword paths across audio chunks using a stateful token passing algorithm, enabling the detection of keywords that span multiple chunks. To ensure low latency and stable output, we introduce an incremental commitment mechanism that only emits segments guaranteed not to be affected by future audio, while deferring uncertain regions. This method naturally integrates with streaming ASR pipelines and does not require modifications to the underlying acoustic model or additional training, making it practical for real-world deployment. Experimental results show that our method reduces overall WER and effectively improves keyword F-score, demonstrating its effectiveness for real-time ASR applications. 2026-05-18T11:06:44Z Kai-Chen Tsai Tien-Hong Lo Yun-Ting Sun Berlin Chen http://arxiv.org/abs/2508.07285v3 Non-Intrusive Automatic Speech Recognition Refinement: A Survey 2026-05-19T14:44:40Z Automatic Speech Recognition (ASR) is an integral component of modern technology, powering applications such as voice-activated assistants, transcription services, and accessibility tools. Yet ASR systems continue to struggle with the inherent variability of human speech, such as accents, dialects, and speaking styles, as well as environmental interference, including background noise. Moreover, domain-specific conversations often employ specialized terminology, which can exacerbate transcription errors. 
These shortcomings not only degrade raw ASR accuracy but also propagate mistakes through subsequent natural language processing pipelines. Because redesigning an ASR model is costly and time-consuming, non-intrusive refinement techniques that leave the model's architecture intact have become increasingly popular. In this survey, we review current non-intrusive refinement approaches and group them into five classes: fusion, re-scoring, correction, distillation, and training adjustment. For each class, we outline the main methods, advantages, drawbacks, and ideal application scenarios. Beyond method classification, this work surveys adaptation techniques aimed at refining ASR in domain-specific contexts, reviews commonly used evaluation datasets along with their construction processes, and proposes a standardized set of metrics to facilitate fair comparisons. Finally, we identify open research gaps and suggest promising directions for future work. By providing this structured overview, we aim to equip researchers and practitioners with a clear foundation for developing more robust, accurate ASR refinement pipelines. 2025-08-10T10:46:14Z Mohammad Reza Peyghan Saman Soleimani Roudi Saeedreza Zouashkiani Sajjad Amini Fatemeh Rajabi Shahrokh Ghaemmaghami http://arxiv.org/abs/2605.19833v1 Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation 2026-05-19T13:26:51Z Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. 
We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild. 2026-05-19T13:26:51Z Project page: https://xzf-thu.github.io/Mega-ASR/. Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail Zhifei Xie Kaiyu Pang Haobin Zhang Deheng Ye Xiaobin Hu Shuicheng Yan Chunyan Miao http://arxiv.org/abs/2505.08752v7 Configurations, Tessellations and Tone Networks 2026-05-19T12:00:21Z The tonnetz, which is commonly represented as a tessellation of the plane by a triangular network of tones, can also be represented as a bipartite graph of degree three with twelve vertices denoting major triads and twelve vertices denoting minor triads. We show that this Levi graph can be realized geometrically as a system of twelve points and twelve lines in $\mathbb R^2$ with the property that three points lie on each line and three lines pass through each point, in a configuration $\{12_3\}$ of Daublebsky von Sterneck type D222. This tonnetz configuration, alongside various generalizations thereof, can be used as a new basis for the composition and analysis of music. 2025-05-13T17:13:14Z 37 pages, 19 figures, 3 tables. To be published in Journal of Mathematics and Music Jeffrey R. Boland Lane P. 
Hughston http://arxiv.org/abs/2605.19695v1 Cross-Talk Speech Reduction, by Separation, for Separation 2026-05-19T11:29:35Z In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data. 
2026-05-19T11:29:35Z in submission Zhong-Qiu Wang Samuele Cornell http://arxiv.org/abs/2105.00933v3 Deep Neural Network for Musical Instrument Recognition using MFCCs 2026-05-19T09:11:32Z The task of efficient automatic music classification is of vital importance and forms the basis for various advanced applications of AI in the musical domain. Musical instrument recognition is the task of identifying an instrument from its audio. This audio, i.e., the recorded sound vibrations, is used by the model to match a recording to one of the instrument classes. In this paper, we use an artificial neural network (ANN) model that was trained to perform classification on twenty different classes of musical instruments. Here we use only the mel-frequency cepstral coefficients (MFCCs) of the audio data. Our proposed model is trained on the full London Philharmonic Orchestra dataset, which contains twenty classes of instruments belonging to four families, viz. woodwinds, brass, percussion, and strings. Based on experimental results, our model achieves state-of-the-art accuracy on this dataset. 2021-05-03T15:10:34Z Computacion y Sistemas, Vol 25, No 2 (2021): 25(2) 2021 Saranga Kingkor Mahanta Abdullah Faiz Ur Rahman Khilji Partha Pakray http://arxiv.org/abs/2604.05201v2 Exploring Speech Foundation Models for Speaker Diarization Across Lifespan 2026-05-19T05:55:19Z Speech foundation models have shown strong transferability across a wide range of speech applications. However, their robustness to age-related domain shift in speaker diarization remains underexplored. In this work, we present a cross-lifespan evaluation within a unified end-to-end neural diarization framework (EEND-VC), covering speech samples from conversations involving children, adults, and older adults. We compare models under zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation. 
Results show substantial performance degradation when models trained on adult-specific speech are applied to child and older-adult conversational data. Moreover, joint multi-age training across different age groups improves robustness without reducing diarization performance in canonical adult conversations, while targeted age group adaptation yields further gains in diarization performance, particularly when using the Whisper encoder. 2026-04-06T21:57:21Z Under review Anfeng Xu Tiantian Feng Shrikanth Narayanan http://arxiv.org/abs/2605.19388v1 Fast Multichannel NMF with Block-Diagonal Spatial Covariance Matrices for Efficient Blind Source Separation Using Distributed Microphone Arrays 2026-05-19T05:34:40Z Distributed microphone arrays composed of multiple subarrays enable blind source separation over a wide spatial area. Directly applying fast multichannel nonnegative matrix factorization (FastMNMF) to all subarrays can exploit observations from all subarrays, but it requires repeated inversions of large matrices spanning all microphones, causing the computational cost to increase rapidly as the number of microphones grows. In contrast, applying FastMNMF to one subarray reduces the matrix size but cannot exploit observations from other subarrays. We propose distributed FastMNMF, which imposes a block-diagonal structure on the source spatial covariance matrices, so that matrix inversions are performed within subarrays. The NMF-based source spectrogram model is shared across subarrays, allowing the method to aggregate source activity information while discarding inter-subarray covariance. 
In synchronized, noiseless simulations with fixed room and array/source geometry, the method required less computation time than conventional FastMNMF using all subarrays, achieved a higher average source-to-distortion ratio than conventional FastMNMF using one subarray, and was applicable in the tested five-source condition, where each four-microphone subarray was locally underdetermined. 2026-05-19T05:34:40Z Hirotaka Nishikori Nobutaka Ito Kouei Yamaoka Norihiro Takamune Hiroshi Saruwatari http://arxiv.org/abs/2506.14148v2 Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment 2026-05-19T04:05:49Z This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries. 2025-06-17T03:25:38Z This paper has been retracted by the authors. 
Due to miscommunication, the authorship is incomplete and missing early contributions Long-Vu Hoang Tuan Nguyen Tran Huy Dat http://arxiv.org/abs/2601.09413v2 Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception 2026-05-18T16:18:39Z We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence. 2026-01-14T12:06:50Z Accepted to ACL 2026. Oral Presentation. 
Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073 Zhen Wan Chao-Han Huck Yang Jinchuan Tian Hanrong Ye Ankita Pasad Szu-wei Fu Arushi Goel Ryo Hachiuma Shizhe Diao Kunal Dhawan Sreyan Ghosh Yusuke Hirota Zhehuai Chen Rafael Valle Chenhui Chu Shinji Watanabe Yu-Chiang Frank Wang Boris Ginsburg http://arxiv.org/abs/2605.18442v1 Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters 2026-05-18T14:11:37Z Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades significantly when evaluated on mismatched array geometries. In this paper, we propose a geometry-conditioned SSF (GC-SSF), which incorporates a geometry-conditioning branch based on FiLM layers. Furthermore, we propose a feature that jointly encodes the DOA and the microphone positions (DOA-MPE). The conditioning branch modulates the intermediate feature maps of the SSF using the DOA-MPE feature to capture the spatial relationship between the microphone positions and the target speaker. Experimental results across circular, uniform linear, and random microphone arrays show that the proposed GC-SSF generalizes better to mismatched geometries while maintaining high spatial selectivity, demonstrating its ability to effectively adapt the filtering process to different array geometries. 2026-05-18T14:11:37Z Submitted to IWAENC2026 Jiatong Li Wiebke Middelberg Simon Doclo http://arxiv.org/abs/2401.09512v10 MLAAD: The Multi-Language Audio Anti-Spoofing Dataset 2026-05-18T13:28:38Z This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. 
It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes. 2024-01-17T15:09:02Z IJCNN 2024 Nicolas M. Müller Piotr Kawa Wei Herng Choong Edresson Casanova Eren Gölge Thorsten Müller Piotr Syga Philip Sperl Konstantin Böttinger http://arxiv.org/abs/2605.17964v1 Fractional-Order Subband p-Norm Adaptive Filter via Transformation Nearest Kronecker Product Decomposition for Active Noise Control 2026-05-18T07:18:52Z The conventional normalized subband p-norm (NSPN) algorithm achieves robustness in $α$-stable noise ($1<α\leq 2$) by utilizing low-order error moments. However, its performance degrades significantly under three scenarios: (1) non-Gaussian inputs, (2) $α$-stable noise with $0<α\leq 1$, and (3) sparse system identification. To address these limitations, this paper proposes a fractional-order NSPN algorithm based on the nearest Kronecker product (NKP) decomposition and fractional-order stochastic gradient descent, termed NKP-FoNSPN. Theoretical bounds for the fractional-order parameter $β$ are also derived. Notably, when $β=1$, the NKP-FoNSPN reduces to a new NKP-NSPN algorithm, while its non-NKP decomposition variant becomes the fractional-order NSPN (FoNSPN) algorithm. 
Furthermore, a novel transformation-based NKP (TNKP) decomposition technique is designed, which exhibits lower computational complexity than conventional NKP for specific filter structures. The resulting TNKP-based FoNSPN (TNKP-FoNSPN) achieves lower steady-state misadjustment and multiplication cost compared with the NKP-FoNSPN algorithm. Additionally, complete computational complexity analyses are provided. For active noise control (ANC) scenarios, we develop filtered-x variants: NKP-FxFoNSPN and TNKP-FxFoNSPN. From the former, two additional variants are derived: NKP-FxNSPN and FxFoNSPN. Simulations using diverse noise sources (pink, helicopter, gunshot, pile driver, and traction substation noise) demonstrate the superiority of the proposed algorithms. Finally, we validate their noise reduction performance in a real, constructed single-channel duct ANC system and a simulated multi-channel ANC system. 2026-05-18T07:18:52Z Jianhong Ye Haiquan Zhao Shaohui Lv Yang Zhou 10.1016/j.ymssp.2026.114073 http://arxiv.org/abs/2605.17846v1 UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations 2026-05-18T04:40:31Z Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, US-EngPk. To address Right-to-Left script constraints and frequent code-switching, we developed UrduSpeech, an LLM-driven pipeline to curate data across 12 diverse categories, including news, drama, and rare literary forms like Bait-Bazi. We also release a 9-hour US-Benchmark set, manually corrected by native annotators to serve as a standard. Human quality assessment of the primary 156-hour corpus yielded a Mean Opinion Score (MOS) of 4.6 (std = 0.7) with inter-rater reliability confirmed by a 0.68 Cohen's Kappa, validating our curation pipeline's 97.6% confidence score. 
The corpus maintains a 60-40 gender balance across 71,792 utterances. Our work represents a significant leap toward linguistic inclusivity in global AI. The corpus and code are open-sourced, and a demo page is available. 2026-05-18T04:40:31Z 6 pages, 3 figures, 4 tables Attia Nafees ul Haq Zeyu Zhu Jingbin Hu ChunJiang He Lei Xie