http://arxiv.org/abs/2606.08505v1 Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines 2026-06-07T08:10:14Z

Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe, combining a coarser segmentation stride with per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.
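As a rough illustration of the rule quoted above, the following sketch computes mcs = round(f * n) for a few embedding counts and reproduces the "about 89%" recovery figure from the reported DER values. The function names, the `floor` guard for very short recordings, and the example embedding counts are assumptions made for illustration; they are not taken from the paper.

```python
def relative_min_cluster_size(n_embeddings: int, f: float = 0.01, floor: int = 2) -> int:
    """Relative minimum cluster size: mcs = round(f * n).

    `floor` is an assumed safeguard (not stated in the abstract) so that
    very short recordings never get a minimum cluster size of 0 or 1.
    """
    return max(floor, round(f * n_embeddings))


def recovered_fraction(baseline: float, degraded: float, repaired: float) -> float:
    """Share of the lost accuracy that the repaired system wins back."""
    return (degraded - repaired) / (degraded - baseline)


# Hypothetical embedding counts per recording.
for n in (100, 1200, 4000):
    print(n, relative_min_cluster_size(n))

# VoxConverse DER values quoted in the abstract: 0.075 -> 0.113 -> 0.079.
print(f"{recovered_fraction(0.075, 0.113, 0.079):.1%}")
```

With a fixed minimum cluster size, halving the number of embeddings halves the chance that a minor speaker reaches the threshold; scaling the threshold with n keeps that proportion constant.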

2026-06-07T08:10:14Z Fumiaki Yamaguchi http://arxiv.org/abs/2602.07977v2 Detect, Attend and Extract: Keyword Guided Target Speaker Extraction 2026-06-07T08:01:33Z

Target speaker extraction (TSE) aims to extract the speech of a target speaker from mixtures containing multiple competing speakers. Conventional TSE systems predominantly rely on speaker cues, such as pre-enrolled speech, to identify and isolate the target speaker. However, in many practical scenarios, clean enrollment utterances are unavailable, limiting the applicability of existing approaches. In this work, we propose DAE-TSE, a keyword-guided TSE framework that specifies the target speaker through distinct keywords they utter. By leveraging keywords (i.e., partial transcriptions) as cues, our approach provides a flexible and practical alternative to enrollment-based TSE. DAE-TSE follows the Detect-Attend-Extract (DAE) paradigm: it first detects the presence of the given keywords, then attends to the corresponding speaker based on the keyword content, and finally extracts the target speech. Experimental results demonstrate that DAE-TSE outperforms standard TSE systems that rely on clean enrollment speech. To the best of our knowledge, this is the first study to utilize partial transcription as a cue for specifying the target speaker in TSE, offering a flexible and practical solution for real-world scenarios. Our code and demo page are now publicly available.

2026-02-08T14:06:11Z 4 figures, 4 tables. Accepted by IJCAI-ECAI 2026 Haoyu Li Yu Xi Yidi Jiang Shuai Wang Kate Knill Mark Gales Haizhou Li Kai Yu http://arxiv.org/abs/2603.08977v2 Universal Speech Content Factorization 2026-06-07T05:11:25Z

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.

2026-03-09T22:11:40Z Accepted to Interspeech 2026 Henry Li Xinyuan Zexin Cai Lin Zhang Leibny Paola García-Perera Berrak Sisman Sanjeev Khudanpur Nicholas Andrews Matthew Wiesner http://arxiv.org/abs/2606.08435v1 Sound Field Interpolation Using Physics-Informed Extreme Learning Machine with Pre-Training 2026-06-07T03:21:17Z

Numerous machine learning-based sound field interpolation methods have been proposed. In particular, physics-informed neural networks (PINNs) can accurately interpolate sound fields from a small number of microphones. However, their high computational cost and long training time pose practical challenges for applications requiring real-time processing or online learning. To address this, we propose a hybrid framework that combines PINN-based pre-training with a physics-informed extreme learning machine (PIELM) tailored for acoustic fields. By replacing iterative PINN fine-tuning for each target sound field with closed-form output-layer adaptation using hidden-layer weights pre-trained by PINN, the proposed method efficiently interpolates unknown sound fields from limited observations. Simulation results under simplified one-dimensional free-field conditions demonstrate that, given a pre-trained model, the proposed method achieves interpolation accuracy comparable to that of PINN-based fine-tuning while reducing the adaptation time by more than three orders of magnitude.
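The closed-form adaptation step can be sketched as a single least-squares solve. The example below is a minimal illustration, assuming a real-valued 1D Helmholtz equation p'' + k^2 p = 0, a single tanh hidden layer, and randomly drawn hidden weights standing in for the PINN-pre-trained ones; the wavenumber, layer size, microphone positions, and function names are all hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2.0 * np.pi            # assumed wavenumber for the 1D Helmholtz equation
n_hidden = 80

# Stand-in for PINN-pre-trained hidden-layer weights (random here).
w = rng.uniform(-4.0, 4.0, n_hidden)
b = rng.uniform(-4.0, 4.0, n_hidden)


def hidden(x):
    """Hidden activations tanh(w x + b), shape (len(x), n_hidden)."""
    return np.tanh(np.outer(x, w) + b)


def hidden_xx(x):
    """Second derivative of the hidden activations with respect to x."""
    t = hidden(x)
    return (w ** 2) * (-2.0 * t * (1.0 - t ** 2))


def adapt_output_layer(x_obs, p_obs, x_col, physics_weight=1.0):
    """Solve for output weights using data rows plus Helmholtz-residual rows."""
    data_rows = hidden(x_obs)
    pde_rows = physics_weight * (hidden_xx(x_col) + k ** 2 * hidden(x_col))
    A = np.vstack([data_rows, pde_rows])
    y = np.concatenate([p_obs, np.zeros(len(x_col))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A, y


x_obs = np.linspace(0.0, 1.0, 6)     # a few "microphone" positions
p_obs = np.cos(k * x_obs)            # toy standing-wave pressure
x_col = np.linspace(0.0, 1.0, 50)    # collocation points for the PDE term
beta, A, y = adapt_output_layer(x_obs, p_obs, x_col)

x_test = np.linspace(0.0, 1.0, 200)
p_hat = hidden(x_test) @ beta
print("max abs interpolation error:", np.max(np.abs(p_hat - np.cos(k * x_test))))
```

Because only `beta` is solved for, adapting to a new sound field costs one linear solve instead of an iterative fine-tuning loop, which is the source of the speedup the abstract describes.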

2026-06-07T03:21:17Z This work has been submitted to the IEEE for possible publication Hayato Komaba Gen Sato Ken Kurata Yusuke Ikeda http://arxiv.org/abs/2606.08425v1 TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints 2026-06-07T02:50:24Z

Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.

2026-06-07T02:50:24Z Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app Vinh-Thuan Ly http://arxiv.org/abs/2606.08393v1 SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation 2026-06-07T01:10:11Z

Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.
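Systematic resampling, which the ablation identifies as a strong default, can be sketched in a few lines. The example below is a generic illustration rather than the authors' implementation: the softmax conversion from rewards to weights, the temperature parameter, and the reward values are assumptions.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate


def softmax_weights(rewards, temperature=1.0):
    """Turn per-particle rewards into normalized weights (assumed weighting)."""
    m = max(rewards)
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def systematic_resample(weights, u=None):
    """Systematic resampling: one uniform offset, n evenly spaced pointers."""
    n = len(weights)
    if u is None:
        u = random.random()
    total = sum(weights)
    cumulative = list(accumulate(wt / total for wt in weights))
    positions = [(u + i) / n for i in range(n)]
    # The clamp guards against floating-point round-off at the upper end.
    return [min(bisect_right(cumulative, p), n - 1) for p in positions]


# Hypothetical lookahead rewards for four partial trajectories.
rewards = [0.1, 0.9, 0.4, 0.8]
weights = softmax_weights(rewards, temperature=0.2)
print(systematic_resample(weights))
```

Using a single random offset for all pointers keeps the number of copies of each particle within one of its expected value, which lowers resampling variance compared with drawing each index independently.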

2026-06-07T01:10:11Z 6 pages, 4 figures Haoyu Zhang Yuta Oshima Xingjian Du Chunfeng Wang Irene Li Yusuke Iwasawa Yutaka Matsuo http://arxiv.org/abs/2606.08247v1 AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals 2026-06-06T16:11:45Z

Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.

2026-06-06T16:11:45Z 10 pages, 8 figures, 5 tables, 14 equations Aueaphum Aueawatthanaphisut http://arxiv.org/abs/2606.08210v1 Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion 2026-06-06T14:54:44Z

Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.

2026-06-06T14:54:44Z Accepted at INTERSPEECH 2026 (Main) Rashini Liyanarachchi Rachael Mackay Alison Short Aditya Joshi Erik Meijering http://arxiv.org/abs/2606.08171v1 Predictive Fixed-Filter Active Noise Control (PFANC) Using Convolutional Recurrent Neural Networks for Dynamic Noises 2026-06-06T13:35:34Z

The existing Generative Fixed-Filter Active Noise Control (GFANC) method generates a suitable control filter based on the current noise frame. This reactive design aims to estimate a control filter that is optimal for the present frame rather than the upcoming one. Consequently, it suffers from an inherent tracking lag and lacks the predictive capability to handle rapidly varying noises. To address this limitation, we propose the Predictive Fixed-Filter Active Noise Control (PFANC) method, which adopts a proactive control paradigm. In the PFANC method, multiple consecutive noise frames are processed by a Convolutional Recurrent Neural Network (CRNN) to predict the next-frame control filter. By utilizing temporal correlations across noise frames to anticipate the control filter in advance, the PFANC method can effectively track dynamic noise changes. Furthermore, the theoretical analysis based on a high-order Markov chain shows that incorporating multiple noise frames enhances the prediction of the control filter. Numerical simulations with linear and logarithmic chirp signals, as well as real-world dynamic noises, validate the effectiveness of the PFANC method and its superiority over GFANC and its variations. The PFANC method also exhibits good transferability across different acoustic paths.

2026-06-06T13:35:34Z Zhengding Luo Haowen Li Haozhe Ma Dongyuan Shi Wen Zhang Woon-Seng Gan http://arxiv.org/abs/2512.20978v2 GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model 2026-06-06T11:42:51Z

Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further apply DPO to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.

2025-12-24T06:13:02Z Accepted to Interspeech2026 Haoyang Li Xuyi Zhuang Azmat Adnan Ye Ni Wei Rao Shreyas Gopal Eng Siong Chng Boon Siew Han Yuanjin Zheng http://arxiv.org/abs/2601.04178v2 Sound Event Detection with Boundary-Aware Optimization and Inference 2026-06-06T11:02:56Z

Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.

2026-01-07T18:45:29Z Accepted for publication in IEEE Signal Processing Letters, 2026 Florian Schmid Chi Ian Tang Sanjeel Parekh Vamsi Krishna Ithapu Juan Azcarreta Ortiz Giacomo Ferroni Yijun Qian Arnoldas Jasonas Cosmin Frateanu Camilla Clark Gerhard Widmer Çağdaş Bilen http://arxiv.org/abs/2312.15946v3 EnchantDance: Unveiling the Potential of Music-Driven Dance Movement 2026-06-06T10:48:53Z

The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: (1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; (2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; and (3) the protracted nature of dance movements, which makes it difficult to maintain a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on that latent space. To address the data gap, we construct a large-scale music-dance dataset, the ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.

2023-12-26T08:19:10Z Project Page: https://fluide1022.github.io/EnchantDance/ Bo Han Teng Zhang Zeyu Ling Feilin Han http://arxiv.org/abs/2602.20967v2 Training-Free Intelligibility-Guided Observation Addition for Noisy ASR 2026-06-06T10:41:57Z

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
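The fusion step in observation addition is a weighted sum of the noisy and enhanced signals. The sketch below shows that sum at the utterance level; the rule that maps two intelligibility scores to a weight is a hypothetical placeholder, since the abstract does not specify how the backend ASR estimates are converted into fusion weights.

```python
import numpy as np


def observation_addition(noisy, enhanced, weight_noisy):
    """Fuse two waveforms: w * noisy + (1 - w) * enhanced, with w in [0, 1]."""
    w = float(np.clip(weight_noisy, 0.0, 1.0))
    return w * np.asarray(noisy, dtype=float) + (1.0 - w) * np.asarray(enhanced, dtype=float)


def weight_from_scores(score_noisy, score_enhanced, eps=1e-8):
    """Hypothetical rule: the noisy signal's share of the two intelligibility scores."""
    return score_noisy / (score_noisy + score_enhanced + eps)


# Toy waveforms and made-up intelligibility scores (e.g. ASR confidences).
noisy = np.array([0.2, -0.1, 0.4])
enhanced = np.array([0.1, 0.0, 0.3])
w = weight_from_scores(score_noisy=0.6, score_enhanced=0.9)
print(observation_addition(noisy, enhanced, w))
```

Neither the SE model nor the ASR model is modified here; only the mixing weight changes per utterance, which is what makes this family of methods training-free when the weight itself needs no learned predictor.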

2026-02-24T14:46:54Z Accepted to Interspeech2026 Haoyang Li Changsong Liu Wei Rao Hao Shi Sakriani Sakti Eng Siong Chng http://arxiv.org/abs/2602.23958v2 An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance 2026-06-06T08:55:46Z

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
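For reference, FAD is the Fréchet distance between Gaussians fitted to reference and generated embeddings: the squared distance between the means plus Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below computes it with NumPy on synthetic embeddings; the random arrays stand in for the output of whichever encoder is used, and the function name is an assumption.

```python
import numpy as np


def frechet_distance(emb_ref, emb_gen):
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input has shape (n_samples, dim).
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) is the sum of square roots of the eigenvalues
    # of the product, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))              # stand-in encoder embeddings
generated = reference + np.array([3.0, 0.0, 4.0, 0.0])
print(frechet_distance(reference, generated))
```

Every term depends only on the embedding statistics, so any acoustic property the encoder discards cannot affect the score, which is how the task-induced biases discussed above enter the metric.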

2026-02-27T12:01:16Z Accepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-bias Wonwoo Jeong http://arxiv.org/abs/2604.24199v4 Speech Enhancement Based on Drifting Models 2026-06-06T06:47:13Z

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2026-04-27T09:00:51Z 6 pages, 2 figures Liang Xu Diego Caviedes-Nozal W. Bastiaan Kleijn Longfei Felix Yan Rasmus Kongsgaard Olsson