https://arxiv.org/api/LNuffBpu77ESdX99Z93kGHpIAMU2026-06-18T08:38:40Z2175519515http://arxiv.org/abs/2606.08505v1Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines2026-06-07T08:10:14ZSpeech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe: coarser segmentation stride and per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI, but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.2026-06-07T08:10:14ZFumiaki Yamaguchihttp://arxiv.org/abs/2602.07977v2Detect, Attend and Extract: Keyword Guided Target Speaker Extraction2026-06-07T08:01:33ZTarget speaker extraction (TSE) aims to extract the speech of a target speaker from mixtures containing multiple competing speakers. Conventional TSE systems predominantly rely on speaker cues, such as pre-enrolled speech, to identify and isolate the target speaker. However, in many practical scenarios, clean enrollment utterances are unavailable, limiting the applicability of existing approaches. In this work, we propose DAE-TSE, a keyword-guided TSE framework that specifies the target speaker through distinct keywords they utter. By leveraging keywords (i.e., partial transcriptions) as cues, our approach provides a flexible and practical alternative to enrollment-based TSE. DAE-TSE follows the Detect-Attend-Extract (DAE) paradigm: it first detects the presence of the given keywords, then attends to the corresponding speaker based on the keyword content, and finally extracts the target speech. Experimental results demonstrate that DAE-TSE outperforms standard TSE systems that rely on clean enrollment speech. To the best of our knowledge, this is the first study to utilize partial transcription as a cue for specifying the target speaker in TSE, offering a flexible and practical solution for real-world scenarios. Our code and demo page are now publicly available.2026-02-08T14:06:11Z4 figures, 4 tables. Accepted by IJCAI-ECAI 2026Haoyu LiYu XiYidi JiangShuai WangKate KnillMark GalesHaizhou LiKai Yuhttp://arxiv.org/abs/2603.08977v2Universal Speech Content Factorization2026-06-07T05:11:25ZWe propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.2026-03-09T22:11:40ZAccepted to Interspeech 2026Henry Li XinyuanZexin CaiLin ZhangLeibny Paola García-PereraBerrak SismanSanjeev KhudanpurNicholas AndrewsMatthew Wiesnerhttp://arxiv.org/abs/2606.08435v1Sound Field Interpolation Using Physics-Informed Extreme Learning Machine with Pre-Training2026-06-07T03:21:17ZNumerous machine learning-based sound field interpolation methods have been proposed. In particular, physics-informed neural networks (PINNs) can accurately interpolate sound fields from a small number of microphones. However, their high computational cost and long training time pose practical challenges for applications requiring real-time processing or online learning. To address this, we propose a hybrid framework that combines PINN-based pre-training with a physics-informed extreme learning machine (PIELM) tailored for acoustic fields. By replacing iterative PINN fine-tuning for each target sound field with closed-form output-layer adaptation using hidden-layer weights pre-trained by PINN, the proposed method efficiently interpolates unknown sound fields from limited observations. Simulation results under simplified one-dimensional free-field conditions demonstrate that, given a pre-trained model, the proposed method achieves interpolation accuracy comparable to that of PINN-based fine-tuning while reducing the adaptation time by more than three orders of magnitude.2026-06-07T03:21:17ZThis work has been submitted to the IEEE for possible publicationHayato KomabaGen SatoKen KurataYusuke Ikedahttp://arxiv.org/abs/2606.08425v1TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints2026-06-07T02:50:24ZCurrent advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.2026-06-07T02:50:24ZAccepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.appVinh-Thuan Lyhttp://arxiv.org/abs/2606.08393v1SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation2026-06-07T01:10:11ZVideo-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.2026-06-07T01:10:11Z6 pages, 4 figuresHaoyu ZhangYuta OshimaXingjian DuChunfeng WangIrene LiYusuke IwasawaYutaka Matsuohttp://arxiv.org/abs/2606.08247v1AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals2026-06-06T16:11:45ZAcute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.2026-06-06T16:11:45Z10 pages, 8 figures, 5 tables, 14 equationsAueaphum Aueawatthanaphisuthttp://arxiv.org/abs/2606.08210v1Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion2026-06-06T14:54:44ZAutomated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.2026-06-06T14:54:44ZAccepted at INTERSPEECH 2026 (Main)Rashini LiyanarachchiRachael MackayAlison ShortAditya JoshiErik Meijeringhttp://arxiv.org/abs/2606.08171v1Predictive Fixed-Filter Active Noise Control (PFANC) Using Convolutional Recurrent Neural Networks for Dynamic Noises2026-06-06T13:35:34ZThe existing Generative Fixed-Filter Active Noise Control (GFANC) method generates a suitable control filter based on the current noise frame. This reactive design aims to estimate a control filter that is optimal for the present frame rather than the upcoming one. Consequently, it suffers from an inherent tracking lag and lacks the predictive capability to handle rapidly varying noises. To address this limitation, we propose the Predictive Fixed-Filter Active Noise Control (PFANC) method with a proactive control paradigm in this paper. In the PFANC method, multiple consecutive noise frames are processed by a Convolutional Recurrent Neural Network (CRNN) to predict the next-frame control filter. By utilizing temporal correlations across noise frames to anticipate the control filter in advance, the PFANC method can effectively track dynamic noise changes. Furthermore, the theoretical analysis based on a high-order Markov chain shows that incorporating multiple noise frames enhances the prediction of the control filter. Numerical simulations with linear and logarithmic chirp signals, as well as real-world dynamic noises, validate the effectiveness of the PFANC method and its superiority over GFANC and its variations. The PFANC method also exhibits good transferability across different acoustic paths.2026-06-06T13:35:34ZZhengding LuoHaowen LiHaozhe MaDongyuan ShiWen ZhangWoon-Seng Ganhttp://arxiv.org/abs/2512.20978v2GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model2026-06-06T11:42:51ZLanguage Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further apply DPO to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.2025-12-24T06:13:02ZAccepted to Interspeech2026Haoyang LiXuyi ZhuangAzmat AdnanYe NiWei RaoShreyas GopalEng Siong ChngBoon Siew HanYuanjin Zhenghttp://arxiv.org/abs/2601.04178v2Sound Event Detection with Boundary-Aware Optimization and Inference2026-06-06T11:02:56ZTemporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.2026-01-07T18:45:29ZAccepted for publication in IEEE Signal Processing Letters, 2026Florian SchmidChi Ian TangSanjeel ParekhVamsi Krishna IthapuJuan Azcarreta OrtizGiacomo FerroniYijun QianArnoldas JasonasCosmin FrateanuCamilla ClarkGerhard WidmerÇağdaş Bilenhttp://arxiv.org/abs/2312.15946v3EnchantDance: Unveiling the Potential of Music-Driven Dance Movement2026-06-06T10:48:53ZThe task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements. 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music. 3) The protracted nature of dance movements poses a challenge to the maintenance of a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.2023-12-26T08:19:10ZProject Page: https://fluide1022.github.io/EnchantDance/Bo HanTeng ZhangZeyu LingFeilin Hanhttp://arxiv.org/abs/2602.20967v2Training-Free Intelligibility-Guided Observation Addition for Noisy ASR2026-06-06T10:41:57ZAutomatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addressed this issue by fusing noisy and SE enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhances generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and frame versus utterance-level OA further validate the proposed design.2026-02-24T14:46:54ZAccepted to Interspeech2026Haoyang LiChangsong LiuWei RaoHao ShiSakriani SaktiEng Siong Chnghttp://arxiv.org/abs/2602.23958v2An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance2026-06-06T08:55:46ZFréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.2026-02-27T12:01:16ZAccepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-biasWonwoo Jeonghttp://arxiv.org/abs/2604.24199v4Speech Enhancement Based on Drifting Models2026-06-06T06:47:13ZWe propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.2026-04-27T09:00:51Z6 pages, 2 figuresLiang XuDiego Caviedes-NozalW. Bastiaan KleijnLongfei Felix YanRasmus Kongsgaard Olsson