https://arxiv.org/api/SkSd2UUvENg9SMknGYE8Sen6XUI2026-06-09T22:26:57Z209313015http://arxiv.org/abs/2606.08505v1Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines2026-06-07T08:10:14ZSpeech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe: coarser segmentation stride and per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI, but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.2026-06-07T08:10:14ZFumiaki Yamaguchihttp://arxiv.org/abs/2603.08977v2Universal Speech Content Factorization2026-06-07T05:11:25ZWe propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.2026-03-09T22:11:40ZAccepted to Interspeech 2026Henry Li XinyuanZexin CaiLin ZhangLeibny Paola García-PereraBerrak SismanSanjeev KhudanpurNicholas AndrewsMatthew Wiesnerhttp://arxiv.org/abs/2606.08425v1TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints2026-06-07T02:50:24ZCurrent advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.2026-06-07T02:50:24ZAccepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.appVinh-Thuan Lyhttp://arxiv.org/abs/2606.08385v1A Switching Beamformer for Highly Non-Stationary Environments2026-06-07T00:44:39ZAdaptive beamforming is a cornerstone of array signal processing, yet its performance often collapses in the face of complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators encounter a fundamental memory trade-off: short windows enable rapid tracking but suffer from high estimation variance, while long windows provide stable rejection but fail to adapt to shifts. This challenge is resolved by introducing the Universal Switching Beamformer (USB), which integrates competitive sequential prediction into the beamforming architecture. By employing a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them based on their cumulative output power. This mechanism allows the beamformer to automatically vary its effective memory length without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that selects the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset demonstrate that the USB achieves the agility of short-window estimators and the precision of long-term integration, providing a principled solution for tracking highly non-stationary scenes.2026-06-07T00:44:39Z11 pages, 19 figures, under reviewManan MittalRyan M. CoreyJohn R. BuckAndrew C. Singerhttp://arxiv.org/abs/2606.08286v1FXplorer: A Map-Based Interface for Exploratory Audio Effect Design2026-06-06T18:14:41ZAudio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.2026-06-06T18:14:41ZAccepted to NIME 2026. Project page: https://anniejchu.github.io/fxplorer/Annie ChuJason Brent SmithBryan Pardohttp://arxiv.org/abs/2606.08210v1Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion2026-06-06T14:54:44ZAutomated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.2026-06-06T14:54:44ZAccepted at INTERSPEECH 2026 (Main)Rashini LiyanarachchiRachael MackayAlison ShortAditya JoshiErik Meijeringhttp://arxiv.org/abs/2511.18421v2DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation2026-06-06T13:46:03ZExisting Test-time Adaptation (TTA) studies rely heavily on static and homogeneous corruption protocols, such as ImageNet-C and CIFAR-10-C/100-C, leading to inconsistent evaluation settings and potentially inflated robustness estimates that are compared with real-world situations. TTA lacks a standardized evaluation infrastructure capable of modeling realistic heterogeneous acoustic degradation. We introduce DHAuDS, a standardized benchmark suite for evaluating audio classification TTA robustness under dynamic corruption severity and heterogeneous noise mixtures. Rather than proposing a new TTA algorithm, DHAuDS focuses on exposing robustness limitations that remain hidden under conventional fixed-noise evaluation protocols.2025-11-23T12:19:23ZWeichuang ShaoIman Yi LiaoTomas Henrique Bode MaulTissa Chandesahttp://arxiv.org/abs/2509.02167v2AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition2026-06-06T12:22:03ZRecently, Transformers (e.g., Audio Spectrogram Transformers, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data. To address these challenges, this paper proposes AudioRWKV (A-RWKV), a highly efficient and stable architecture for audio modeling. Specifically, we inherit the stable and efficient recurrent formulation of RWKV7 and replace its 1D token-shift operation with a 2D depthwise separable convolution to better capture local spectro-temporal patterns. Furthermore, we adapt the original causal WKV kernel into a bidirectional WKV kernel (Bi-WKV), enabling global context modeling over the entire audio sequence while maintaining linear computational complexity. Benefiting from the inherent stability of the RWKV7 foundation, A-RWKV scales seamlessly to larger model sizes. Experimental results demonstrate that, under the same linear-model regime, A-RWKV-S (22M) achieves performance parity with AuM-B (92M) while exhibiting more stable throughput than AST; for long-form audio (~5 minutes 28 seconds), WKV7 achieves up to a 13.3X speedup in processing.2025-09-02T10:20:31Z6 pages, 3 figuresJing WangMaoxiang WuJiayu XiongJianlong KwanJun Xuehttp://arxiv.org/abs/2601.04178v2Sound Event Detection with Boundary-Aware Optimization and Inference2026-06-06T11:02:56ZTemporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.2026-01-07T18:45:29ZAccepted for publication in IEEE Signal Processing Letters, 2026Florian SchmidChi Ian TangSanjeel ParekhVamsi Krishna IthapuJuan Azcarreta OrtizGiacomo FerroniYijun QianArnoldas JasonasCosmin FrateanuCamilla ClarkGerhard WidmerÇağdaş Bilenhttp://arxiv.org/abs/2312.15946v3EnchantDance: Unveiling the Potential of Music-Driven Dance Movement2026-06-06T10:48:53ZThe task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements. 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music. 3) The protracted nature of dance movements poses a challenge to the maintenance of a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.2023-12-26T08:19:10ZProject Page: https://fluide1022.github.io/EnchantDance/Bo HanTeng ZhangZeyu LingFeilin Hanhttp://arxiv.org/abs/2602.20967v2Training-Free Intelligibility-Guided Observation Addition for Noisy ASR2026-06-06T10:41:57ZAutomatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addressed this issue by fusing noisy and SE enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhances generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and frame versus utterance-level OA further validate the proposed design.2026-02-24T14:46:54ZAccepted to Interspeech2026Haoyang LiChangsong LiuWei RaoHao ShiSakriani SaktiEng Siong Chnghttp://arxiv.org/abs/2606.08087v1Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference2026-06-06T10:23:18ZDeep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.2026-06-06T10:23:18ZAccepted to Speaker Odyssey 2026 LisbonHugo LeguillierDriss MatroufGuillaume LechienMickael Rouvierhttp://arxiv.org/abs/2606.08078v1On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation2026-06-06T09:55:37ZAlthough low-bit quantization provides practical means to deploy speaker verification on resource-constrained devices, its effects on speaker verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.2026-06-06T09:55:37ZAccepted at Speaker Odyssey 2026 LisbonHugo LeguillierDriss MatroufGuillaume LechienMickael Rouvierhttp://arxiv.org/abs/2602.23958v2An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance2026-06-06T08:55:46ZFréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.2026-02-27T12:01:16ZAccepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-biasWonwoo Jeonghttp://arxiv.org/abs/2606.08038v1Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis2026-06-06T07:58:02ZThe scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale-first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting.(2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.2026-06-06T07:58:02ZAccepted by Interspeech 2026Zhuolin YiJun XueYanzhen RenYihuan HuangYi ChaiDaixian LiGuanxiang FengJiajun Liu