https://arxiv.org/api/E4bGOdX5U6c8CfRvPmq4ez+ohVI 2026-06-18T10:54:00Z 21755 225 15 http://arxiv.org/abs/2606.06907v1 SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models 2026-06-05T04:50:34Z

Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.

2026-06-05T04:50:34Z 5 pages, 5 figures Seonuk Kim Yonghyeon Jun Ju Yeon Kang Jimin Hong Yoonhyeong Lee Nam Soo Kim http://arxiv.org/abs/2606.06837v1 SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails 2026-06-05T02:24:19Z

Scripted vs spontaneous speech detection is appealing for interview guardrails, but benchmark performance can be inflated by shortcuts tied to corpus identity, channel conditions, and recording artifacts rather than speaking style itself. We present SEAM, a shortcut-aware framework for real-time scriptedness detection that combines uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone. With 8s windows, the model achieves 0.971 +- 0.004 ROC-AUC on an external interview-domain evaluation set. Removing the shortcut-prevention components improves internal held-out metrics but sharply reduces external performance, indicating shortcut learning. Post-training quantization reduces the model footprint to 41.8MB with little loss in external performance. The results demonstrate that robust real-time scriptedness detection depends not only on the backbone, but on shortcut-aware data design and evaluation. We release code and model checkpoints.

2026-06-05T02:24:19Z Accepted to Interspeech 2026 Vsevolod V. Kovalev Pranay Manocha http://arxiv.org/abs/2606.06806v1 Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference 2026-06-05T01:14:12Z

Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.

2026-06-05T01:14:12Z Accepted to Interspeech2026 Kentaro Onda Satoru Fukayama Daisuke Saito Nobuaki Minematsu http://arxiv.org/abs/2606.06795v1 BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation 2026-06-05T00:45:28Z

We present BiEAR, a human auditory-inspired adaptive binaural front-end for multi-speaker localisation and distance estimation. Inspired by medial olivocochlear (MOC) feedback in human hearing, BiEAR uses a neural controller to adaptively adjust the frequency selectivity of a binaural auditory filterbank during inference. This yields time-frequency adaptive representations for ears, enabling the model to respond to changing acoustic conditions. We evaluate BiEAR on multi-speaker localisation and distance estimation in anechoic and real-room environments. Results show that the adaptive front-end improves localisation accuracy and robustness to unseen speakers and rooms compared with commonly used fixed binaural front-ends. Visualisation and analysis of learned filter adaptations show that BiEAR emphasises informative frequency bands over time. These findings suggest that adaptive, biologically inspired binaural front-ends can improve machine hearing robustness in complex acoustic scenes.

2026-06-05T00:45:28Z Accepted to INTERSPEECH 2026 Hanyu Meng Eliathamby Ambikairajah Vidhyasaharan Sethu Qiquan Zhang Haizhou Li http://arxiv.org/abs/2606.06615v1 FIGMA: Towards FIne-Grained Music retrievAl 2026-06-04T18:05:39Z

Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. Then, we propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.

2026-06-04T18:05:39Z Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/ Nishit Anand Ashish Seth Sreyan Ghosh Dinesh Manocha Ramani Duraiswami http://arxiv.org/abs/2606.06444v1 USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding 2026-06-04T17:42:05Z

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

2026-06-04T17:42:05Z Accepted to Interspeech 2026 Heng-Jui Chang Alexander H. Liu Saurabhchand Bhati Mrudula Athi Anton Ratnarajah Amit Chhetri James Glass http://arxiv.org/abs/2606.06357v1 F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation 2026-06-04T16:25:07Z

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets

2026-06-04T16:25:07Z Technical report; early work; 9 pages, 2 figures, 5 tables Dinghao Zhou Xingchen Song Di Wu Pengyu Cheng Shengfan Shen Sixiang Lv http://arxiv.org/abs/2606.06211v1 FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition 2026-06-04T14:20:11Z

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.

2026-06-04T14:20:11Z Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop Fernando López Santosh Kesiraju Jordi Luque http://arxiv.org/abs/2606.06200v1 Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition 2026-06-04T14:05:38Z

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.

2026-06-04T14:05:38Z Accepted to Interspeech 2026 Jinyi Mi Ding Ma Tomoki Toda http://arxiv.org/abs/2606.06183v1 Revisiting Lexicon Evaluation in Unsupervised Word Discovery 2026-06-04T13:55:09Z

Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.

2026-06-04T13:55:09Z 6 figures Simon Malan Danel Slabbert Herman Kamper http://arxiv.org/abs/2603.17837v5 The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning 2026-06-04T13:47:38Z

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

2026-03-18T15:30:29Z Accepted by ICML 2026 Donghang Wu Tianyu Zhang Yuxin Li Hexin Liu Chen Chen Eng Siong Chng Yoshua Bengio http://arxiv.org/abs/2606.06559v1 IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems 2026-06-04T12:39:44Z

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.

2026-06-04T12:39:44Z Tao Zhong Jiajun Deng Nikita Kuzmin Yinke Zhu Tianxiang Cao Tristan Tsoi Zhili Tan Simon Lui Xunying Liu http://arxiv.org/abs/2602.22417v2 Absorbing Discrete Diffusion for Speech Enhancement 2026-06-04T10:49:29Z

Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.

2026-02-25T21:22:08Z Accepted at Interspeech 2026 Philippe Gonzalez http://arxiv.org/abs/2606.05931v1 To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection 2026-06-04T09:33:58Z

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).

2026-06-04T09:33:58Z INTERSPEECH 2026 Erfan Loweimi Mengjie Qian Kate Knill Guanfeng Wu Chi-Ho Chan Abbas Haider Muhammad Awan Josef Kittler Hui Wang Mark Gales http://arxiv.org/abs/2606.05911v1 DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement 2026-06-04T09:16:26Z

Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, the high computational complexity and high energy consumption hinder their deployment in practical front-end processing tasks.} Currently, the spiking neural networks (SNNs) have shown potential in reducing power consumption. However, the discrete binary activation and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge therefore focuses on how to maintain performance and reduce computational complexity. To address this issue, this work propose a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: A dual-branch network integrating ANN and SNN was designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; The BandSplit and Time-Frequency (TF) -Mamba modules were developed to simultaneously compress energy consumption and enhance model performance; Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components were implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: An Interaction module was designed to promote information exchange at various stages of the dual-branch network; A TF-Cross Attention-Fusion module was designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5 fold reduction in computational complexity compared to baseline models.

2026-06-04T09:16:26Z This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI) IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI2026) Cunhang Fan Enrui Liu Jing Zhou Jian Kang Jie Li Andong Li Jian Zhou Zhao Lv Xuelong Li 10.1109/TPAMI.2026.3698087