https://arxiv.org/api/E4bGOdX5U6c8CfRvPmq4ez+ohVI2026-06-18T10:54:00Z2175522515http://arxiv.org/abs/2606.06907v1SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models2026-06-05T04:50:34ZLarge audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.2026-06-05T04:50:34Z5 pages, 5 figuresSeonuk KimYonghyeon JunJu Yeon KangJimin HongYoonhyeong LeeNam Soo Kimhttp://arxiv.org/abs/2606.06837v1SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails2026-06-05T02:24:19ZScripted vs spontaneous speech detection is appealing for interview guardrails, but benchmark performance can be inflated by shortcuts tied to corpus identity, channel conditions, and recording artifacts rather than speaking style itself. We present SEAM, a shortcut-aware framework for real-time scriptedness detection that combines uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone. With 8s windows, the model achieves 0.971 +- 0.004 ROC-AUC on an external interview-domain evaluation set. Removing the shortcut-prevention components improves internal held-out metrics but sharply reduces external performance, indicating shortcut learning. Post-training quantization reduces the model footprint to 41.8MB with little loss in external performance. The results demonstrate that robust real-time scriptedness detection depends not only on the backbone, but on shortcut-aware data design and evaluation. We release code and model checkpoints.2026-06-05T02:24:19ZAccepted to Interspeech 2026 VsevolodV. KovalevPranay Manochahttp://arxiv.org/abs/2606.06806v1Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference2026-06-05T01:14:12ZDiscrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.2026-06-05T01:14:12ZAccepted to Interspeech2026Kentaro OndaSatoru FukayamaDaisuke SaitoNobuaki Minematsuhttp://arxiv.org/abs/2606.06795v1BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation2026-06-05T00:45:28ZWe present BiEAR, a human auditory-inspired adaptive binaural front-end for multi-speaker localisation and distance estimation. Inspired by medial olivocochlear (MOC) feedback in human hearing, BiEAR uses a neural controller to adaptively adjust the frequency selectivity of a binaural auditory filterbank during inference. This yields time-frequency adaptive representations for ears, enabling the model to respond to changing acoustic conditions. We evaluate BiEAR on multi-speaker localisation and distance estimation in anechoic and real-room environments. Results show that the adaptive front-end improves localisation accuracy and robustness to unseen speakers and rooms compared with commonly used fixed binaural front-ends. Visualisation and analysis of learned filter adaptations show that BiEAR emphasises informative frequency bands over time. These findings suggest that adaptive, biologically inspired binaural front-ends can improve machine hearing robustness in complex acoustic scenes.2026-06-05T00:45:28ZAccepted to INTERSPEECH 2026Hanyu MengEliathamby AmbikairajahVidhyasaharan SethuQiquan ZhangHaizhou Lihttp://arxiv.org/abs/2606.06615v1FIGMA: Towards FIne-Grained Music retrievAl2026-06-04T18:05:39ZRetrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. Then, we propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.2026-06-04T18:05:39ZAccepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/Nishit AnandAshish SethSreyan GhoshDinesh ManochaRamani Duraiswamihttp://arxiv.org/abs/2606.06444v1USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding2026-06-04T17:42:05ZAudio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.2026-06-04T17:42:05ZAccepted to Interspeech 2026Heng-Jui ChangAlexander H. LiuSaurabhchand BhatiMrudula AthiAnton RatnarajahAmit ChhetriJames Glasshttp://arxiv.org/abs/2606.06357v1F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation2026-06-04T16:25:07ZContinuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets2026-06-04T16:25:07ZTechnical report; early work; 9 pages, 2 figures, 5 tablesDinghao ZhouXingchen SongDi WuPengyu ChengShengfan ShenSixiang Lvhttp://arxiv.org/abs/2606.06211v1FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition2026-06-04T14:20:11ZAutomatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.2026-06-04T14:20:11ZAccepted in Odyssey 2026: The Speaker and Language Recognition WorkshopFernando LópezSantosh KesirajuJordi Luquehttp://arxiv.org/abs/2606.06200v1Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition2026-06-04T14:05:38ZZero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.2026-06-04T14:05:38ZAccepted to Interspeech 2026Jinyi MiDing MaTomoki Todahttp://arxiv.org/abs/2606.06183v1Revisiting Lexicon Evaluation in Unsupervised Word Discovery2026-06-04T13:55:09ZBuilding a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.2026-06-04T13:55:09Z6 figuresSimon MalanDanel SlabbertHerman Kamperhttp://arxiv.org/abs/2603.17837v5The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning2026-06-04T13:47:38ZDuring conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.2026-03-18T15:30:29ZAccepted by ICML 2026Donghang WuTianyu ZhangYuxin LiHexin LiuChen ChenEng Siong ChngYoshua Bengiohttp://arxiv.org/abs/2606.06559v1IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems2026-06-04T12:39:44ZFull-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.2026-06-04T12:39:44ZTao ZhongJiajun DengNikita KuzminYinke ZhuTianxiang CaoTristan TsoiZhili TanSimon LuiXunying Liuhttp://arxiv.org/abs/2602.22417v2Absorbing Discrete Diffusion for Speech Enhancement2026-06-04T10:49:29ZInspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.2026-02-25T21:22:08ZAccepted at Interspeech 2026Philippe Gonzalezhttp://arxiv.org/abs/2606.05931v1To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection2026-06-04T09:33:58ZWhen retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).2026-06-04T09:33:58ZINTERSPEECH 2026Erfan LoweimiMengjie QianKate KnillGuanfeng WuChi-Ho ChanAbbas HaiderMuhammad AwanJosef KittlerHui WangMark Galeshttp://arxiv.org/abs/2606.05911v1DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement2026-06-04T09:16:26ZAlthough artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, the high computational complexity and high energy consumption hinder their deployment in practical front-end processing tasks.} Currently, the spiking neural networks (SNNs) have shown potential in reducing power consumption. However, the discrete binary activation and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge therefore focuses on how to maintain performance and reduce computational complexity. To address this issue, this work propose a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: A dual-branch network integrating ANN and SNN was designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; The BandSplit and Time-Frequency (TF) -Mamba modules were developed to simultaneously compress energy consumption and enhance model performance; Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components were implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: An Interaction module was designed to promote information exchange at various stages of the dual-branch network; A TF-Cross Attention-Fusion module was designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5 fold reduction in computational complexity compared to baseline models.2026-06-04T09:16:26ZThis article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI)IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI2026)Cunhang FanEnrui LiuJing ZhouJian KangJie LiAndong LiJian ZhouZhao LvXuelong Li10.1109/TPAMI.2026.3698087