https://arxiv.org/api/WYMRCjG3gme2Rys9Y+pdpikOhCE2026-03-22T08:45:43Z21133015http://arxiv.org/abs/2603.19195v1How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation2026-03-19T17:50:07ZLarge language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.2026-03-19T17:50:07ZProject website: https://kehanlu.github.io/AKBKe-Han LuSzu-Wei FuChao-Han Huck YangZhehuai ChenSung-Feng HuangChih-Kai YangYi-Cheng LinChi-Yuan HsiaoWenze RenEn-Pei HuYu-Han HuangAn-Yu ChengCheng-Han ChiangYu TsaoYu-Chiang Frank WangHung-yi Leehttp://arxiv.org/abs/2507.02768v2DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment2026-03-19T17:35:34ZWe introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs have often suffered from the catastrophic forgetting of the LLM's original abilities. Therefore, balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets, named DeSTA. This approach aims at preserving the LLM's native language proficiency thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.2025-07-03T16:28:25ZPublished in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Model and code available at: https://github.com/kehanlu/DeSTA2.5-AudioKe-Han LuZhehuai ChenSzu-Wei FuChao-Han Huck YangSung-Feng HuangChih-Kai YangChee-En YuChun-Wei ChenWei-Chih ChenChien-yu HuangYi-Cheng LinYu-Xiang LinChi-An FuChun-Yi KuanWenze RenXuanjun ChenWei-Ping HuangEn-Pei HuTzu-Quan LinYuan-Kuei WuKuan-Po HuangHsiao-Ying HuangHuang-Cheng ChouKai-Wei ChangCheng-Han ChiangBoris GinsburgYu-Chiang Frank WangHung-yi Leehttp://arxiv.org/abs/2603.19176v1Few-shot Acoustic Synthesis with Multimodal Flow Matching2026-03-19T17:32:06ZGenerating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.2026-03-19T17:32:06ZTo appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/Amandine Brunettohttp://arxiv.org/abs/2603.11715v2Affect Decoding in Phonated and Silent Speech Production from Surface EMG2026-03-19T17:12:02ZThe expression of affect is integral to spoken communication, yet, its link to underlying articulatory execution remains unclear. Measures of articulatory muscle activity such as EMG could reveal how speech production is modulated by emotion alongside acoustic speech analyses. We investigate affect decoding from facial and neck surface electromyography (sEMG) during phonated and silent speech production. For this purpose, we introduce a dataset comprising 2,780 utterances from 12 participants across 3 tasks, on which we evaluate both intra- and inter-subject decoding using a range of features and model embeddings. Our results reveal that EMG representations reliably discriminate frustration with up to 0.845 AUC, and generalize well across articulation modes. Our ablation study further demonstrates that affective signatures are embedded in facial motor activity and persist in the absence of phonation, highlighting the potential of EMG sensing for affect-aware silent speech interfaces.2026-03-12T09:22:02ZSimon PistroschKleanthis AvramidisZhao RenTiantian FengJihwan LeeMonica Gonzalez-MachorroAnton BatlinerTanja SchultzShrikanth NarayananBjörn W. Schullerhttp://arxiv.org/abs/2603.11360v2Fair-Gate: Fairness-Aware Interpretable Risk Gating for Sex-Fair Voice Biometrics2026-03-19T16:02:32ZVoice biometric systems can exhibit sex-related performance gaps even when overall verification accuracy is strong. We attribute these gaps to two practical mechanisms: (i) demographic shortcut learning, where speaker classification training exploits spurious correlations between sex and speaker identity, and (ii) feature entanglement, where sex-linked acoustic variation overlaps with identity cues and cannot be removed without degrading speaker discrimination. We propose Fair-Gate, a fairness-aware and interpretable risk-gating framework that addresses both mechanisms in a single pipeline. Fair-Gate applies risk extrapolation to reduce variation in speaker-classification risk across proxy sex groups, and introduces a local complementary gate that routes intermediate features into an identity branch and a sex branch. The gate provides interpretability by producing an explicit routing mask that can be inspected to understand which features are allocated to identity versus sex-related pathways. Experiments on VoxCeleb1 show that Fair-Gate improves the utility--fairness trade-off, yielding more sex-fair ASV performance under challenging evaluation conditions.2026-03-11T22:50:15ZYangyang QuTodisco MassimilianoGaldi ChiaraEvans Nicholashttp://arxiv.org/abs/2511.23098v2Group-Aware Partial Model Merging for Children's Automatic Speech Recognition2026-03-19T15:35:11ZWhile supervised fine-tuning of adult pre-trained models for children's ASR has shown promise, it often fails to capture group-specific characteristics and variations among children. To address this, we introduce GRoup-Aware PARtial model Merging, a parameter-efficient approach that combines unsupervised clustering, partial fine-tuning, and model merging. Our approach adapts adult-pre-trained models to children by first grouping the children's data based on acoustic similarity. Each group is used to partially fine-tune an adult pre-trained model, and the resulting models are merged at the parameter level. Experiments conducted on the MyST children's speech corpus indicate that GRAPAM achieves a relative WER improvement of 6%, using the same amount of data, outperforming full fine-tuning while training fewer parameters.2025-11-28T11:35:22ZSubmitted to Interspeech 2026Thomas RollandAlberto Abadhttp://arxiv.org/abs/2510.18391v2MPDR Beamforming for Almost-Cyclostationary Processes2026-03-19T13:11:44ZConventional acoustic beamformers typically assume short-time stationarity and process frequency bins independently, ignoring inter-frequency correlations. This is suboptimal for almost-periodic noise sources such as engines, fans, and musical instruments: these signals are better modeled as (almost) cyclostationary (ACS) processes with statistically correlated spectral components. This paper introduces the cyclic minimum power distortionless response (cMPDR) beamformer, which extends the conventional MPDR to jointly exploit spatial and spectral correlations. Building on frequency-shifted (FRESH) filtering, it suppresses noise components that are coherent across harmonically related frequencies, reducing residual noise beyond what spatial filtering alone achieves. To address inharmonicity, where partials deviate from exact integer multiples of a fundamental frequency, we estimate resonant frequencies from a periodogram and derive frequency shifts from their pairwise spacing. Theoretical analysis yields closed-form expressions for residual noise and proves that output power decreases monotonically with the number of cyclic components. Experiments on synthetic harmonic noise and real UAV motor recordings confirm these findings: in low-SNR scenarios, the cMPDR achieves up to 5dB improvement in SI-SDR over the MPDR, yields consistent STOI gains, and remains effective with a single microphone. When spectral correlation is absent, the method reduces to conventional MPDR and does not degrade performance. These results suggest that cyclic processing is a viable direction for acoustic noise reduction that deserves further investigation. Code is available at https://github.com/Screeen/cMPDR.2025-10-21T08:12:42ZThis work has been submitted to the IEEE for possible publicationGiovanni BologniMartin Bo MøllerRichard HeusdensRichard C. Hendrikshttp://arxiv.org/abs/2603.18612v1DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units2026-03-19T08:31:58ZWe introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.2026-03-19T08:31:58Z6 pages, 2 figures. Submitted to Interspeech 2026Maxime PoliManel KhentoutAngelo Ortiz TandazoEwan DunbarEmmanuel ChemlaEmmanuel Dupouxhttp://arxiv.org/abs/2603.18485v1ARTT: Augmented Reverberant-Target Training for Unsupervised Monaural Speech Dereverberation2026-03-19T04:42:44ZDue to the absence of clean reference signals and spatial cues, monaural unsupervised speech dereverberation is a challenging ill-posed inverse problem. To realize it, we propose augmented reverberant-target training (ARTT), which consists of two stages. In the first stage, reverberant-target training (RTT) is proposed to first further reverberate the observed reverberant mixture signal, and then train a deep neural network (DNN) to recover the observed reverberant mixture via discriminative training. Although the target signal to fit is reverberant, we find that the resulting DNN can effectively reduce reverberation. In the second stage, an online self-distillation mechanism based on the mean-teacher algorithm is proposed to further improve dereverberation. Evaluation results demonstrate that ARTT achieves strong unsupervised dereverberation performance, significantly outperforming previous baselines.2026-03-19T04:42:44Zin submissionSiqi SongFulin WuZhong-Qiu Wanghttp://arxiv.org/abs/2509.22363v3Investigating Faithfulness in Large Audio Language Models2026-03-19T03:27:18ZLarge Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness. Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.2025-09-26T13:58:22ZPooneh MousaviLovenya JainMirco RavanelliCem Subakanhttp://arxiv.org/abs/2510.08581v2Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions2026-03-19T02:17:03ZHallucinations in multimodal models have been extensively studied using benchmarks that probe reliability in image-text query settings. However, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice interfaces. In this paper, we introduce a systematic pipeline that converts existing multimodal hallucination benchmarks into spoken-query versions while preserving the original tasks and labels. We instantiate this pipeline on RePOPE and release RePOPE-Spk, where all queries are provided as spoken audio under diverse input conditions. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3-6% with clean speech and by up to 30% under environmental noise. Furthermore, many-shot prompting and chain-of-thought reasoning provide only partial mitigation. Our findings motivate new directions for building reliable voice interface systems and evaluations.2025-09-19T07:18:45ZSubmitted to Interspeech2026Hansol ParkHoseong AhnJunwon MoonYejin LeeKyuhong Shimhttp://arxiv.org/abs/2603.17837v1The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning2026-03-18T15:30:29ZDuring conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.2026-03-18T15:30:29ZDonghang WuTianyu ZhangYuxin LiHexin LiuChen ChenEng Siong ChngYoshua Bengiohttp://arxiv.org/abs/2603.17822v1Multi-Source Evidence Fusion for Audio Question Answering2026-03-18T15:12:42ZLarge audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin in challenge's reasoning quality metric.2026-03-18T15:12:42ZAivo OlevTanel Alumäehttp://arxiv.org/abs/2603.17769v1Modeling Overlapped Speech with Shuffles2026-03-18T14:28:58ZWe propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.2026-03-18T14:28:58ZMatthew WiesnerSamuele CornellAlexander PolokLucas Ondel YangLukáš BurgetSanjeev Khudanpurhttp://arxiv.org/abs/2603.15352v2NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation2026-03-18T13:16:23ZWhile recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.2026-03-16T14:35:52ZSubmit to Interspeech 2026Qinke NiHuan LiaoDekun ChenYuxiang WangZhizheng Wu