https://arxiv.org/api/SkSd2UUvENg9SMknGYE8Sen6XUI 2026-03-24T08:39:36Z 20243 30 15 http://arxiv.org/abs/2603.19697v1 Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction 2026-03-20T07:05:29Z The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling the separation and target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can act as a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to the original backbones. Audio samples are available at: https://plugandsteer.github.io 2026-03-20T07:05:29Z Submitted to Interspeech 2026; demo available https://plugandsteer.github.io Doyeop Kwak Suyeon Lee Joon Son Chung http://arxiv.org/abs/2603.19615v1 CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation 2026-03-20T03:46:18Z While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score. 2026-03-20T03:46:18Z A condensed version of this work has been submitted to Interspeech 2026. Section 10 is an extended analysis added in this version Insung Lee Taeyoung Jeong Haejun Yoo Du-Seong Chang Myoung-Wan Koo http://arxiv.org/abs/2509.24773v4 VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning 2026-03-20T03:36:49Z Video-conditioned audio generation, including Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been treated as distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solve both problems. To effectively handle multiple input signals within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism leveraging distinct intrinsic properties of attention layers: cross-attention for semantic conditions, and self-attention for temporally-intensive conditions. Besides, contrary to the prevailing belief that joint training for the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance during end-to-end joint learning process. Furthermore, we use a straightforward feature-level data synthesis method, demonstrating that our framework provides a robust foundation that easily adapts to joint sound and speech generation using synthetic data. Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative models. Project page: https://vasflow1.github.io/vasflow/ 2025-09-29T13:38:24Z Paper Under Review Xin Cheng Yuyue Wang Xihua Wang Yihan Wu Kaisi Guan Yijing Chen Peng Zhang Xiaojiang Liu Meng Cao Ruihua Song http://arxiv.org/abs/2603.18048v2 DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models 2026-03-20T01:17:57Z Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding. 2026-03-17T15:52:26Z 14 pages,6 figures Jiaqi Xiong Yunjia Qi Qi Cao Yu Zheng Yutong Zhang Ziteng Wang Ruofan Liao Weisheng Xu Sichen Liu http://arxiv.org/abs/2603.19468v1 Listen First, Then Answer: Timestamp-Grounded Speech Reasoning 2026-03-19T21:01:04Z Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning. 2026-03-19T21:01:04Z Submitted to Interspeech 2026 Jihoon Jeong Pooneh Mousavi Mirco Ravanelli Cem Subakan http://arxiv.org/abs/2603.19195v1 How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation 2026-03-19T17:50:07Z Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research. 2026-03-19T17:50:07Z Project website: https://kehanlu.github.io/AKB Ke-Han Lu Szu-Wei Fu Chao-Han Huck Yang Zhehuai Chen Sung-Feng Huang Chih-Kai Yang Yi-Cheng Lin Chi-Yuan Hsiao Wenze Ren En-Pei Hu Yu-Han Huang An-Yu Cheng Cheng-Han Chiang Yu Tsao Yu-Chiang Frank Wang Hung-yi Lee http://arxiv.org/abs/2507.02768v2 DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment 2026-03-19T17:35:34Z We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs have often suffered from the catastrophic forgetting of the LLM's original abilities. Therefore, balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets, named DeSTA. This approach aims at preserving the LLM's native language proficiency thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs. 2025-07-03T16:28:25Z Published in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio Ke-Han Lu Zhehuai Chen Szu-Wei Fu Chao-Han Huck Yang Sung-Feng Huang Chih-Kai Yang Chee-En Yu Chun-Wei Chen Wei-Chih Chen Chien-yu Huang Yi-Cheng Lin Yu-Xiang Lin Chi-An Fu Chun-Yi Kuan Wenze Ren Xuanjun Chen Wei-Ping Huang En-Pei Hu Tzu-Quan Lin Yuan-Kuei Wu Kuan-Po Huang Hsiao-Ying Huang Huang-Cheng Chou Kai-Wei Chang Cheng-Han Chiang Boris Ginsburg Yu-Chiang Frank Wang Hung-yi Lee http://arxiv.org/abs/2603.19176v1 Few-shot Acoustic Synthesis with Multimodal Flow Matching 2026-03-19T17:32:06Z Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis. 2026-03-19T17:32:06Z To appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/ Amandine Brunetto http://arxiv.org/abs/2603.11715v2 Affect Decoding in Phonated and Silent Speech Production from Surface EMG 2026-03-19T17:12:02Z The expression of affect is integral to spoken communication, yet, its link to underlying articulatory execution remains unclear. Measures of articulatory muscle activity such as EMG could reveal how speech production is modulated by emotion alongside acoustic speech analyses. We investigate affect decoding from facial and neck surface electromyography (sEMG) during phonated and silent speech production. For this purpose, we introduce a dataset comprising 2,780 utterances from 12 participants across 3 tasks, on which we evaluate both intra- and inter-subject decoding using a range of features and model embeddings. Our results reveal that EMG representations reliably discriminate frustration with up to 0.845 AUC, and generalize well across articulation modes. Our ablation study further demonstrates that affective signatures are embedded in facial motor activity and persist in the absence of phonation, highlighting the potential of EMG sensing for affect-aware silent speech interfaces. 2026-03-12T09:22:02Z Simon Pistrosch Kleanthis Avramidis Zhao Ren Tiantian Feng Jihwan Lee Monica Gonzalez-Machorro Anton Batliner Tanja Schultz Shrikanth Narayanan Björn W. Schuller http://arxiv.org/abs/2603.11360v2 Fair-Gate: Fairness-Aware Interpretable Risk Gating for Sex-Fair Voice Biometrics 2026-03-19T16:02:32Z Voice biometric systems can exhibit sex-related performance gaps even when overall verification accuracy is strong. We attribute these gaps to two practical mechanisms: (i) demographic shortcut learning, where speaker classification training exploits spurious correlations between sex and speaker identity, and (ii) feature entanglement, where sex-linked acoustic variation overlaps with identity cues and cannot be removed without degrading speaker discrimination. We propose Fair-Gate, a fairness-aware and interpretable risk-gating framework that addresses both mechanisms in a single pipeline. Fair-Gate applies risk extrapolation to reduce variation in speaker-classification risk across proxy sex groups, and introduces a local complementary gate that routes intermediate features into an identity branch and a sex branch. The gate provides interpretability by producing an explicit routing mask that can be inspected to understand which features are allocated to identity versus sex-related pathways. Experiments on VoxCeleb1 show that Fair-Gate improves the utility--fairness trade-off, yielding more sex-fair ASV performance under challenging evaluation conditions. 2026-03-11T22:50:15Z Yangyang Qu Todisco Massimiliano Galdi Chiara Evans Nicholas http://arxiv.org/abs/2603.20307v1 EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control 2026-03-19T15:15:17Z Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility. 2026-03-19T15:15:17Z Yuzhe Weng Haotian Wang Yuanhong Yu Jun Du Shan He Xiaoyan Wu Haoran Xu http://arxiv.org/abs/2603.17558v2 Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition 2026-03-19T13:47:01Z Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the cross-lingual beneficial knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework's reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility. 2026-03-18T10:04:50Z 13 pages, 8 figures Yuxiang Mei Delai Qiu Shengping Liu Jiaen Liang Yanhua Long http://arxiv.org/abs/2510.18391v2 MPDR Beamforming for Almost-Cyclostationary Processes 2026-03-19T13:11:44Z Conventional acoustic beamformers typically assume short-time stationarity and process frequency bins independently, ignoring inter-frequency correlations. This is suboptimal for almost-periodic noise sources such as engines, fans, and musical instruments: these signals are better modeled as (almost) cyclostationary (ACS) processes with statistically correlated spectral components. This paper introduces the cyclic minimum power distortionless response (cMPDR) beamformer, which extends the conventional MPDR to jointly exploit spatial and spectral correlations. Building on frequency-shifted (FRESH) filtering, it suppresses noise components that are coherent across harmonically related frequencies, reducing residual noise beyond what spatial filtering alone achieves. To address inharmonicity, where partials deviate from exact integer multiples of a fundamental frequency, we estimate resonant frequencies from a periodogram and derive frequency shifts from their pairwise spacing. Theoretical analysis yields closed-form expressions for residual noise and proves that output power decreases monotonically with the number of cyclic components. Experiments on synthetic harmonic noise and real UAV motor recordings confirm these findings: in low-SNR scenarios, the cMPDR achieves up to 5dB improvement in SI-SDR over the MPDR, yields consistent STOI gains, and remains effective with a single microphone. When spectral correlation is absent, the method reduces to conventional MPDR and does not degrade performance. These results suggest that cyclic processing is a viable direction for acoustic noise reduction that deserves further investigation. Code is available at https://github.com/Screeen/cMPDR. 2025-10-21T08:12:42Z This work has been submitted to the IEEE for possible publication Giovanni Bologni Martin Bo Møller Richard Heusdens Richard C. Hendriks http://arxiv.org/abs/2603.18758v1 Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning 2026-03-19T11:09:58Z This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R2 = 0.85 for affective engagement and R2 = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information. 2026-03-19T11:09:58Z Preprint. Accepted for publication in IEEE Transactions on Computational Social Systems IEEE Transactions on Computational Social Systems, 2026 Hung-Yue Suen Kuo-En Hung Fan-Hsun Tseng 10.1109/TCSS.2026.3675249 http://arxiv.org/abs/2603.20301v1 Voice Privacy from an Attribute-based Perspective 2026-03-19T09:53:30Z Voice privacy approaches that preserve the anonymity of speakers modify speech in an attempt to break the link with the true identity of the speaker. Current benchmarks measure speaker protection based on signal-to-signal comparisons. In this paper, we introduce an attribute-based perspective, where we measure privacy protection in terms of comparisons between sets of speaker attributes. First, we analyze privacy impact by calculating speaker uniqueness for ground truth attributes, attributes inferred on the original speech, and attributes inferred on speech protected with standard anonymization. Next, we examine a threat scenario involving only a single utterance per speaker and calculate attack error rates. Overall, we observe that inferred attributes still present a risk despite attribute inference errors. Our research points to the importance of considering both attribute-related threats and protection mechanisms in future voice privacy research. 2026-03-19T09:53:30Z Submitted to InterSpeech 2026 Mehtab Ur Rahman Martha Larson Cristian Tejedor García