https://arxiv.org/api/qb86KkbGkW6+lcv/aNYWo1eo4Es 2026-03-24T10:12:00Z 20243 45 15 http://arxiv.org/abs/2603.18678v1 Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models 2026-03-19T09:39:52Z Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Within pun research, audio plays a central role in human communication except text and images, while datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence. 2026-03-19T09:39:52Z The paper is currently under review Yuchen Su Shaoxin Zhong Yonghua Zhu Ruofan Wang Zijian Huang Qiqi Wang Na Zhao Diana Benavides-Prado Michael Witbrock http://arxiv.org/abs/2603.18612v1 DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units 2026-03-19T08:31:58Z We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages. 2026-03-19T08:31:58Z 6 pages, 2 figures. Submitted to Interspeech 2026 Maxime Poli Manel Khentout Angelo Ortiz Tandazo Ewan Dunbar Emmanuel Chemla Emmanuel Dupoux http://arxiv.org/abs/2510.08581v2 Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions 2026-03-19T02:17:03Z Hallucinations in multimodal models have been extensively studied using benchmarks that probe reliability in image-text query settings. However, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice interfaces. In this paper, we introduce a systematic pipeline that converts existing multimodal hallucination benchmarks into spoken-query versions while preserving the original tasks and labels. We instantiate this pipeline on RePOPE and release RePOPE-Spk, where all queries are provided as spoken audio under diverse input conditions. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3-6% with clean speech and by up to 30% under environmental noise. Furthermore, many-shot prompting and chain-of-thought reasoning provide only partial mitigation. Our findings motivate new directions for building reliable voice interface systems and evaluations. 2025-09-19T07:18:45Z Submitted to Interspeech2026 Hansol Park Hoseong Ahn Junwon Moon Yejin Lee Kyuhong Shim http://arxiv.org/abs/2509.13093v4 GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR 2026-03-19T01:00:12Z End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR. 2025-09-16T13:51:44Z This paper has been submitted to Interspeech 2026 for review Yujie Guo Jiaming Zhou Yuhang Jia Shiwan Zhao Yong Qin http://arxiv.org/abs/2603.18359v1 Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information 2026-03-18T23:55:36Z Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute-accent-and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability. 2026-03-18T23:55:36Z Shih-Heng Wang Tiantian Feng Aditya Kommineni Thanathai Lertpetchpun Bowen Yi Xuan Shi Shrikanth Narayanan http://arxiv.org/abs/2603.18299v1 ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis 2026-03-18T21:38:00Z Intracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording sessions. In realistic deployment, however, models must generalize to new sessions without labeled data, and performance often degrades due to cross-session nonstationarities (e.g., electrode shifts, neural turnover, and changes in user strategy). In this paper, we propose ALIGN, a session-invariant learning framework based on multi-domain adversarial neural networks for semi-supervised cross-session adaptation. ALIGN trains a feature encoder jointly with a phoneme classifier and a domain classifier operating on the latent representation. Through adversarial optimization, the encoder is encouraged to preserve task-relevant information while suppressing session-specific cues. We evaluate ALIGN on intracortical speech decoding and find that it generalizes consistently better to previously unseen sessions, improving both phoneme error rate and word error rate relative to baselines. These results indicate that adversarial domain alignment is an effective approach for mitigating session-level distribution shift and enabling robust longitudinal BCI decoding. 2026-03-18T21:38:00Z Zhanqi Zhang Shun Li Bernardo L. Sabatini Mikio Aoi Gal Mishne http://arxiv.org/abs/2603.17769v1 Modeling Overlapped Speech with Shuffles 2026-03-18T14:28:58Z We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall. 2026-03-18T14:28:58Z Matthew Wiesner Samuele Cornell Alexander Polok Lucas Ondel Yang Lukáš Burget Sanjeev Khudanpur http://arxiv.org/abs/2603.15352v2 NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation 2026-03-18T13:16:23Z While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework. 2026-03-16T14:35:52Z Submit to Interspeech 2026 Qinke Ni Huan Liao Dekun Chen Yuxiang Wang Zhizheng Wu http://arxiv.org/abs/2603.13780v2 Integrated Spoofing-Robust Automatic Speaker Verification via a Three-Class Formulation and LLR 2026-03-18T12:58:51Z Spoofing-robust automatic speaker verification (SASV) aims to integrate automatic speaker verification (ASV) and countermeasure (CM). A popular solution is fusion of independent ASV and CM scores. To better modeling SASV, some frameworks integrate ASV and CM within a single network. However, these solutions are typically bi-encoder based, offer limited interpretability, and cannot be readily adapted to new evaluation parameters without retraining. Based on this, we propose a unified end-to-end framework via a three-class formulation that enables log-likelihood ratio (LLR) inference from class logits for a more interpretable decision pipeline. Experiments show comparable performance to existing methods on ASVSpoof5 and better results on SpoofCeleb. The visualization and analysis also prove that the three-class reformulation provides more interpretability. 2026-03-14T06:20:10Z Submitted to Interspeech 2026; put on arxiv based on requirement from Interspeech: "Interspeech no longer enforces an anonymity period for submissions." and "For authors that prefer to upload their paper online, a note indicating that the paper was submitted for review to Interspeech should be included in the posting." Kai Tan Lin Zhang Ruiteng Zhang Johan Rohdin Leibny Paola García-Perera Zexin Cai Sanjeev Khudanpur Matthew Wiesner Nicholas Andrews http://arxiv.org/abs/2509.16760v2 Feature Selection via Graph Topology Inference for Soundscape Emotion Recognition 2026-03-18T12:50:04Z Research on soundscapes has shifted the focus of environmental acoustics from noise levels to the perception of sounds, incorporating contextual factors. Soundscape emotion recognition (SER) models perception using a set of features, with arousal and valence commonly regarded as sufficient descriptors of affect. In this work, we blend \emph{graph learning} techniques with a novel \emph{information criterion} to develop a feature selection framework for SER. Specifically, we estimate a sparse graph representation of feature relations using linear structural equation models (SEM) tailored to the widely used Emo-Soundscapes dataset. The resulting graph captures the relations between input features and the two emotional outputs. To determine the appropriate level of sparsity, we propose a novel \emph{generalized elbow detector}, which provides both a point estimate and an uncertainty interval. We conduct an extensive evaluation of our methods, including visualizations of the inferred relations. While several of our findings align with previous studies, the graph representation also reveals a strong connection between arousal and valence, challenging common SER assumptions. 2025-09-20T17:52:20Z Samuel Rey Luca Martino Roberto San Millan Eduardo Morgado http://arxiv.org/abs/2603.18103v1 STEP: Detecting Audio Backdoor Attacks via Stability-based Trigger Exposure Profiling 2026-03-18T12:14:14Z With the widespread deployment of deep-learning-based speech models in security-critical applications, backdoor attacks have emerged as a serious threat: an adversary who poisons a small fraction of training data can implant a hidden trigger that controls the model's output while preserving normal behavior on clean inputs. Existing inference-time defenses are not well suited to the audio domain, as they either rely on trigger over-robustness assumptions that fail on transformation-based and semantic triggers, or depend on properties specific to image or text modalities. In this paper, we propose STEP (Stability-based Trigger Exposure Profiling), a black-box, retraining-free backdoor detector that operates under hard-label-only access. Its core idea is to exploit a characteristic dual anomaly of backdoor triggers: anomalous label stability under semantic-breaking perturbations, and anomalous label fragility under semantic-preserving perturbations. STEP profiles each test sample with two complementary perturbation branches that target these two properties respectively, scores the resulting stability features with one-class anomaly detectors trained on benign references, and fuses the two scores via unsupervised weighting. Extensive experiments across seven backdoor attacks show that STEP achieves an average AUROC of 97.92% and EER of 4.54%, substantially outperforming state-of-the-art baselines, and generalizes across model architectures, speech tasks, an open-set verification scenario, and over-the-air physical-world settings. 2026-03-18T12:14:14Z Kun Wang Meng Chen Junhao Wang Yuli Wu Li Lu Chong Zhang Peng Cheng Jiaheng Zhang Kui Ren http://arxiv.org/abs/2603.18082v1 EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities 2026-03-18T07:55:24Z TTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to missing visual data, neglecting the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically.Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP. 2026-03-18T07:55:24Z Xinyuan Qian Xinjia Zhu Alessio Brutti Dong Liang 10.1145/3797029 http://arxiv.org/abs/2603.05413v2 Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial 2026-03-17T17:39:54Z We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While end-to-end speech-to-speech models may ultimately provide the best latency for voice agents, fully self-hosted end-to-end solutions are not yet available. We evaluate the closest candidate, Qwen3-Omni, across three configurations: its cloud-only DashScope Realtime API achieves $\sim$702ms audio-to-audio latency with streaming, but is not self-hostable; its local vLLM deployment supports only the Thinker (text generation from audio, 516ms), not the Talker (audio synthesis); and its local Transformers deployment runs the full pipeline but at $\sim$146s -- far too slow for realtime. The cascaded streaming pipeline (STT $\rightarrow$ LLM $\rightarrow$ TTS) therefore remains the practical architecture for self-hosted realtime voice agents, and the focus of this tutorial. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured time-to-first-audio of 755ms (best case 729ms) with full function calling support. We release the full codebase as a 9-chapter progressive tutorial with working, tested code for every component. 2026-03-05T17:35:59Z Jielin Qiu Zixiang Chen Liangwei Yang Ming Zhu Zhiwei Liu Juntao Tan Wenting Zhao Rithesh Murthy Roshan Ram Akshara Prabhakar Shelby Heinecke Caiming Xiong Silvio Savarese Huan Wang http://arxiv.org/abs/2603.16805v1 Making Separation-First Multi-Stream Audio Watermarking Feasible via Joint Training 2026-03-17T17:09:45Z Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework-embedding distinct information into stems using unique keys but a shared structure, mixing, separating, and decoding from each output. A naive pipeline (robust watermarking + off-the-shelf separation) yields poor bit recovery, showing robustness to generic distortions does not ensure robustness to separation artifacts. To enable this, we jointly train the watermark system and the separator in an end-to-end manner, encouraging the separator to preserve watermark cues while adapting embedding to separation-specific distortions. Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality. 2026-03-17T17:09:45Z Houmin Sun Zi Hu Linxi Li Yechen Wang Liwei Jin Ming Li http://arxiv.org/abs/2603.16713v1 Evaluating Latent Space Structure in Timbre VAEs: A Comparative Study of Unsupervised, Descriptor-Conditioned, and Perceptual Feature-Conditioned Models 2026-03-17T16:03:07Z We present a comparative evaluation of latent space organization in three Variational Autoencoders (VAEs) for musical timbre generation: an unsupervised VAE, a descriptor-conditioned VAE, and a VAE conditioned on continuous perceptual features from the AudioCommons timbral models. Using a curated dataset of electric guitar sounds labeled with 19 semantic descriptors across four intensity levels, we assess each model's latent structure with a suite of clustering and interpretability metrics. These include silhouette scores, timbre descriptor compactness, pitch-conditional separation, trajectory linearity, and cross-pitch consistency. Our findings show that conditioning on perceptual features yields a more compact, discriminative, and pitch-invariant latent space, outperforming both the unsupervised and discrete descriptor-conditioned models. This work highlights the limitations of one-hot semantic conditioning and provides methodological tools for evaluating timbre latent spaces, contributing to the development of more controllable and interpretable generative audio models. 2026-03-17T16:03:07Z 5 pages, 1 figure, 1 table Joseph Cameron Alan Blackwell