https://arxiv.org/api/qb86KkbGkW6+lcv/aNYWo1eo4Es2026-03-24T10:12:00Z202434515http://arxiv.org/abs/2603.18678v1Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models2026-03-19T09:39:52ZPuns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Within pun research, audio plays a central role in human communication except text and images, while datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.2026-03-19T09:39:52ZThe paper is currently under reviewYuchen SuShaoxin ZhongYonghua ZhuRuofan WangZijian HuangQiqi WangNa ZhaoDiana Benavides-PradoMichael Witbrockhttp://arxiv.org/abs/2603.18612v1DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units2026-03-19T08:31:58ZWe introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.2026-03-19T08:31:58Z6 pages, 2 figures. Submitted to Interspeech 2026Maxime PoliManel KhentoutAngelo Ortiz TandazoEwan DunbarEmmanuel ChemlaEmmanuel Dupouxhttp://arxiv.org/abs/2510.08581v2Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions2026-03-19T02:17:03ZHallucinations in multimodal models have been extensively studied using benchmarks that probe reliability in image-text query settings. However, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice interfaces. In this paper, we introduce a systematic pipeline that converts existing multimodal hallucination benchmarks into spoken-query versions while preserving the original tasks and labels. We instantiate this pipeline on RePOPE and release RePOPE-Spk, where all queries are provided as spoken audio under diverse input conditions. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3-6% with clean speech and by up to 30% under environmental noise. Furthermore, many-shot prompting and chain-of-thought reasoning provide only partial mitigation. Our findings motivate new directions for building reliable voice interface systems and evaluations.2025-09-19T07:18:45ZSubmitted to Interspeech2026Hansol ParkHoseong AhnJunwon MoonYejin LeeKyuhong Shimhttp://arxiv.org/abs/2509.13093v4GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR2026-03-19T01:00:12ZEnd-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.2025-09-16T13:51:44ZThis paper has been submitted to Interspeech 2026 for reviewYujie GuoJiaming ZhouYuhang JiaShiwan ZhaoYong Qinhttp://arxiv.org/abs/2603.18359v1Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information2026-03-18T23:55:36ZNeural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute-accent-and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability.2026-03-18T23:55:36ZShih-Heng WangTiantian FengAditya KommineniThanathai LertpetchpunBowen YiXuan ShiShrikanth Narayananhttp://arxiv.org/abs/2603.18299v1ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis2026-03-18T21:38:00ZIntracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording sessions. In realistic deployment, however, models must generalize to new sessions without labeled data, and performance often degrades due to cross-session nonstationarities (e.g., electrode shifts, neural turnover, and changes in user strategy). In this paper, we propose ALIGN, a session-invariant learning framework based on multi-domain adversarial neural networks for semi-supervised cross-session adaptation. ALIGN trains a feature encoder jointly with a phoneme classifier and a domain classifier operating on the latent representation. Through adversarial optimization, the encoder is encouraged to preserve task-relevant information while suppressing session-specific cues. We evaluate ALIGN on intracortical speech decoding and find that it generalizes consistently better to previously unseen sessions, improving both phoneme error rate and word error rate relative to baselines. These results indicate that adversarial domain alignment is an effective approach for mitigating session-level distribution shift and enabling robust longitudinal BCI decoding.2026-03-18T21:38:00ZZhanqi ZhangShun LiBernardo L. SabatiniMikio AoiGal Mishnehttp://arxiv.org/abs/2603.17769v1Modeling Overlapped Speech with Shuffles2026-03-18T14:28:58ZWe propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.2026-03-18T14:28:58ZMatthew WiesnerSamuele CornellAlexander PolokLucas Ondel YangLukáš BurgetSanjeev Khudanpurhttp://arxiv.org/abs/2603.15352v2NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation2026-03-18T13:16:23ZWhile recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.2026-03-16T14:35:52ZSubmit to Interspeech 2026Qinke NiHuan LiaoDekun ChenYuxiang WangZhizheng Wuhttp://arxiv.org/abs/2603.13780v2Integrated Spoofing-Robust Automatic Speaker Verification via a Three-Class Formulation and LLR2026-03-18T12:58:51ZSpoofing-robust automatic speaker verification (SASV) aims to integrate automatic speaker verification (ASV) and countermeasure (CM). A popular solution is fusion of independent ASV and CM scores. To better modeling SASV, some frameworks integrate ASV and CM within a single network. However, these solutions are typically bi-encoder based, offer limited interpretability, and cannot be readily adapted to new evaluation parameters without retraining. Based on this, we propose a unified end-to-end framework via a three-class formulation that enables log-likelihood ratio (LLR) inference from class logits for a more interpretable decision pipeline. Experiments show comparable performance to existing methods on ASVSpoof5 and better results on SpoofCeleb. The visualization and analysis also prove that the three-class reformulation provides more interpretability.2026-03-14T06:20:10ZSubmitted to Interspeech 2026; put on arxiv based on requirement from Interspeech: "Interspeech no longer enforces an anonymity period for submissions." and "For authors that prefer to upload their paper online, a note indicating that the paper was submitted for review to Interspeech should be included in the posting."Kai TanLin ZhangRuiteng ZhangJohan RohdinLeibny Paola García-PereraZexin CaiSanjeev KhudanpurMatthew WiesnerNicholas Andrewshttp://arxiv.org/abs/2509.16760v2Feature Selection via Graph Topology Inference for Soundscape Emotion Recognition2026-03-18T12:50:04ZResearch on soundscapes has shifted the focus of environmental acoustics from noise levels to the perception of sounds, incorporating contextual factors. Soundscape emotion recognition (SER) models perception using a set of features, with arousal and valence commonly regarded as sufficient descriptors of affect. In this work, we blend \emph{graph learning} techniques with a novel \emph{information criterion} to develop a feature selection framework for SER. Specifically, we estimate a sparse graph representation of feature relations using linear structural equation models (SEM) tailored to the widely used Emo-Soundscapes dataset. The resulting graph captures the relations between input features and the two emotional outputs. To determine the appropriate level of sparsity, we propose a novel \emph{generalized elbow detector}, which provides both a point estimate and an uncertainty interval. We conduct an extensive evaluation of our methods, including visualizations of the inferred relations. While several of our findings align with previous studies, the graph representation also reveals a strong connection between arousal and valence, challenging common SER assumptions.2025-09-20T17:52:20ZSamuel ReyLuca MartinoRoberto San MillanEduardo Morgadohttp://arxiv.org/abs/2603.18103v1STEP: Detecting Audio Backdoor Attacks via Stability-based Trigger Exposure Profiling2026-03-18T12:14:14ZWith the widespread deployment of deep-learning-based speech models in security-critical applications, backdoor attacks have emerged as a serious threat: an adversary who poisons a small fraction of training data can implant a hidden trigger that controls the model's output while preserving normal behavior on clean inputs. Existing inference-time defenses are not well suited to the audio domain, as they either rely on trigger over-robustness assumptions that fail on transformation-based and semantic triggers, or depend on properties specific to image or text modalities. In this paper, we propose STEP (Stability-based Trigger Exposure Profiling), a black-box, retraining-free backdoor detector that operates under hard-label-only access. Its core idea is to exploit a characteristic dual anomaly of backdoor triggers: anomalous label stability under semantic-breaking perturbations, and anomalous label fragility under semantic-preserving perturbations. STEP profiles each test sample with two complementary perturbation branches that target these two properties respectively, scores the resulting stability features with one-class anomaly detectors trained on benign references, and fuses the two scores via unsupervised weighting. Extensive experiments across seven backdoor attacks show that STEP achieves an average AUROC of 97.92% and EER of 4.54%, substantially outperforming state-of-the-art baselines, and generalizes across model architectures, speech tasks, an open-set verification scenario, and over-the-air physical-world settings.2026-03-18T12:14:14ZKun WangMeng ChenJunhao WangYuli WuLi LuChong ZhangPeng ChengJiaheng ZhangKui Renhttp://arxiv.org/abs/2603.18082v1EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities2026-03-18T07:55:24ZTTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to missing visual data, neglecting the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically.Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.2026-03-18T07:55:24ZXinyuan QianXinjia ZhuAlessio BruttiDong Liang10.1145/3797029http://arxiv.org/abs/2603.05413v2Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial2026-03-17T17:39:54ZWe present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While end-to-end speech-to-speech models may ultimately provide the best latency for voice agents, fully self-hosted end-to-end solutions are not yet available. We evaluate the closest candidate, Qwen3-Omni, across three configurations: its cloud-only DashScope Realtime API achieves $\sim$702ms audio-to-audio latency with streaming, but is not self-hostable; its local vLLM deployment supports only the Thinker (text generation from audio, 516ms), not the Talker (audio synthesis); and its local Transformers deployment runs the full pipeline but at $\sim$146s -- far too slow for realtime. The cascaded streaming pipeline (STT $\rightarrow$ LLM $\rightarrow$ TTS) therefore remains the practical architecture for self-hosted realtime voice agents, and the focus of this tutorial. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured time-to-first-audio of 755ms (best case 729ms) with full function calling support. We release the full codebase as a 9-chapter progressive tutorial with working, tested code for every component.2026-03-05T17:35:59ZJielin QiuZixiang ChenLiangwei YangMing ZhuZhiwei LiuJuntao TanWenting ZhaoRithesh MurthyRoshan RamAkshara PrabhakarShelby HeineckeCaiming XiongSilvio SavareseHuan Wanghttp://arxiv.org/abs/2603.16805v1Making Separation-First Multi-Stream Audio Watermarking Feasible via Joint Training2026-03-17T17:09:45ZModern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework-embedding distinct information into stems using unique keys but a shared structure, mixing, separating, and decoding from each output. A naive pipeline (robust watermarking + off-the-shelf separation) yields poor bit recovery, showing robustness to generic distortions does not ensure robustness to separation artifacts. To enable this, we jointly train the watermark system and the separator in an end-to-end manner, encouraging the separator to preserve watermark cues while adapting embedding to separation-specific distortions. Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality.2026-03-17T17:09:45ZHoumin SunZi HuLinxi LiYechen WangLiwei JinMing Lihttp://arxiv.org/abs/2603.16713v1Evaluating Latent Space Structure in Timbre VAEs: A Comparative Study of Unsupervised, Descriptor-Conditioned, and Perceptual Feature-Conditioned Models2026-03-17T16:03:07ZWe present a comparative evaluation of latent space organization in three Variational Autoencoders (VAEs) for musical timbre generation: an unsupervised VAE, a descriptor-conditioned VAE, and a VAE conditioned on continuous perceptual features from the AudioCommons timbral models. Using a curated dataset of electric guitar sounds labeled with 19 semantic descriptors across four intensity levels, we assess each model's latent structure with a suite of clustering and interpretability metrics. These include silhouette scores, timbre descriptor compactness, pitch-conditional separation, trajectory linearity, and cross-pitch consistency. Our findings show that conditioning on perceptual features yields a more compact, discriminative, and pitch-invariant latent space, outperforming both the unsupervised and discrete descriptor-conditioned models. This work highlights the limitations of one-hot semantic conditioning and provides methodological tools for evaluating timbre latent spaces, contributing to the development of more controllable and interpretable generative audio models.2026-03-17T16:03:07Z5 pages, 1 figure, 1 tableJoseph CameronAlan Blackwell