https://arxiv.org/api/eWHafZsXaSsU6je1p7PWfgslWJU 2026-06-22T09:06:01Z 21774 345 15 http://arxiv.org/abs/2606.01016v1 PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects 2026-05-31T05:13:32Z

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

2026-05-31T05:13:32Z 19 pages, 13 figures, KDD 2026 Sicheng Yang Shulan Ruan Shiwei Wu Yu Liu Lu Fan Zhi Li You He http://arxiv.org/abs/2606.00851v1 Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning 2026-05-30T18:53:35Z

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

2026-05-30T18:53:35Z Sukru Samet Dindar Riki Shimizu Xilin Jiang Nima Mesgarani http://arxiv.org/abs/2606.02631v1 Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals 2026-05-30T14:59:57Z

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.

2026-05-30T14:59:57Z 12 pages, 3 figures Shenghao Ding http://arxiv.org/abs/2606.00684v1 Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection 2026-05-30T11:45:30Z

We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density for the relevant components in the representation and using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs that prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.

2026-05-30T11:45:30Z 16 pages, 5 figures Xinwei Cao Mengxuan Lu Torbjørn Svendsen Giampiero Salvi http://arxiv.org/abs/2606.00629v1 Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation 2026-05-30T09:05:11Z

Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.

2026-05-30T09:05:11Z DaFx 2026 Nelly Garcia Aditya Bhattacharjee Gabryel Mason-Williams Israel Mason-Williams Emmanouil Benetos Joshua Reiss http://arxiv.org/abs/2511.13487v3 Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization 2026-05-30T08:37:39Z

This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.

2025-11-17T15:25:49Z Accepted at EUSIPCO 2026 Davoud Shariat Panah Alessandro Ragano Dan Barry Jan Skoglund Andrew Hines http://arxiv.org/abs/2510.00180v3 DiffAU: Diffusion-Based Ambisonics Upscaling 2026-05-30T01:56:52Z

Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields as compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models combined with novel adaptation to spatial audio to generate 3rd order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers, show strong objective and perceptual performance.

2025-09-30T19:01:06Z Amit Milstein Nir Shlezinger Boaz Rafaely http://arxiv.org/abs/2606.00460v1 SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors 2026-05-30T00:54:53Z

Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children's speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.

2026-05-30T00:54:53Z Yekaterina Yegorova Argyrios Gerogiannis Haolong Zheng Julia Hockenmaier Chang D. Yoo Mark A. Hasegawa-Johnson http://arxiv.org/abs/2412.03771v3 Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification 2026-05-29T23:51:20Z

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduced a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets, ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019, and one music classification dataset, GTZAN. Results show that the diffusion model outperforms all baseline methods on average across six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.

2024-12-04T23:18:40Z Ysobel Sims Alexandre Mendes Stephan Chalup http://arxiv.org/abs/2606.00407v1 Privacy-preserving Prosody Representation Learning 2026-05-29T22:49:56Z

Speech representations that capture prosodic information can be useful for both understanding and generation. However, speaker characteristics are reflected in acoustic-prosodic features (e.g., pitch). To address privacy concerns from the leakage of identity information, we propose a new self-supervised approach to learning prosody representations that incorporates speaker disentanglement strategies. We evaluate our encoder on three tasks to probe representation capabilities, including pitch reconstruction and detection of different prosodic events. Our encoder outperforms raw prosody and HuBERT-base baselines, achieving strong speaker disentanglement without adverse impact on prosody-related downstream tasks.

2026-05-29T22:49:56Z Accepted to ACL 2026 Kevin Everson Mari Ostendorf http://arxiv.org/abs/2603.03312v3 Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding 2026-05-29T19:20:43Z

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental issues: Semantic Bias, where outputs collapse into generic linguistic templates; Signal Neglect, where models rely heavily on LLM priors to hallucinate fluent text even in the absence of meaningful signals; and the "BLEU Trap", where high-frequency stopwords inflate n-gram metrics, masking a lack of true semantic fidelity. To resolve these challenges, we move beyond conventional end-to-end pipelines and propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We extract these semantic anchors from EEG embeddings directly, then unify them with an Active Retrieval Decoding mechanism, compelling the LLM to ground its token generation in the neural signals rather than defaulting to linguistic priors. Furthermore, we break the BLEU Trap by establishing a comprehensive evaluation protocol using rigorous retrieval and distribution-based metrics such as Fréchet Distance. Extensive experiments demonstrate that SemKey effectively mitigates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.

2026-02-09T02:47:07Z Yuchen Wang Haonan Wang Yu Guo Honglong Yang Xiaomeng Li http://arxiv.org/abs/2603.10468v2 G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition 2026-05-29T17:09:39Z

We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setting, chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Prior Speech-LLM systems tend to prioritize either local diarization or global labeling, lacking the ability to jointly model fine-grained temporal boundaries and robust cross-chunk identity linking. We propose G-STAR, an end-to-end framework that couples a cache-conditioned speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Under chunk-wise decoding protocols, experiments on both oracle-segmented local evaluation and full-meeting global evaluation show strong speaker-attributed transcription performance.

2026-03-11T06:40:01Z submitted to Emnlp 2026 Jing Peng Ziyi Chen Haoyu Li Yucheng Wang Duo Ma Mengtian Li Yunfan Du Dezhu Xu Kai Yu Shuai Wang http://arxiv.org/abs/2605.31469v1 Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus 2026-05-29T16:01:25Z

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.

2026-05-29T16:01:25Z Máté Gedeon Piroska Zsófia Barta Péter Mihajlik Katalin Mády http://arxiv.org/abs/2605.31329v1 Improving acoustic drone detection generalization through pretraining and data augmentation 2026-05-29T14:04:55Z

Detecting unauthorized UAV flights is critical for surveillance, security, and airspace management. Acoustic drone detection, which relies on the distinctive propeller and motor sounds of UAVs, provides a low-cost, passive solution that requires no line of sight. A central challenge is generalization: reliably distinguishing drone signatures from ambient noise across unseen recording setups, environments, and UAV types (out-of-domain). Inspired by advances in large-scale audio pretraining, we develop a compact DNN-based detector and improve its generalization by (1) pretraining the model for broad sound-event classification before fine-tuning on diverse in-house and public drone recordings, and (2) applying on-the-fly augmentations (pitch shifting, noise mixing, microphone transfer function simulation, spectrogram augmentation) to expose the model to varied acoustic conditions. An ablation study quantifies the impact of each augmentation. For evaluation, we set target false-positive rates (FPR) aligned with real-world surveillance needs and report true-positive rates (TPR) on both in-domain data (public IDMT Berne 2022) and out-of-domain data (public AuDroK). Our results show that pretraining is the dominant factor for robust detection, yielding substantial TPR improvements over training from scratch on all benchmarks. The full augmentation chain provides additional gains on acoustically mismatched out-of-domain data, achieving the best mean TPR on the AuDroK subsets and the largest improvements on the most challenging scenarios. We further validate real-world applicability by measuring false positives on public non-drone corpora (IDMT-TRAFFIC and ESC-50), demonstrating equally low FPR on unfamiliar backgrounds. A distance-dependent analysis on IDMT Berne 2022 shows effective detection at distances up to 150 m.

2026-05-29T14:04:55Z Accepted to Quiet Drones 2026 Paul M. Reuter Mattes Ohlenbusch Christian Rollwage http://arxiv.org/abs/2510.20853v2 Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization 2026-05-29T13:20:38Z

Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i)~insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii)~task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset, the first to enable ExG-based analysis across five human senses, together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.

2025-10-22T05:11:02Z Accepted to ICLR 2026 Hyungjun Yoon Seungjoo Lee Yu Yvonne Wu Xiaomeng Chen Taiting Lu Freddy Yifei Liu Taeckyung Lee Hyeongheon Cha Haochen Zhao Gaoteng Zhao Dongyao Chen Cecilia Mascolo Sung-Ju Lee Lili Qiu