http://arxiv.org/abs/2509.21597v2 AUDDT: A Unified Benchmark Toolkit for Audio and Speech Deepfake Detectors 2026-06-03T17:37:13Z

With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, existing benchmarks employ a narrow set of datasets, leaving detector generalization to real-world conditions uncertain. In this paper, we systematically review 31 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across a wide range of speech and non-speech audio datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors under diverse manipulation types and recording conditions. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, we highlight how AUDDT differs from existing benchmarking efforts by enabling large-scale, diverse evaluation across modern spoofing methods and richer attribute-level analysis through comprehensive metadata annotation. Using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable performance variability across different conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gaps relative to practical deployment scenarios.

2025-09-25T21:09:40Z Yi Zhu Heitor R. Guimarães Arthur Pimentel Tiago Falk http://arxiv.org/abs/2606.05121v1 Audio Interaction Model 2026-06-03T17:26:11Z

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

2026-06-03T17:26:11Z Next generation of LALMs, work in progress Zhifei Xie Zihang Liu Ze An Xiaobin Hu Yue Liao Ziyang Ma Dongchao Yang Mingbao Lin Deheng Ye Shuicheng Yan Chunyan Miao http://arxiv.org/abs/2508.08237v4 VGGSounder: Audio-Visual Evaluations for Foundation Models 2026-06-03T15:31:08Z

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

2025-08-11T17:53:23Z Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2025 Daniil Zverev Thaddäus Wiedemer Ameya Prabhu Matthias Bethge Wieland Brendel A. Sophia Koepke http://arxiv.org/abs/2606.04943v1 Differentiable Articulatory Copy-Synthesis of Biphonic Singing 2026-06-03T14:33:48Z

Sygyt is a Tuvan style of biphonic singing in which a low vocal drone is sustained while a high harmonic is selectively amplified in the 1--3\,kHz region. Copy-synthesizing this effect remains challenging for articulatory models, since it requires fine control of narrowly focused resonances that standard low-dimensional tract parameterizations cannot easily reproduce. We address this problem with a differentiable Kelly--Lochbaum waveguide augmented with a sublingual second source, cubic B-spline tract parameterization, and spatially varying learnable damping, optimized end-to-end by gradient descent from audio. On 20 segments from two independent sygyt datasets (5 singers, 10 pitches), the proposed model reduces log-spectral distance by 30--38\% relative to an articulatory baseline, with the largest gains concentrated in the overtone region. Cepstral-envelope analysis further shows more accurate recovery of the merged formant structure characteristic of sygyt production. The model also outperforms a DDSP harmonic-plus-noise baseline with direct per-harmonic spectral control, suggesting that explicit acoustic structure is a useful inductive bias for overtone-singing copy-synthesis.
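
As a rough illustration of the Kelly--Lochbaum building block mentioned above, the sketch below computes pressure-wave reflection coefficients from a piecewise-constant tract area function and applies one scattering junction. It is not the authors' model: the sublingual second source, the B-spline parameterization, and the learnable damping are omitted, the function names are hypothetical, and sign conventions differ between references.

```python
def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction between adjacent tube sections.

    Assumes the convention k = (A_i - A_{i+1}) / (A_i + A_{i+1}); other texts flip the sign.
    """
    if any(a <= 0 for a in areas):
        raise ValueError("cross-sectional areas must be positive")
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def scatter(k, forward_in, backward_in):
    """One Kelly-Lochbaum junction for pressure waves.

    forward_in arrives from the section before the junction, backward_in from the section after it.
    Returns (forward_out, backward_out).
    """
    forward_out = (1 + k) * forward_in - k * backward_in
    backward_out = k * forward_in + (1 - k) * backward_in
    return forward_out, backward_out


# A uniform tube reflects nothing; a widening from area 1 to area 3 gives k = -0.5.
uniform = reflection_coefficients([1.0, 1.0, 1.0])
widening = reflection_coefficients([1.0, 3.0])
passed = scatter(widening[0], 1.0, 0.0)
```

Because every operation here is a smooth function of the areas, an implementation in an automatic-differentiation framework could in principle be optimized by gradient descent, which is the general idea the abstract describes.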

2026-06-03T14:33:48Z Accepted to DAFx 2026 Mateo Cámara María Pilar Daza-Llin Fernando Marcos-Macías José Luis Blanco http://arxiv.org/abs/2606.04939v1 UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning 2026-06-03T14:29:52Z

Audio generation and audio-to-text understanding remain largely separate, with diffusion models dominating high-fidelity synthesis and autoregressive (AR) language models driving captioning and semantic prediction. Existing unified approaches typically rely on either heterogeneous modules or AR-centric modeling, which can hinder joint optimization and limit acoustic fidelity. We present UAT, to our knowledge, the first diffusion-centric framework that supports unified audio generation, editing, and captioning. UAT couples continuous latent diffusion for audio with masked discrete diffusion for text, enabling bidirectional audio-text modeling within a shared dual-stream backbone. Experiments show that UAT preserves strong audio generation and editing capabilities while achieving competitive captioning performance, demonstrating a favorable balance between acoustic synthesis and semantic prediction. Demo samples are available at https://UAT-demo.github.io.

2026-06-03T14:29:52Z Hui Wang Yifan Yang Zeyue Tian Yuhang Jia Jinghua Zhao Long Zhou Bing Han Cheng Liu Jiaming Zhou Geng Tu Yong Qin http://arxiv.org/abs/2510.20253v2 Neural Directional Filtering with Configurable Directivity Pattern at Inference 2026-06-03T14:23:49Z

Spatial filtering with a desired directivity pattern is advantageous for many audio applications. In this work, we propose neural directional filtering with user-defined directivity patterns (UNDF), which enables spatial filtering based on directivity patterns that users can define during inference. To achieve this, we propose a DNN architecture that integrates feature-wise linear modulation (FiLM), allowing user-defined patterns to serve as conditioning inputs. Through analysis, we demonstrate that the FiLM-based architecture enables the UNDF to generalize during inference to unseen user-defined patterns with higher directivities, scaling variations, and different steering directions. Furthermore, we progressively refine training strategies to enhance pattern approximation and enable UNDF to approximate irregular shapes. Lastly, experimental comparisons show that UNDF outperforms conventional methods.
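
For readers unfamiliar with FiLM, the following minimal sketch shows the general mechanism: a conditioning vector (here a directivity pattern sampled at a few angles) is mapped to a per-channel scale and shift that modulate the feature maps. The layer sizes, weights, and the four-angle pattern are made-up assumptions for illustration and do not reflect the UNDF architecture.

```python
def linear(weights, bias, x):
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel by its own gamma and beta."""
    return [[g * x + b for x in channel] for channel, g, b in zip(features, gamma, beta)]


# Hypothetical directivity pattern: gains sampled at four look directions.
pattern = [1.0, 0.5, 0.0, 0.5]

# Hypothetical conditioning weights (in a real model these would be learned).
W_gamma = [[0.5, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
b_gamma = [1.0, 1.0]
W_beta = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
b_beta = [0.0, 0.0]

gamma = linear(W_gamma, b_gamma, pattern)   # per-channel scale
beta = linear(W_beta, b_beta, pattern)      # per-channel shift

features = [[1.0, 2.0], [3.0, 4.0]]         # two channels, two frames each
modulated = film(features, gamma, beta)
print(modulated)  # [[1.5, 3.0], [5.5, 7.0]]
```

Since the pattern only enters through gamma and beta, a different pattern can be supplied at inference time without changing the network weights.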

2025-10-23T06:22:53Z Final camera-ready version of EUSIPCO 2026 Weilong Huang Srikanth Raj Chetupalli Emanuël A. P. Habets http://arxiv.org/abs/2606.04921v1 SURF: Separation via Unsupervised Remixing Flow 2026-06-03T14:17:12Z

The goal of single-channel source separation is to reconstruct $K$ sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited, and even when available, supervised models are vulnerable to domain shifts. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a "remixing" step to bootstrap the learning of a student flow model from the teacher's estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods. See our demo page for examples: https://google.github.io/df-conformer/surf/

2026-06-03T14:17:12Z Accepted at ICML 2026 Henry Li Robin Scheibler Efthymios Tzinis Matt Shannon Arnaud Doucet John R. Hershey http://arxiv.org/abs/2606.04730v1 Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026 2026-06-03T11:13:37Z

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach builds on a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
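
The re-ranking idea in the last sentence can be sketched as follows. The abstract does not say how likelihood and Minimum Bayes Risk (MBR) scores are combined, so the token-level F1 utility, the softmax normalization, and the interpolation weight `alpha` below are all assumptions chosen for illustration; the candidate strings and log-likelihoods are invented.

```python
import math


def token_f1(hyp, ref):
    """Token-overlap F1 between two strings, used here as a stand-in MBR utility."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    common = sum(min(h.count(t), r.count(t)) for t in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)


def expected_utility(cands):
    """MBR score: average utility of each candidate against all other candidates."""
    n = len(cands)
    return [
        sum(token_f1(c, o) for j, o in enumerate(cands) if j != i) / (n - 1)
        for i, c in enumerate(cands)
    ]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pick(cands, logliks, alpha=0.7):
    """Index of the candidate maximizing alpha * MBR utility + (1 - alpha) * normalized likelihood."""
    util = expected_utility(cands)
    post = softmax(logliks)
    scores = [alpha * u + (1 - alpha) * p for u, p in zip(util, post)]
    return max(range(len(cands)), key=scores.__getitem__)


cands = ["the cat sat on the mat", "the cat sat on a mat", "cat mat"]
logliks = [-2.0, -2.3, -1.6]  # invented scores; the short candidate is the most likely

best_by_likelihood = max(range(len(cands)), key=logliks.__getitem__)
best_combined = pick(cands, logliks)
```

In this toy case likelihood alone prefers the short candidate, while the combined score prefers a candidate that agrees with the rest of the pool.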

2026-06-03T11:13:37Z 9 pages main paper, IWSLT 2026 Instruction Following track Enes Yavuz Ugan Maike Züfle Yuka Ko Supriti Sinhamahapatra Fabian Retkowski Seymanur Akti Jan Niehues Alexander Waibel http://arxiv.org/abs/2606.04680v1 Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy 2026-06-03T10:03:19Z

Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20\% relative error rate reduction, with particularly strong gains under noisy conditions.

2026-06-03T10:03:19Z Submitted to Interspeech 2026. 6 pages, 4 figures Zhihan Li Hankun Wang Yiwei Guo Bohan Li Xie Chen Kai Yu http://arxiv.org/abs/2606.03283v2 SpeakerCard-1M: An Evidence-Grounded Speaker Card Corpus for In-the-Wild Speaker Verification 2026-06-03T09:14:41Z

Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the "-1M" suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training increases VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.
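
Since the verification results above are reported as equal error rate (EER), a minimal sketch of how EER is typically computed from trial scores may help. The scores below are invented, and the simple threshold sweep is an assumption for illustration; evaluation toolkits usually interpolate the ROC curve instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds and return the point where the
    false-acceptance and false-rejection rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted when its score is at least the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Invented similarity scores for same-speaker and different-speaker trials.
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]

eer = equal_error_rate(target, nontarget)
print(f"EER = {100 * eer:.2f}%")  # EER = 25.00%
```

An absolute change such as 0.31% refers to a difference between two such percentages, not to a relative change.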

2026-06-02T07:49:30Z Corpus and protocols at https://junyipeng00.github.io/SpeakerCard-1M-page Junyi Peng Oldřich Plchot Xiao Song Dading Chong Lichun Fan Hang Su Themos Stafylakis Junjie Li Kong Aik Lee Shuai Wang Jian Luan Jan Černocký http://arxiv.org/abs/2508.14623v2 A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References 2026-06-03T05:46:56Z

This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both an evaluation metric and a training objective in supervised speech separation when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. A negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
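
The effect described above can be reproduced with the standard SI-SDR definition on a toy signal. This is a sketch under simplifying assumptions (a four-sample signal and noise chosen orthogonal to the clean speech), not the paper's derivation: a perfectly clean estimate is capped at the reference's own SNR, while an estimate that keeps some of the reference noise scores higher.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare the projected energy with the residual energy."""
    scale = dot(estimate, reference) / dot(reference, reference)
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    return 10 * math.log10(dot(target, target) / dot(residual, residual))


clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, 0.1, -0.1, -0.1]                      # orthogonal to `clean`, 20 dB below it
noisy_reference = [c + n for c, n in zip(clean, noise)]

# A perfectly clean estimate is limited to the SNR of the reference (20 dB here).
score_clean = si_sdr(clean, noisy_reference)

# An estimate that reproduces half of the reference noise is rewarded with a higher score.
half_noisy = [c + 0.5 * n for c, n in zip(clean, noise)]
score_half_noisy = si_sdr(half_noisy, noisy_reference)

# Rescaling the estimate does not change the score.
score_scaled = si_sdr([2.0 * c for c in clean], noisy_reference)
```

A model trained to maximize this objective against noisy references is therefore pushed toward keeping the noise, which is consistent with the negative correlation between SI-SDR and perceived noisiness reported above.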

2025-08-20T11:22:11Z Accepted for IEEE ASRU 2025, Workshop on Automatic Speech Recognition and Understanding. Copyright (c) 2025 IEEE. 8 pages, 6 figures, 2 tables 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Honolulu, HI, USA, 2025, pp. 1-8 Simon Dahl Jepsen Mads Græsbøll Christensen Jesper Rindom Jensen 10.1109/ASRU65441.2025.11434756 http://arxiv.org/abs/2606.04418v1 CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding 2026-06-03T03:56:14Z

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

2026-06-03T03:56:14Z Eugene Kwek Feng Liu Rui Zhang Wenpeng Yin http://arxiv.org/abs/2606.01804v2 SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing 2026-06-03T03:45:51Z

Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code are available at https://github.com/daxintan-cuhk/SpeechEditBench.
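
The relationship between the three metrics can be sketched as below. The abstract does not give the exact definitions, so this assumes that each edited sample receives two binary judgments and that joint success requires both on the same sample; the sample outcomes are invented.

```python
def success_rates(results):
    """results: list of (target_ok, preserved_ok) booleans, one pair per edited sample.

    Assumed definitions: joint success counts samples where both judgments hold.
    """
    n = len(results)
    target = sum(t for t, _ in results) / n
    preservation = sum(p for _, p in results) / n
    joint = sum(t and p for t, p in results) / n
    return {"target": target, "preservation": preservation, "joint": joint}


# Invented outcomes for four edits.
results = [(True, True), (True, False), (False, True), (True, True)]
rates = success_rates(results)
print(rates)  # {'target': 0.75, 'preservation': 0.75, 'joint': 0.5}
```

Under this assumption joint success can never exceed the smaller of the other two rates, which is one reason it is the hardest metric to score well on.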

2026-06-01T07:21:02Z Hanlin Zhang Daxin Tan Dehua Tao Xiao Chen Haochen Tan Linqi Song http://arxiv.org/abs/2606.04370v1 Masked Wavelet Scattering Transform Neural Field for Sound Field Reconstruction 2026-06-03T02:34:45Z

In this paper, we propose a reconstruction framework that leverages the Wavelet Scattering Transform (WST) as a multi-scale feature extractor to impose statistical priors under sparse observation conditions. The reconstruction problem is formulated as an optimization task and solved using a neural field, with the WST incorporated into the training loss function. As a proof of concept, we validate the proposed method on HRTF upsampling. A masking strategy is applied to the WST coefficients, resulting in a two-phase procedure. The first phase learns a binary mask from a small multi-subject dataset, while the second phase applies the learned mask to the WST coefficients of an individual HRTF to preserve informative statistical structures during reconstruction. Validation against baseline methods, which also serve as an ablation study of the different components of the framework, demonstrates the effectiveness of the proposed approach.

2026-06-03T02:34:45Z 5 pages, 2 figures, conference Xinmeng Luan Samuel A. Verburg Efren Fernandez-Grande Gary Scavone http://arxiv.org/abs/2606.04358v1 Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses 2026-06-03T02:18:39Z

The image-source model (ISM) is a widely adopted method for efficiently simulating acoustic room impulse responses (RIRs) under specular reflection assumptions. Acoustic paths between source and receiver are traced to lattice points computed from successive reflections over bounding planes of the room. In rectangular rooms, the total number of image-sources is polynomial in the RIR's duration (or its distance equivalent $k$), with degree equal to the number of room dimensions $N$. Direct ISM simulations therefore have a compute upper bound of $O\left(k^N\right)$ and consider only cases of $N \leq 3$ for tractability and real-world applications. This work proposes an alternative computational method that lowers the asymptotic compute bound to $O\left(N k^2 \log k\right)$ for integer coordinates and room dimensions by reducing ISM lattice point counting to the classic Gauss circle problem (GCP). We extend the lattice counting model to frequency-dependent and reflection-weighted image-sources in higher dimensions, relating solutions between successive dimensions via the convolution operator. Two constructions for realizing RIRs are presented, along with time-frequency controls, error and run-time analysis, and RIR statistics.
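
To make the lattice-counting view concrete, the sketch below counts integer lattice points within a given radius, first by direct enumeration in $N$ dimensions (whose cost grows like $k^N$) and then with a row-wise count for the two-dimensional Gauss circle problem. It only illustrates the counting problem; it is not the convolution-based method proposed in the paper, and the function names are hypothetical.

```python
import itertools
import math


def count_brute_force(radius, dims):
    """Count integer points p in Z^dims with |p| <= radius by enumerating the bounding cube.

    The cube holds (2 * radius + 1) ** dims candidates, so the cost grows like radius ** dims.
    """
    r2 = radius * radius
    axis = range(-radius, radius + 1)
    return sum(
        1 for p in itertools.product(axis, repeat=dims) if sum(c * c for c in p) <= r2
    )


def gauss_circle(radius):
    """Gauss circle count in 2D using one integer square root per column."""
    r2 = radius * radius
    return sum(2 * math.isqrt(r2 - x * x) + 1 for x in range(-radius, radius + 1))


print(gauss_circle(5))          # 81
print(count_brute_force(5, 2))  # 81
print(count_brute_force(2, 3))  # 33
```

In an ISM setting with unit room dimensions, each lattice point would correspond to one image source, so the count within radius $k$ indicates how many image sources contribute to the RIR up to that distance.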

2026-06-03T02:18:39Z Accepted for publication at the 29th International Conference on Digital Audio Effects 2026 Yuancheng Luo