https://arxiv.org/api/Gf/JkTCgF0Gso6I8sKw/bBO/x7w2026-06-13T11:08:23Z216831515http://arxiv.org/abs/2510.23320v2LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization2026-06-10T16:18:20ZWe introduce LibriConvo, a synthetic conversational speech corpus for speaker diarization and automatic speech recognition (ASR), built by instantiating the previously proposed Speaker-Aware Simulated Conversation (SASC) framework in a dataset and benchmarking setting. The main contribution of this paper is a corpus construction pipeline and benchmark derived from that framework. To make the data more suitable for downstream ASR and diarization, conversational timing statistics are estimated from English CallHome using external voice activity detection, long pauses are compressed, LibriTTS utterances are grouped by book to improve local semantic continuity, and room impulse responses are selected with a spatial-plausibility heuristic. The resulting corpus contains 240.1 hours of audio across 1,496 dialogues involving 830 speakers, partitioned into speaker-disjoint train, validation, and test splits. We report baseline results for both diarization and ASR. On the test split, Sortformer outperforms the pyannote pipeline in diarization (11.1\% vs.~24.4\% DER). For ASR, a Fast Conformer-CTC XLarge model fine-tuned with Serialized Output Training achieves 7.29\% WER and 6.97\% cpWER, outperforming zero-shot Whisper-large-v3. These results position LibriConvo as a practical benchmark for studying synthetic conversational speech and for evaluating multi-speaker speech processing systems.2025-10-27T13:35:22ZAccepted by TSD 2026Máté GedeonPéter Mihajlikhttp://arxiv.org/abs/2606.12199v1Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation2026-06-10T15:19:15ZSpoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300\,bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50$\rightarrow$2.08\,Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17\,Hz with intermediate-layer representation alignment.2026-06-10T15:19:15ZAccepted by Interspeech 2026 long paperZhen YeXu TanYiming LiGuangyan ZhangChimin ChanHaohe LiuZhengxi LiuHongzhan LinZheqi DaiXinshen ZhangPeiwen SunQiuqiang KongWei Xuehttp://arxiv.org/abs/2602.00560v2Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards2026-06-10T10:57:05ZImperceptible text-based speech editing modifies spoken content through transcript manipulation while preserving acoustic continuity. Prior acoustic-space approaches suffer from content-style entanglement, causing unstable generation and boundary artifacts. We introduce a framework guided by the principle of "Edit Content, Preserve Acoustics". Editing is conducted in a stable semantic space, while acoustic realization is handled by a Flow Matching decoder. To ensure perceptual consistency, we propose Self-Consistency Rewards Group Relative Policy Optimization, which leverages a pre-trained Text-to-Speech model as an implicit critic, together with intelligibility and duration constraints. Experiments demonstrate consistent improvements over state-of-the-art autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality.2026-01-31T06:56:50ZAccepted by Interspeech 2026Yong RenJiangyan YiJianhua TaoTao WangLe XuZhengqi Wenhttp://arxiv.org/abs/2606.05394v2nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies2026-06-10T10:24:15ZnnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale=`no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.2026-06-03T20:00:41ZAbhinaba RoyJunyi LiangDorien Herremanshttp://arxiv.org/abs/2606.06065v3Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition2026-06-10T09:39:11ZSecond-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.2026-06-04T12:07:07Z5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for AudioSeung Hwan ChoYoung-Min Kimhttp://arxiv.org/abs/2606.11795v1Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency2026-06-10T08:26:29ZMulti-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70 % of the tightening effect achieved by ideal tight-label training and improves downstream performance.2026-06-10T08:26:29ZAccepted to Interspeech 2026 (Long Paper Track)Shota HoriguchiMarc DelcroixNaohiro TawaraTakanori AshiharaAtsushi Andohttp://arxiv.org/abs/2606.02220v2SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment2026-06-10T08:04:55ZSelf-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context invariance across different speaking styles. We introduce SiamCTC, a framework that integrates Siamese networks with Connectionist Temporal Classification (CTC) to learn speech representations without strict frame-level correspondence. By employing CTC loss to establish flexible, monotonic alignments between differing temporal realizations of the same content, SiamCTC accommodates speed perturbations and other temporal augmentations. This design relaxes frame-wise constraints while preserving temporal coherence and enhancing robustness to speaking-rate variations in downstream tasks. Our experiments demonstrate that SiamCTC leads to more adaptable speech representations, particularly at diverse speaking rates.2026-06-01T13:20:46ZAccepted to Interspeech 2025SooHwan EomMark Hasegawa-JohnsonChang D. Yoo10.21437/Interspeech.2025-2746http://arxiv.org/abs/2606.11766v1Fast Speech Foundation Model Distillation Using Interleaved Stacking2026-06-10T07:43:32ZDistilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires an additional student model training. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.2026-06-10T07:43:32ZAccepted by Interspeech 2026Eungbeom KimKyogu Leehttp://arxiv.org/abs/2606.06940v2Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models2026-06-10T06:55:49ZWhile Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM\footnote{ \urlstyle{same} https://github.com/zxzhao0/CogAudio-LLM, a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a ``lexically-identical, multi-emotion'' dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.2026-06-05T06:11:38ZAccepted by Interspeech2026Zhixian ZhaoShuiyuan WangWenjie TianJingbin HuZiyu ZhangLei Xiehttp://arxiv.org/abs/2405.06995v4Benchmarking Cross-Domain Audio-Visual Deception Detection2026-06-10T05:04:18ZAutomated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, that enables us to assess how well these methods generalize for use in real-world scenarios. We used widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further exploit the impacts using data from multiple source domains for training, we investigate three types of domain sampling strategies, including domain-simultaneous, domain-alternating, and domain-by-domain for multi-to-single domain generalization evaluation. We also propose an algorithm to enhance the generalization performance by maximizing the gradient inner products between modality encoders, named ``MM-IDGM". Furthermore, we proposed the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.2024-05-11T12:06:31Z17 pagesXiaobao GuoZitong YuNithish Muthuchamy SelvarajBingquan ShenAdams Wai-Kin KongAlex C. Kothttp://arxiv.org/abs/2606.11631v1Benchmarking Neural Speech Compression from a Rate-Distortion Perspective2026-06-10T03:49:54ZLearning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate--distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate--distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely-used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: https://avery-xu.github.io/ECC-demo/2026-06-10T03:49:54ZJun XuZhengxue ChengFengxi ZhangYuhan LiuLi SongWenjun Zhanghttp://arxiv.org/abs/2509.15680v3SAM: A Mamba-2 State-Space Audio-Language Model2026-06-10T03:06:55ZWe present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.2025-09-19T06:54:42Z6 pages, Accepted to Interspeech 2026Taehan LeeJaehan JungHyukjun Leehttp://arxiv.org/abs/2606.11581v1Sensitivity Analysis of Generative Spatial Audio Metrics: A Study on Responsiveness, Smoothness, and Symmetry2026-06-10T02:15:15ZEvaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. Using controlled FOA scenes with increasing scene complexity, we define three desiderata for metric behavior: Responsiveness, Smoothness, and Symmetry. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is the first step towards investigating the sensitivity of metrics for generative spatial audio.2026-06-10T02:15:15ZAccepted for publication at Interspeech 2026Purnima KamathAdrian S. RomanKoichi SaitoYuki MitsufujiJuan P. Bellohttp://arxiv.org/abs/2603.11678v2RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis2026-06-09T22:39:13ZWe propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12\% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.2026-03-12T08:43:04ZAccepted to Interspeech 2026 Long paper track. Code: https://github.com/infected4098/Relativistic-Adversarial-FeedbackYongjoon LeeJung-Woo Choihttp://arxiv.org/abs/2606.11429v1Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains2026-06-09T20:27:59ZSpeech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.2026-06-09T20:27:59ZAccepted by Interspeech 2026Zilai WangNatarajan Balaji ShankarMohan ShiKaiyuan ZhangAbeer Alwan