https://arxiv.org/api/mKDQZ7L+P33kgkq18euD3dEZCwI2026-06-22T08:07:10Z2177433015http://arxiv.org/abs/2605.29948v2HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding2026-06-01T07:03:07ZUnified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. 
The code is available at: https://github.com/bovod-sjtu/HoliTok.2026-05-28T13:55:19Z14 pages, 2 figures, 8 tablesBohan LiShi LianHankun WangYiwei GuoYu XiZhihan LiDa ZhengColin ZhangKai Yuhttp://arxiv.org/abs/2606.01704v1Kinship Verification Using Voice2026-06-01T05:13:39ZKinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodologies. First, leveraging the speech recordings of the large-scale audio-visual dataset, KAN-AV, we propose a revised evaluation protocol that controls for various confounders and adopts a family-disjoint train-test split to address open-set KV. Second, we analyze the close connection between speaker verification and KV, showing that genealogical similarity of speaker pairs plays opposite roles in the two tasks. Third, we tackle KV using three neural speaker embedding extractors (ECAPA-TDNN, WavLM-ECAPA, and ReDimNet) combined with various back-ends. In zero-shot KV including same-speaker target trials, ReDimNet achieves the lowest equal error rate (EER) of 20.8%; however, performance degrades to 39.7% under strict kin trials, where same-speaker target trials are excluded. Our best trainable back-end, which applies asymmetric processing of the embedding pair to mitigate age-difference effects, obtains an EER of 32.0% (18.6% with speaker target trials included). These results highlight the difficulty of KV while showing that speaker embeddings encode familial cues, offering a promising foundation for voice-based kinship analysis.2026-06-01T05:13:39ZSubmitted to IEEE TASLPJagabandhu MishraTomi H. 
Kinnunenhttp://arxiv.org/abs/2606.01639v1RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection2026-06-01T03:45:22ZDeep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurrence and post-surgical remission over the years. RRP demands continuous voice monitoring that existing cross-sectional corpora cannot support. We introduce the first longitudinal voice dataset for RRP, comprising recordings from 26 patients with up to ten years of follow-up. Each session pairs sustained vowels with sentence-level utterances, which are annotated by otolaryngologists and confirmed synchronously with laryngoscopy. Building on this resource, we establish a systematic benchmark spanning handcrafted features, end-to-end deep networks, self-supervised pretrained models, and recent audio large language models, all evaluated under session-level cross-validation with patient-level audit. Per-subject longitudinal analyses further confirm that the cross-sectional discriminative signal reflects laryngoscopic disease state rather than stable speaker attributes. This work lays a foundation for rare longitudinal pathological voice tasks in low-resource clinical settings.2026-06-01T03:45:22ZSubmitted to APSIPA ASC 2026 Special TracksWenze RenKe-Han LuKai-Wei ChangTiantian FengChing FangZhi-Chi LiaoDao Thi Hai YenSyu-Siang WangYu TsaoChi-Te WangShih-Hau Fanghttp://arxiv.org/abs/2601.06199v3FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation2026-06-01T03:39:22ZScaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. 
Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second (a 97% reduction) without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.2026-01-08T07:46:03ZTitle updatedJunseok LeeSangyong LeeChang-Jae Chunhttp://arxiv.org/abs/2606.01578v1Description and Discussion on DCASE 2026 Challenge Task 2: Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring2026-06-01T02:25:54ZThis paper presents an overview of DCASE 2026 Challenge Task 2, titled "Noise-aware unsupervised anomalous sound detection (UASD) for machine condition monitoring." The task aims to advance noise-robust anomalous sound detection for machine condition monitoring under the unsupervised setting, where only normal machine sounds are available for training. Reliable detection under noisy conditions is crucial for practical deployment, but previous DCASE Task 2 settings provided limited information about environmental noise, potentially limiting UASD performance in highly noisy situations. To address this limitation, DCASE 2026 allows participants to exploit two-channel audio samples simultaneously captured at locations near and far from the target machine. 
Since the distant microphone is expected to contain relatively stronger environmental noise and weaker direct machine sounds, it may help distinguish environmental noise components from the target machine sounds. After the challenge submission deadline, challenge results and an analysis of the submitted systems will be added.2026-06-01T02:25:54Zthis article draws heavily from arXiv:2506.10097Tomoya NishidaNoboru HaradaDaiki TakeuchiDaisuke NiizumiKeisuke ImotoKota DohiHarsh PurohitTakashi EndoYohei Kawaguchihttp://arxiv.org/abs/2605.30748v2Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS2026-06-01T01:53:31ZWe present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash attains high-fidelity synthesis comparable to strong autoregressive and non-autoregressive baselines, while supporting streaming inference with time-to-first-packet on par with streaming AR systems and substantially lower real-time factor. 
Code and audio samples are available at https://github.com/resemble-ai/chatterbox-flash.2026-05-29T02:25:02Z8 pages, 4 figures, 9 tablesDeokjin SeoGangin ParkKihyun Namhttp://arxiv.org/abs/2604.25719v2Step-Audio-R1.5 Technical Report2026-05-31T23:35:54ZRecent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. 
Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.2026-04-28T14:44:30ZYuxin ZhangXiangyu Tony ZhangDaijiao LiuFei TianYayue DengJun ChenQingjian LinHaoyang ZhangYuxin LiJinglan GongYechang HuangLiang ZhaoChengyuan YaoHexin LiuEng Siong ChngXuerui YangGang YuXiangyu ZhangDaxin Jianghttp://arxiv.org/abs/2606.01483v1MURMUR: An Efficient Inference System for Long-Form ASR2026-05-31T22:54:57ZLong-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. 
The code of Murmur is available at https://github.com/uw-syfi/Murmur.2026-05-31T22:54:57ZWei-Tzu LeeKeisuke KamahoriBaris Kasikcihttp://arxiv.org/abs/2606.01460v1A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation2026-05-31T21:42:46ZMulti-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.2026-05-31T21:42:46ZPreprint submitted to the IEEE 28th International Workshop on Multimedia Signal Processing (MMSP). This work has been submitted to the IEEE for possible publication. 
6 pages, 2 figuresMichael Taenzerhttp://arxiv.org/abs/2606.01264v1A 1000-hour EEG-EMG-audio dataset of Japanese speech production2026-05-31T14:30:46ZWe present a multimodal dataset of 1020 hours of simultaneously recorded scalp electroencephalography (EEG), facial electromyography (EMG), and speech audio from three healthy native Japanese speakers during open-vocabulary overt speech. Recordings were acquired across many sessions over several months with three EEG systems spanning 62-128 channels: an ultra-high-density system (g.Pangolin) and two cap-type systems (g.SCARABEO and eegosports). Each session provides time-synchronized EEG, facial EMG, and audio, together with speech-event annotations and transcriptions. Although collected with speech decoding as a primary motivation, the dataset also supports work on multimodal signal processing, artifact modeling, longitudinal and cross-device adaptation, and EEG representation learning. Technical validation included power spectral density and event-related potential analyses across participants, devices, and tasks, which showed the expected 1/f spectral profile, task-related alpha-band attenuation, and time-locked evoked responses. The dataset is released in Brain Imaging Data Structure (BIDS) format via OpenNeuro under a CC0 waiver to support both speech-related and broader EEG research.2026-05-31T14:30:46ZMotoshige SatoIlya HoriguchiMasakazu InoueKenichi TomeokaEri HatakeyamaYuya KitaAtsushi YamamotoIppei FujisawaShuntaro Sasaihttp://arxiv.org/abs/2606.02642v1SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models2026-05-31T14:00:37ZDespite the success of audio-visual large language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. 
In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.2026-05-31T14:00:37ZAccepted at CVPR 2026Chenshuang ZhangKyeong Seon KimChengxin LiuTae-Hyun Ohhttp://arxiv.org/abs/2510.01891v2HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering2026-05-31T13:26:11ZIndividual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. 
While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.2025-10-02T10:59:21ZAccepted to IEEE Transactions on Multimedia 2026Xuyi HuJian LiShaojie ZhangStefan GoetzLorenzo PicinaliOzgur B. AkanAidan O. T. Hogghttp://arxiv.org/abs/2606.01134v1Context-aware child-directed speech detection from long-form recordings2026-05-31T10:12:47ZAutomatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six self-supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. 
Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.2026-05-31T10:12:47Z6 pages, 1 figureThéo CharlotTarek KunzeKaveri K. ShethAlejandrina CristiaMarvin Lavechinhttp://arxiv.org/abs/2601.03615v2SARA: Stress Test Reasoning in Audio Deepfake Detection2026-05-31T06:28:59ZAudio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing transparency to their predictions via reasoning traces. However, such reasoning may not support the model predictions, reflecting poor coherence, or, worse, may rationalize incorrect predictions with plausible but misleading explanations. Moreover, the behavior of ALM reasoning under adversarial attacks remains under-explored, raising questions about the practical reliability of such explanation capabilities. To address this gap, this study introduces SARA (Shift Analysis of Reasoning in Audio), a diagnostic framework that evaluates ALM reasoning across three dimensions: acoustic perception, reasoning-verdict coherence and dissonance. We test five open-source ALMs against both acoustic and linguistic adversarial attacks. We show that acoustic attacks significantly degrade reasoning-verdict coherence (average decrease of 14.20%), frequently inducing internal logical conflicts. Conversely, linguistic attacks achieve higher attack success rates while maintaining reasoning coherence. We further demonstrate that the textual coherence of generated reasoning traces also serves as a latent indicator of adversarial inputs, enabling effective detection of perturbed audio (0.78 in F1) without accessing the raw acoustic signal. 
These findings suggest that reasoning traces provide diagnostic utility that persists even when final classification outputs are compromised.2026-01-07T05:46:45ZPreprint for ACL 2026 submissionBinh NguyenCharles FlemingThai Lehttp://arxiv.org/abs/2606.02638v1SegTune: Structured and Fine-Grained Control for Song Generation2026-05-31T06:01:15ZRecent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.2026-05-31T06:01:15ZThis paper has been accepted to ACL 2026 as an oral presentation and has been nominated for the Best Paper Award. This work is a revised and extended version of an earlier technical report (arXiv:2510.18416). arXiv admin note: text overlap with arXiv:2510.18416Yuejiao WangZihao JiPengfei CaiXu LiHaorui ZhengZewen SongZhongliang LiuChen ZhangPengfei Wan
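The SegTune abstract above says that segment prompts are "temporally broadcast to corresponding time windows". One way to picture that step is the minimal Python sketch below. It is an illustration only, not the authors' implementation: the Segment class, the broadcast_prompts function, the frame rate, and the empty-string default for uncovered frames are all hypothetical names and choices that do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical container for one song segment and its local prompt."""
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    prompt: str     # local musical description for this segment


def broadcast_prompts(segments: List[Segment], total_s: float,
                      frame_rate_hz: float, default: str = "") -> List[str]:
    """Return one prompt per frame; frames outside every segment get `default`.

    Assumption: time is discretized at a fixed frame rate and each segment
    covers the half-open frame range [start, end).
    """
    n_frames = int(round(total_s * frame_rate_hz))
    frames = [default] * n_frames
    for seg in segments:
        first = max(0, int(round(seg.start_s * frame_rate_hz)))
        last = min(n_frames, int(round(seg.end_s * frame_rate_hz)))
        for i in range(first, last):
            frames[i] = seg.prompt  # later segments overwrite earlier ones
    return frames


if __name__ == "__main__":
    segs = [Segment(0.0, 2.0, "intro"), Segment(2.0, 5.0, "verse")]
    per_frame = broadcast_prompts(segs, total_s=6.0, frame_rate_hz=2.0)
    print(len(per_frame), per_frame[0], per_frame[4], repr(per_frame[-1]))
```

In a real system the per-frame strings would presumably be replaced by prompt embeddings and combined with the global prompt, but the abstract does not say how, so that part is left out here.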