http://arxiv.org/abs/2603.05310v2 Latent-Mark: An Audio Watermark Robust to Neural Resynthesis 2026-03-16T06:51:37Z While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the imperceptible waveform variations used in prior watermarking methods. To address this limitation, we propose Latent-Mark, the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the codec's invariant latent space. We achieve this by optimizing the audio waveform to induce a detectable directional shift in its encoded latent representation, while constraining perturbations to align with the natural audio manifold to ensure imperceptibility. To prevent overfitting to a single codec's quantization rules, we introduce Cross-Codec Optimization, jointly optimizing the waveform across multiple surrogate codecs to target shared latent invariants. Extensive evaluations demonstrate robust zero-shot transferability to unseen neural codecs, achieving state-of-the-art resilience against traditional DSP attacks while preserving perceptual imperceptibility. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions. 
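The latent-shift idea in the Latent-Mark abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "codecs" are random linear maps rather than neural codecs, the imperceptibility constraint is a simple per-sample clip, and the names `embed`, `mean_score`, `step`, `iters`, and `eps` are hypothetical. It is not the authors' implementation.

```python
import random

def encode(codec, x):
    """Toy linear 'encoder': each latent coordinate is a dot product with x."""
    return [sum(w * v for w, v in zip(row, x)) for row in codec]

def score(codec, direction, x):
    """Projection of the encoded latent onto a secret direction."""
    return sum(d * z for d, z in zip(direction, encode(codec, x)))

def mean_score(codecs, direction, x):
    """Average projection score over all surrogate codecs."""
    return sum(score(c, direction, x) for c in codecs) / len(codecs)

def embed(x, codecs, direction, step=0.01, iters=50, eps=0.05):
    """Gradient ascent on the mean score, with the perturbation clipped
    to [-eps, eps] per sample (a stand-in for a perceptual constraint)."""
    n = len(x)
    # For linear encoders the gradient of the mean score is constant in x.
    grad = [0.0] * n
    for codec in codecs:
        for row, d in zip(codec, direction):
            for j, w in enumerate(row):
                grad[j] += d * w / len(codecs)
    delta = [0.0] * n
    for _ in range(iters):
        delta = [max(-eps, min(eps, dl + step * g)) for dl, g in zip(delta, grad)]
    return [xi + dl for xi, dl in zip(x, delta)]

rng = random.Random(0)
n, k = 16, 4  # hypothetical waveform length and latent size
codecs = [[[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)] for _ in range(2)]
direction = [rng.gauss(0, 1) for _ in range(k)]
x = [rng.gauss(0, 0.1) for _ in range(n)]
marked = embed(x, codecs, direction)
```

A zero-bit detector in this toy setting would compare `score(codec, direction, audio)` against a threshold. Averaging the objective over several surrogate codecs is the toy counterpart of the cross-codec optimization the abstract describes.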
2026-03-05T15:51:09Z Yen-Shan Chen Shih-Yu Lai Ying-Jung Tsou Yi-Cheng Lin Bing-Yu Chen Yun-Nung Chen Hung-yi Lee Shang-Tse Chen http://arxiv.org/abs/2603.14853v1 WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework 2026-03-16T05:52:57Z Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating a Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper. 2026-03-16T05:52:57Z Under Review Tianyi Tan Jiaxin Ye Yuanming Zhang Xiaohuai Le Xianjun Xia Chuanzeng Huang Jing Lu http://arxiv.org/abs/2408.01180v2 Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation 2026-03-16T05:20:58Z Representing symbolic music with compound tokens, where each token consists of several different sub-tokens representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. 
While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results as it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: the main decoder that models a sequence of compound tokens and the sub-decoder for modeling sub-tokens of each compound token. Experimental results show that applying the NMT to compound tokens improves perplexity on various symbolic music datasets and on discrete audio tokens from the MAESTRO dataset. 2024-08-02T11:02:38Z Accepted at 25th International Society for Music Information Retrieval Conference (ISMIR 2024) HaeJun Yoo Hao-Wen Dong Jongmin Jung Dasaem Jeong http://arxiv.org/abs/2510.08586v2 Dynamic Stress Detection: A Study of Temporal Progression Modelling of Stress in Speech 2026-03-16T04:09:42Z Detecting psychological stress from speech is critical in high-pressure settings. While prior work has leveraged acoustic features for stress detection, most treat stress as a static label. In this work, we model stress as a temporally evolving phenomenon influenced by historical emotional state. We propose a dynamic labelling strategy that derives fine-grained stress annotations from emotional labels and introduce cross-attention-based sequential models, a Unidirectional LSTM and a Transformer Encoder, to capture temporal stress progression. Our approach achieves notable accuracy gains on MuSE (+5%) and StressID (+18%) over existing baselines, and generalises well to a custom real-world dataset. These results highlight the value of modelling stress as a dynamic construct in speech. 
2025-10-02T06:30:44Z Vishakha Lall Yisi Liu 10.1109/CogMI67134.2025.00023 http://arxiv.org/abs/2603.12840v2 DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training 2026-03-16T03:47:49Z Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders with a three-stage training strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attacking performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of EER. 2026-03-13T09:43:35Z Ridwan Arefeen Xiaoxiao Miao Rong Tong Aik Beng Ng Simon See Timothy Liu http://arxiv.org/abs/2603.14767v1 Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments 2026-03-16T03:02:35Z Logical Access (LA) attacks, also known as audio deepfake attacks, use Text-to-Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. This can represent a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., LA task). 
For that, the performance of two enhancement algorithms is evaluated based on two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech-to-Reverberation Modulation Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset, provided in the ASVspoof 2019 Challenge, and corrupted its test set with different Signal-to-Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although we expect that speech quality will correlate well with speech applications' performance, enhancement can also have side effects on downstream tasks if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results corroborate this hypothesis, as we found that the enhancement algorithm leading to the highest speech quality scores, MetricGAN+, provided the highest Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the lowest EER, thus leading to better performance on the LA task. 2026-03-16T03:02:35Z Anacin Angela Shruti Kshirsagar Anderson R. Avila http://arxiv.org/abs/2601.20432v2 Self Voice Conversion as an Attack against Neural Audio Watermarking 2026-03-16T02:07:34Z Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. 
However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques. 2026-01-28T09:41:18Z 7 pages; 2 figures; 2 tables; accepted at IEICE, SP/SLP 2026 Yigitcan Özer Wanying Ge Zhe Zhang Xin Wang Junichi Yamagishi http://arxiv.org/abs/2603.14636v1 Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models 2026-03-15T22:12:36Z Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning. 
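As a rough illustration of inference-time steering as described in the "Nudging Hidden States" abstract, the sketch below uses a difference-of-means steering vector, which is one common construction in activation steering. The abstract does not specify the paper's three strategies, so this construction, the function names, and the scale `alpha` are assumptions for illustration only.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos, neg):
    """Difference between mean hidden states of two example sets
    (for instance, states from reasoning vs. non-reasoning prompts)."""
    return [p - q for p, q in zip(mean_vec(pos), mean_vec(neg))]

def steer(hidden, vec, alpha=1.0):
    """Shift one hidden state along the steering vector at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

# Hypothetical 3-dimensional hidden states, for illustration only.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
vec = steering_vector(pos, neg)                   # [1.0, 1.0, 0.0]
steered = steer([0.5, 0.5, 0.5], vec, alpha=2.0)  # [2.5, 2.5, 0.5]
```

No model weights change in this scheme, which is what makes it training-free; in a real LALM the shift would be applied to chosen layers' activations during the forward pass.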
2026-03-15T22:12:36Z 6 pages, 4 figures, 2 tables Lok-Lam Ieong Chia-Chien Chen Chih-Kai Yang Yu-Han Huang An-Yu Cheng Hung-yi Lee http://arxiv.org/abs/2510.16917v2 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models 2026-03-15T21:57:46Z Knowledge editing enables targeted updates without retraining, but prior work focuses on textual or visual facts, leaving abstract auditory perceptual knowledge underexplored. We introduce SAKE, the first benchmark for editing perceptual auditory attribute knowledge in large audio-language models (LALMs), which requires modifying acoustic generalization rather than isolated facts. We evaluate eight diverse editing methods on three LALMs across reliability, generality, locality, and portability, under single and sequential edits. Results show that most methods enforce edits reliably but struggle with auditory generalization, intra-attribute locality, and multimodal knowledge propagation, and often exhibit forgetting or degeneration in sequential editing. Additionally, fine-tuning the modality connector emerges as a more robust and balanced baseline compared with directly editing the LLM backbones. SAKE reveals key limitations of current methods and provides a foundation for developing auditory-specific LALM editing techniques. 2025-10-19T16:22:09Z Work in progress. Resources: https://github.com/ckyang1124/SAKE Chih-Kai Yang Yen-Ting Piao Tzu-Wen Hsu Szu-Wei Fu Zhehuai Chen Ke-Han Lu Sung-Feng Huang Chao-Han Huck Yang Yu-Chiang Frank Wang Yun-Nung Chen Hung-yi Lee http://arxiv.org/abs/2603.15688v1 PulmoVec: A Two-Stage Stacking Meta-Learning Architecture Built on the HeAR Foundation Model for Multi-Task Classification of Pediatric Respiratory Sounds 2026-03-15T21:13:47Z Background: Respiratory diseases are a leading cause of childhood morbidity and mortality, yet lung auscultation remains subjective and limited by inter-listener variability, particularly in pediatric populations. 
Existing AI approaches are further constrained by small datasets and single-task designs. We developed PulmoVec, a multi-task framework built on the Health Acoustic Representations (HeAR) foundation model for classification of pediatric respiratory sounds. Methods: In this retrospective analysis of the SPRSound database, 24,808 event-level annotated segments from 1,652 pediatric patients were analyzed. Three task-specific classifiers were trained for screening, sound-pattern recognition, and disease-group prediction. Their out-of-fold probability outputs were combined with demographic metadata in a LightGBM stacking meta-model, and event-level predictions were aggregated to the patient level using ensemble voting. Results: At the event level, the screening model achieved an ROC-AUC of 0.96 (95% CI, 0.95-0.97), the sound-pattern recognition model a macro ROC-AUC of 0.96 (95% CI, 0.96-0.97), and the disease-group prediction model a macro ROC-AUC of 0.94 (95% CI, 0.93-0.94). At the patient level, disease-group classification yielded an accuracy of 0.74 (95% CI, 0.71-0.77), a weighted F1-score of 0.73, and a macro ROC-AUC of 0.91 (95% CI, 0.90-0.93). Stacking improved performance across all tasks compared with base models alone. Conclusions: PulmoVec links event-level acoustic phenotyping with patient-level clinical classification, supporting the potential of foundation-model-based digital auscultation in pediatric respiratory medicine. Multi-center external validation across devices and real-world conditions remains essential. 
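The patient-level aggregation step in the PulmoVec abstract can be sketched as a simple majority vote over event-level predictions. The abstract says only "ensemble voting", so plain majority voting, the tie-breaking behaviour, and the labels `group_a`, `group_b`, and `group_c` are assumptions for illustration.

```python
from collections import Counter, defaultdict

def aggregate_by_patient(events):
    """Majority vote over event-level labels for each patient.

    events: iterable of (patient_id, predicted_label) pairs.
    Ties go to the label seen first for that patient.
    """
    votes = defaultdict(Counter)
    for patient_id, label in events:
        votes[patient_id][label] += 1
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}

# Hypothetical event-level predictions for two patients.
events = [
    ("p1", "group_a"),
    ("p1", "group_a"),
    ("p1", "group_b"),
    ("p2", "group_c"),
]
patient_labels = aggregate_by_patient(events)
```

Here `patient_labels` maps `p1` to `group_a` (two votes against one) and `p2` to `group_c`. A probability-weighted vote would be a natural variant when the meta-model outputs class probabilities.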
2026-03-15T21:13:47Z 14 pages, 2 figures, 4 tables; supplementary material included (4 tables, 3 multi-panel figures) Izzet Turkalp Akbasli Oguzhan Serin http://arxiv.org/abs/2603.16941v1 The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs 2026-03-15T16:54:36Z Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker-identity-dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect consistent disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. The bias is implicit: responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, uncovering sharper intersectional disparities. 2026-03-15T16:54:36Z 5 pages, 3 figures, 1 table, Submitted to Interspeech 2026 Shree Harsha Bokkahalli Satish Christoph Minixhofer Maria Teleki James Caverlee Ondřej Klejch Peter Bell Gustav Eje Henter Éva Székely http://arxiv.org/abs/2509.19001v2 HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS 2026-03-15T16:39:06Z Large Language Model (LLM)-based Text-to-Speech (TTS) models have already reached a high degree of naturalness. However, precise control of TTS inference is still challenging. 
Although instruction-based Text-to-Speech (Instruct-TTS) models are proposed, these models still lack fine-grained control due to the modality gap between single-level text instructions and multilevel speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec to extract distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality gap of these tokens, we propose a hierarchical decoding strategy, where the LLM generates tokens in a structured order: first semantic, then fine-grained style, and finally complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/. 2025-09-23T13:45:56Z 5 pages, 2 figures, 3 tables; Accepted to ICASSP2026(Oral) Sihang Nie Xiaofen Xing Jingyuan Xing Baiji Liu Xiangmin Xu http://arxiv.org/abs/2603.14456v1 PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark 2026-03-15T16:06:24Z Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. 
Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench 2026-03-15T16:06:24Z Submitted to Interspeech 2026 Mohammad Javad Ranjbar Kalahroodi Mohammad Amini Parmis Bathayan Heshaam Faili Azadeh Shakery http://arxiv.org/abs/2603.15685v1 DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression 2026-03-15T15:22:06Z Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. 
Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH. 2026-03-15T15:22:06Z Bingzhou Li Tao Huang http://arxiv.org/abs/2603.14432v1 Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations 2026-03-15T15:16:25Z Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream. 2026-03-15T15:16:25Z Deok-Hyeon Cho Hyung-Seok Oh Seung-Bin Kim Seong-Whan Lee
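The boundary-detection step described in the DASH abstract above (cosine-similarity discontinuities between audio embeddings) can be sketched as follows. The fixed `threshold`, the adjacent-frame comparison, and the function names are assumptions for illustration; the abstract does not state the exact detection rule.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def segment(embeddings, threshold=0.5):
    """Split a sequence into variable-length (start, end) ranges, end exclusive,
    cutting wherever adjacent embeddings fall below the similarity threshold."""
    if not embeddings:
        return []
    bounds = [0]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            bounds.append(i)
    bounds.append(len(embeddings))
    return list(zip(bounds[:-1], bounds[1:]))

# Hypothetical 2-D audio embeddings with one abrupt change after index 1.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
segments = segment(emb)  # [(0, 2), (2, 5)]
```

In the setting the abstract describes, the resulting index ranges would then be mapped onto the time-aligned video tokens so that both modalities share the same segments.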