https://arxiv.org/api/vZiaZLjBX7gA1aGSWeRVF4h6158 2026-03-24T13:22:59Z 20243 75 15 http://arxiv.org/abs/2603.15440v1 Music Genre Classification: A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches 2026-03-16T15:43:48Z Automatic music genre classification is a long-standing challenge in Music Information Retrieval (MIR); work on non-Western music traditions remains scarce. Nepali music encompasses culturally rich and acoustically diverse genres--from the call-and-response duets of Lok Dohori to the rhythmic poetry of Deuda and the distinctive melodies of Tamang Selo--that have not been addressed by existing classification systems. In this paper, we construct a novel dataset of approximately 8,000 labeled 30-second audio clips spanning eight Nepali music genres and conduct a systematic comparison of nine classification models across two paradigms. Five classical machine learning classifiers (Logistic Regression, SVM, KNN, Random Forest, and XGBoost) are trained on 51 hand-crafted audio features extracted via Librosa, while four deep learning architectures (CNN, RNN, parallel CNN-RNN, and sequential CNN followed by RNN) operate on Mel spectrograms of dimension 640 x 128. Our experiments reveal that the sequential Convolutional Recurrent Neural Network (CRNN)--in which convolutional layers feed into an LSTM--achieves the highest accuracy of 84%, substantially outperforming both the best classical models (Logistic Regression and XGBoost, both at 71%) and all other deep architectures. We provide per-class precision, recall, F1-score, confusion matrices, and ROC analysis for every model, and offer a culturally grounded interpretation of misclassification patterns that reflects genuine overlaps in Nepal's musical traditions. 
2026-03-16T15:43:48Z 8 pages Sachin Prajuli Abhishek Karna OmPrakash Dhakl http://arxiv.org/abs/2509.26207v2 The silence of the weights: a structural pruning strategy for attention-based audio signal architectures with second order metrics 2026-03-16T15:24:44Z Transformer-based models have become the state of the art across multiple domains, from natural language processing to machine listening, thanks to the attention mechanisms. However, the attention layers require a large number of parameters and high-end hardware for both training and inference. We propose a novel channel-pruning technique explicitly targeted at the attention mechanism, decoupling the pruning of each head and the four layers in the attention block: query, key, value, and output projection matrices, employing a second-order metric to score the network's parameters. We compare our technique against head-pruning strategies and magnitude-driven scoring metrics, investigating the effects of pruning on Audio Spectrogram Transformer (AST) and Whisper. Our results show that even after pruning 50% of the parameters in the attention block, performance is largely preserved. 2025-09-30T13:10:19Z Andrea Diecidue Carlo Alberto Barbano Piero Fraternali Mathieu Fontaine Enzo Tartaglione http://arxiv.org/abs/2509.20396v2 Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling 2026-03-16T15:18:22Z ASR systems struggle with non-normative speech due to high acoustic variability and data scarcity. We propose a data-efficient method using phoneme-level uncertainty to guide fine-tuning for personalization. Instead of computationally expensive ensembles, we leverage Variational Low-Rank Adaptation (VI LoRA) to estimate epistemic uncertainty in foundation models. These estimates form a composite Phoneme Difficulty Score (PhDScore) that drives a targeted oversampling strategy. 
Evaluated on English and German datasets, including a longitudinal analysis against two clinical reports taken one year apart, we demonstrate that: (1) VI LoRA-based uncertainty aligns better with expert clinical assessments than standard entropy; (2) PhDScore captures stable, persistent articulatory difficulties; and (3) uncertainty-guided sampling significantly improves ASR accuracy for impaired speech. 2025-09-23T12:54:30Z Niclas Pokel Pehuén Moure Roman Böhringer Yingqiang Gao http://arxiv.org/abs/2603.15261v1 Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization 2026-03-16T13:27:53Z Personalizing automatic speech recognition (ASR) systems for non-normative speech, such as dysarthric and aphasic speech, is challenging. While speaker-specific fine-tuning (SS-FT) is widely used, it is typically initialized directly from a generic pre-trained model. Whether speaker-independent adaptation provides a stronger initialization prior under such mismatch remains unclear. In this work, we propose a two-stage adaptation framework consisting of speaker-independent fine-tuning (SI-FT) on multi-speaker non-normative data followed by SS-FT, and evaluate it through a controlled comparison with direct SS-FT under identical per-speaker conditions. Experiments on AphasiaBank and UA-Speech with Whisper-Large-v3 and Qwen3-ASR, alongside evaluation on typical-speech datasets TED-LIUM v3 and FLEURS, show that two-stage adaptation consistently improves personalization while maintaining manageable out-of-domain (OOD) trade-offs. 2026-03-16T13:27:53Z submitted to Interspeech 2026 Shan Jiang Jiawen Qi Chuanbing Huo Yingqiang Gao Qinyu Chen http://arxiv.org/abs/2510.17512v2 AWARE: Audio Watermarking with Adversarial Resistance to Edits 2026-03-16T10:48:28Z Prevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. 
However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance to Edits), an alternative approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained through adversarial optimization in the time-frequency domain under a level-proportional perceptual budget. Detection employs a time-order-agnostic detector with a Bitwise Readout Head (BRH) that aggregates temporal evidence into one score per watermark bit, enabling reliable watermark decoding even under desynchronization and temporal cuts. Empirically, AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits, often surpassing representative state-of-the-art learning-based systems. 2025-10-20T13:10:52Z Kosta Pavlović Lazar Stanarević Petar Nedić Elena Nešović Slavko Kovačević Igor Djurović http://arxiv.org/abs/2510.12720v2 Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception 2026-03-16T10:45:28Z Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. 
Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions. 2025-10-14T17:00:09Z Accepted by ICLR2026. Open Source at https://github.com/ddlBoJack/Omni-Captioner Ziyang Ma Ruiyang Xu Zhenghao Xing Yunfei Chu Yuxuan Wang Jinzheng He Jin Xu Pheng-Ann Heng Kai Yu Junyang Lin Eng Siong Chng Xie Chen http://arxiv.org/abs/2603.15083v1 ReactMotion: Generating Reactive Listener Motions from Speaker Utterance 2026-03-16T10:37:42Z In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. 
This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, which conventional motion metrics focused on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions. 2026-03-16T10:37:42Z 42 pages, 11 tables, 8 figures Cheng Luo Bizhu Wu Bing Li Jianfeng Ren Ruibin Bai Rong Qu Linlin Shen Bernard Ghanem http://arxiv.org/abs/2603.15037v1 PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation 2026-03-16T09:40:56Z The growing sophistication of speech generated by Artificial Intelligence (AI) has introduced new challenges in audio deepfake detection. Text-to-speech (TTS) and voice conversion (VC) technologies can create highly convincing synthetic speech with naturalness and intelligibility. This poses serious threats to voice biometric security and to systems designed to combat the spread of spoken misinformation, where synthetic voices may be used to disseminate false or malicious content. While interest in AI-generated speech has increased, resources for evaluating naturalness at the phoneme level remain limited. In this work, we address this gap by presenting the Phoneme-Level DeepFake dataset (PhonemeDF), comprising parallel real and synthetic speech segmented at the phoneme level. Real speech samples are derived from a subset of LibriSpeech, while synthetic samples are generated using four TTS and three VC systems. 
For each system, phoneme-aligned TextGrid files are obtained using the Montreal Forced Aligner (MFA). We compute the Kullback-Leibler divergence (KLD) between real and synthetic phoneme distributions to quantify fidelity and establish a ranking based on similarity to natural speech. Our findings show a clear correlation between the KLD of real and synthetic phoneme distributions and the performance of classifiers trained to distinguish them, suggesting that KLD can serve as an indicator of the most discriminative phonemes for deepfake detection. 2026-03-16T09:40:56Z 11 pages, 6 figures, 9 tables. Accepted at the 15th Language Resources and Evaluation Conference (LREC 2026), Palma, Spain Vamshi Nallaguntla Aishwarya Fursule Shruti Kshirsagar Anderson R. Avila http://arxiv.org/abs/2601.15668v2 EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning 2026-03-16T09:27:03Z Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. 
To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: https://github.com/dingdongwang/EmotionThinker 2026-01-22T05:51:53Z ICLR 2026 (Oral). Project page: https://github.com/dingdongwang/EmotionThinker Dingdong Wang Shujie Liu Tianhua Zhang Youjun Chen Jinyu Li Helen Meng http://arxiv.org/abs/2506.04779v3 MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark 2026-03-16T09:24:19Z Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. 
MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU. 2025-06-05T09:09:36Z ICLR 2026. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Project page https://github.com/dingdongwang/MMSU Dingdong Wang Junan Li Jincenzi Wu Dongchao Yang Xueyuan Chen Tianhua Zhang Helen Meng http://arxiv.org/abs/2602.09823v2 Covo-Audio Technical Report 2026-03-16T09:19:54Z In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. 
Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs. 2026-02-10T14:31:11Z Technical Report Wenfu Wang Chenxing Li Liqiang Zhang Yiyang Zhao Yuxiang Zou Hanzhao Li Mingyu Cui Hao Zhang Kun Wei Le Xu Zikang Huang Jiajun Xu Jiliang Hu Xiang He Zeyu Xie Jiawen Kang Youjun Chen Meng Yu Dong Yu Rilin Chen Linlin Di Shulin Feng Na Hu Yang Liu Bang Wang Shan Yang http://arxiv.org/abs/2603.14983v1 Cepstral Smoothing of Binary Masks for Convolutive Blind Separation of Speech Mixtures 2026-03-16T08:45:52Z In this paper, we propose a novel separation system for extracting two speech signals from two microphone recordings. Our system combines a blind source separation (BSS) technique with cepstral smoothing of binary time-frequency masks. The latter is composed of two steps. First, the two binary masks are estimated from the separated output signals of the BSS algorithm. 
In the second step, cepstral smoothing is applied to these spectral masks in order to reduce the musical noise typically produced by time-frequency masking. Experiments were carried out with both artificially mixed speech signals using a simulated room model and two real recordings. The evaluation results are promising and show the effectiveness of our system. 2026-03-16T08:45:52Z International Journal of Digital Content Technology and its Applications (JDCTA), vol. 6, no. 17, pp. 532-541, 2012 Ibrahim Missaoui Zied Lachiri 10.4156/jdcta.vol6.issue17.58 http://arxiv.org/abs/2601.04658v2 LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence 2026-03-16T08:13:54Z Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps. 
2026-01-08T07:05:35Z 5 pages, 2 figures; Accepted to ICASSP 2026 Hyeongkeun Lee Jongmin Choi KiHyun Nam Joon Son Chung http://arxiv.org/abs/2602.07803v2 SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis 2026-03-16T08:10:29Z While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings. 2026-02-08T03:51:23Z Technical Report Jiale Qian Hao Meng Tian Zheng Pengcheng Zhu Haopeng Lin Yuhang Dai Hanke Xie Wenxiao Cao Ruixuan Shang Jun Wu Hongmei Liu Hanlin Wen Jian Zhao Zhonglin Jiang Yong Chen Shunshun Yin Ming Tao Jianguo Wei Lei Xie Xinsheng Wang http://arxiv.org/abs/2601.00557v2 A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR 2026-03-16T07:50:05Z Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. 
In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves comparable performance to two-stage inference approaches while reducing RTF by 11.7% and 8.2%, respectively, leading to improved decoding efficiency for low-resource mASR applications. 2026-01-02T04:08:39Z 5 pages, submitted to IEEE Communications Letters Yuang Zheng Dongxu Chen Yuxiang Mei Dongxing Xu Jie Chen Yanhua Long