https://arxiv.org/api/+QWJbgjKwRSKPZxiZ9vGu0MbaDE 2026-04-03T13:51:16Z 21216 105 15 http://arxiv.org/abs/2603.21073v1 SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing 2026-03-22T06:00:41Z Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/. 2026-03-22T06:00:41Z Under Review Jianyi Chen Rongxiu Zhong Shilei Zhang Kun Qian Jinglei Liu Yike Guo Wei Xue http://arxiv.org/abs/2603.25752v1 Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition 2026-03-22T01:27:22Z In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for MCER. Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion. 2026-03-22T01:27:22Z 19 pages neurocomputing2026 Ying Liu Yuntao Shou Wei Ai Tao Meng Keqin Li 10.1016/j.neucom.2026.133306 http://arxiv.org/abs/2510.07909v2 Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor 2026-03-21T10:03:46Z Backdoor data poisoning is a crucial technique for ownership protection and defending against malicious attacks. Embedding hidden triggers in training data can manipulate model outputs, enabling provenance verification, and deterring unauthorized use. However, current audio backdoor methods are suboptimal, as poisoned audio often exhibits degraded perceptual quality, which is noticeable to human listeners. This work explores the intrinsic stealthiness and effectiveness of audio watermarking in achieving successful poisoning. We propose a novel Watermark-as-Trigger concept, integrated into the Bloodroot backdoor framework via adversarial LoRA fine-tuning, which enhances perceptual quality while achieving a much higher trigger success rate and clean-sample accuracy. Experiments on speech recognition (SR) and speaker identification (SID) datasets show that watermark-based poisoning remains effective under acoustic filtering and model pruning. The proposed Bloodroot backdoor framework not only secures data-to-model ownership, but also well reveals the risk of adversarial misuse. 2025-10-09T08:02:00Z 5 pages, 3 figures accepted to ICASSP 2026 Kuan-Yu Chen Yi-Cheng Lin Jeng-Lin Li Jian-Jiun Ding http://arxiv.org/abs/2603.20638v1 OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement 2026-03-21T04:21:22Z Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rate. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of the pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the same bitrate, delivering superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced. Our demo page is available. 2026-03-21T04:21:22Z Jingbin Hu Haoyu Zhang Dake Guo Qirui Zhan Wenhao Li Huakang Chen Guobin Ma Hanke Xie Chengyou Wang Pengyuan Xie Chuan Xie Qiang Zhang Lei Xie http://arxiv.org/abs/2603.20433v1 ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models' In-Context Learning Ability 2026-03-20T19:03:29Z While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration. 2026-03-20T19:03:29Z Submitted to Interspeech 2026 Yen-Ting Piao Jay Chiehen Liao Wei-Tang Chien Toshiki Ogimoto Shang-Tse Chen Yun-Nung Chen Chun-Yi Lee Shao-Yuan Lo http://arxiv.org/abs/2603.20387v1 End-to-End Multi-Task Learning for Adjustable Joint Noise Reduction and Hearing Loss Compensation 2026-03-20T18:05:23Z A multi-task learning framework is proposed for optimizing a single deep neural network (DNN) for joint noise reduction (NR) and hearing loss compensation (HLC). A distinct training objective is defined for each task, and the DNN predicts two time-frequency masks. During inference, the amounts of NR and HLC can be adjusted independently by exponentiating each mask before combining them. In contrast to recent approaches that rely on training an auditory-model emulator to define a differentiable training objective, we propose an auditory model that is inherently differentiable, thus allowing end-to-end optimization. The audiogram is provided as an input to the DNN, thereby enabling listener-specific personalization without the need for retraining. Results show that the proposed approach not only allows adjusting the amounts of NR and HLC individually, but also improves objective metrics compared to optimizing a single training objective. It also outperforms a cascade of two DNNs that were separately trained for NR and HLC, and shows competitive HLC performance compared to a traditional hearing-aid prescription. To the best of our knowledge, this is the first study that uses an auditory model to train a single DNN for both NR and HLC across a wide range of listener profiles. 2026-03-20T18:05:23Z Philippe Gonzalez Vera Margrethe Frederiksen Torsten Dau Tobias May http://arxiv.org/abs/2603.20165v1 Audio Avatar Fingerprinting: An Approach for Authorized Use of Voice Cloning in the Era of Synthetic Audio 2026-03-20T17:42:24Z With the advancements in AI speech synthesis, it is easier than ever before to generate realistic audio in a target voice. One only needs a few seconds of reference audio from the target, quite literally putting words in the target person's mouth. This imposes a new set of forensics-related challenges on speech-based authentication systems, videoconferencing, and audio-visual broadcasting platforms, where we want to detect synthetic speech. At the same time, leveraging AI speech synthesis can enhance the different modes of communication through features such as low-bandwidth communication and audio enhancements - leading to ever-increasing legitimate use-cases of synthetic audio. In this case, we want to verify if the synthesized voice is actually spoken by the user. This will require a mechanism to verify whether a given synthetic audio is driven by an authorized identity, or not. We term this task audio avatar fingerprinting. As a step towards audio forensics in these new and emerging situations, we analyze and extend an off-the-shelf speaker verification model developed outside of forensics context for the task of fake speech detection and audio avatar fingerprinting, the first experimentation of its kind. Furthermore, we observe that no existing dataset allows for the novel task of verifying the authorized use of synthetic audio - a limitation which we address by introducing a new speech forensics dataset for this novel task. 2026-03-20T17:42:24Z Candice R. Gerstner http://arxiv.org/abs/2402.01703v5 Community-Informed AI Models for Police Accountability 2026-03-20T17:39:55Z Face-to-face interactions between police officers and the public affect both individual well-being and democratic legitimacy. Many government-public interactions are captured on video, including interactions between police officers and drivers captured on bodyworn cameras (BWCs). New advances in AI technology enable these interactions to be analyzed at scale, opening promising avenues for improving government transparency and accountability. However, for AI to serve democratic governance effectively, models must be designed to include the preferences and perspectives of the governed. This article proposes a community-informed, approach to developing multi-perspective AI tools for government accountability. We illustrate our approach by describing the research project through which the approach was inductively developed: an effort to build AI tools to analyze BWC footage of traffic stops conducted by the Los Angeles Police Department. We focus on the role of social scientists as members of multidisciplinary teams responsible for integrating the perspectives of diverse stakeholders into the development of AI tools in the domain of police -- and government -- accountability. 2024-01-24T19:56:20Z 33 pages, 4 figures, 2 tables Benjamin A. T. Grahama Lauren Brown Georgios Chochlakis Morteza Dehghani Raquel Delerme Brittany Friedman Ellie Graeden Preni Golazizian Rajat Hebbar Parsa Hejabi Aditya Kommineni Mayagüez Salinas Michael Sierra-Arévalo Jackson Trager Nicholas Weller Shrikanth Narayanan http://arxiv.org/abs/2603.20118v1 BioDCASE 2026 Challenge Baseline for Cross-Domain Mosquito Species Classification 2026-03-20T16:41:35Z Mosquito-borne diseases affect more than one billion people each year and cause close to one million deaths. Traditional surveillance methods rely on traps and manual identification that are slow, labor-intensive, and difficult to scale. Audio-based mosquito monitoring offers a non-destructive, lower-cost, and more scalable complement to trap-based surveillance, but reliable species classification remains difficult under real-world recording conditions. Mosquito flight tones are narrow-band, often low in signal-to-noise ratio, and easily masked by background noise, and recordings for several epidemiologically relevant species remain limited, creating pronounced class imbalance. Variation across devices, environments, and collection protocols further increases the difficulty of robust classification. Such variation can cause models to rely on domain-specific recording artefacts rather than species-relevant acoustic cues, which makes transfer to new acquisition settings difficult. The BioDCASE 2026 Cross-Domain Mosquito Species Classification (CD-MSC) challenge is designed around this deployment problem by evaluating performance on both seen and unseen domains. This paper presents the official baseline system and evaluation pipeline as a simple, fully reproducible reference for the CD-MSC challenge task. The baseline uses log-mel features and a multitemporal resolution convolutional neural network (MTRCNN) with species and auxiliary domain outputs, together with complete training and test scripts. The baseline system performs strongly on seen domains but degrades markedly on unseen domains, showing that cross-domain generalisation, rather than within-domain recognition, is the central challenge for practical mosquito species classification from multi-source bioacoustic recordings. 2026-03-20T16:41:35Z BioDCASE 2026 CD-MSC Baseline, source code and models: https://github.com/Yuanbo2020/CD-MSC Yuanbo Hou Vanja Zdravkovic Marianne Sinka Yunpeng Li Wenwu Wang Mark D. Plumbley Kathy Willis Stephen Roberts http://arxiv.org/abs/2603.15597v2 AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer 2026-03-20T11:51:41Z Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning. Code and demo are available at: https://ff2416.github.io/AC-Foley-Page 2026-03-16T17:53:07Z Accepted at ICLR 2026. 15 pages, 5 figures, add project webpage Pengjun Fang Yingqing He Yazhou Xing Qifeng Chen Ser-Nam Lim Harry Yang http://arxiv.org/abs/2603.19831v1 Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech? 2026-03-20T10:17:10Z Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/ 2026-03-20T10:17:10Z Accepted at The 2nd International Workshop on Bodily Expressed Emotion Understanding (BEEU) at AAAI 2026 [non-archival] Lokesh Kumar Nirmesh Shah Ashishkumar P. Gudmalwar Pankaj Wasnik http://arxiv.org/abs/2603.19798v1 Borderless Long Speech Synthesis 2026-03-20T09:37:54Z Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis. 2026-03-20T09:37:54Z Xingchen Song Di Wu Dinghao Zhou Pengyu Cheng Hongwu Ding Yunchao He Jie Wang Shengfan Shen Sixiang Lv Lichun Fan Hang Su Yifeng Wang Shuai Wang Meng Meng Jian Luan http://arxiv.org/abs/2603.19697v1 Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction 2026-03-20T07:05:29Z The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling the separation and target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can act as a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to the original backbones. Audio samples are available at: https://plugandsteer.github.io 2026-03-20T07:05:29Z Submitted to Interspeech 2026; demo available https://plugandsteer.github.io Doyeop Kwak Suyeon Lee Joon Son Chung http://arxiv.org/abs/2509.24773v4 VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning 2026-03-20T03:36:49Z Video-conditioned audio generation, including Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been treated as distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solve both problems. To effectively handle multiple input signals within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism leveraging distinct intrinsic properties of attention layers: cross-attention for semantic conditions, and self-attention for temporally-intensive conditions. Besides, contrary to the prevailing belief that joint training for the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance during end-to-end joint learning process. Furthermore, we use a straightforward feature-level data synthesis method, demonstrating that our framework provides a robust foundation that easily adapts to joint sound and speech generation using synthetic data. Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative models. Project page: https://vasflow1.github.io/vasflow/ 2025-09-29T13:38:24Z Paper Under Review Xin Cheng Yuyue Wang Xihua Wang Yihan Wu Kaisi Guan Yijing Chen Peng Zhang Xiaojiang Liu Meng Cao Ruihua Song http://arxiv.org/abs/2603.18048v2 DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models 2026-03-20T01:17:57Z Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding. 2026-03-17T15:52:26Z 14 pages,6 figures Jiaqi Xiong Yunjia Qi Qi Cao Yu Zheng Yutong Zhang Ziteng Wang Ruofan Liao Weisheng Xu Sichen Liu