http://arxiv.org/abs/2509.14789v2 Acoustic Simulation Framework for Multi-channel Replay Speech Detection 2026-05-29T10:37:14Z

Replay speech attacks pose a significant threat to voice-controlled systems, especially in smart environments where voice assistants are widely deployed. While multi-channel audio offers spatial cues that can enhance replay detection robustness, existing datasets and methods predominantly rely on single-channel recordings. Moreover, previous studies highlighted that generalization of this attack to new environments is challenging, requiring new methods for generating data encompassing various acoustic conditions. Hence, in this work we introduce an acoustic simulation framework designed to simulate multi-channel replay speech configurations using publicly available resources. Using the framework, we train the state-of-the-art multi-channel replay detector M-ALRAD and evaluate its generalisation on the ReMASC real-recording corpus without any real training data. To improve the exploitation of spatial information, we extend M-ALRAD with inter-channel phase difference features computed for adjacent microphone pairs, augmenting the beamformed representation with directional cues. Synthetic datasets will be available upon acceptance of the paper.
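The inter-channel phase difference (IPD) features mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `stft` and `adjacent_ipd`, the FFT size, the hop length, and the choice of non-circular adjacent pairs (0,1), (1,2), ... are all assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive single-channel STFT with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def adjacent_ipd(multichannel, n_fft=256, hop=128):
    """IPD between adjacent channel pairs (0,1), (1,2), ... (assumed pairing).

    multichannel: array of shape (channels, samples).
    Returns an array of shape (channels - 1, frames, bins) with phase
    differences in radians, wrapped to (-pi, pi].
    """
    spec = np.stack([stft(ch, n_fft, hop) for ch in multichannel])
    # The angle of X_m * conj(X_{m+1}) is the wrapped phase difference.
    return np.angle(spec[:-1] * np.conj(spec[1:]))
```

In practice the cosine and sine of the IPD are often used as network inputs instead of the raw angle, since they avoid the discontinuity at the wrap-around point; whether the paper does so is not stated in the abstract.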

2025-09-18T09:38:58Z Submitted to IEEE MMSP 2026 Michael Neri Tuomas Virtanen http://arxiv.org/abs/2605.31101v1 On the Use of Dereverberation for Acoustic Feedback Cancellation 2026-05-29T10:13:24Z

In public address systems and hearing aids, the maximally achievable amplification or gain is limited by acoustic feedback. Therefore, in order to be able to apply a higher gain, feedback cancellation methods are required. In addition, it is oftentimes also desirable to dereverberate a recorded signal, that is, remove the late reverberation component of the signal, before playing it back. In this paper, it is shown that under two mild conditions, the acoustic feedback signal can be written as a reverberant version of the source signal. Therefore, it is possible to treat the joint dereverberation and acoustic feedback cancellation problem as a dereverberation-only problem, meaning that dereverberation algorithms can be applied to the joint problem. Simulations corroborate this finding.
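The intuition that the feedback signal is a filtered version of the source can be seen in a textbook single-channel closed-loop model. The sketch below is not the paper's derivation: the symbols $S$, $Y$, $G$, $F$ and the assumptions of linear time-invariant paths and a stable loop are introduced here for illustration, and the paper's two conditions are not stated in the abstract.

```latex
% Assumed model: source S(z), microphone Y(z), forward path (gain) G(z),
% acoustic feedback path F(z); all paths LTI and the loop stable.
\begin{align}
  Y(z) &= S(z) + F(z)\,G(z)\,Y(z)
  \quad\Longrightarrow\quad
  Y(z) = \frac{1}{1 - G(z)F(z)}\,S(z), \\
  \underbrace{F(z)\,G(z)\,Y(z)}_{\text{feedback component}}
  &= \frac{G(z)F(z)}{1 - G(z)F(z)}\,S(z).
\end{align}
```

Under this model the feedback component is the source passed through a long recursive filter, which is structurally similar to a reverberant signal.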

2026-05-29T10:13:24Z Accepted for publication in proceedings of EUSIPCO 2026 Basil Liekens Arnout Roebben Toon van Waterschoot Marc Moonen http://arxiv.org/abs/2605.30993v1 SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue 2026-05-29T08:27:57Z

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1--4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.

2026-05-29T08:27:57Z Technical Report Ruiqi Li Yu Zhang Changhao Pan Ke Lei Xiang Yin Cheng Yang http://arxiv.org/abs/2605.30965v1 ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment 2026-05-29T07:58:54Z

Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

2026-05-29T07:58:54Z Accepted to ACL 2026 main conference. Code is available at https://github.com/jjunak-yun/ImmersiveTTS Jun-Hak Yun Seung-Bin Kim Seong-Whan Lee http://arxiv.org/abs/2605.30940v1 Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer 2026-05-29T07:28:24Z

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a trade-off between generation quality and inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

2026-05-29T07:28:24Z Accepted by ICML 2026 Ke Lei Yu Zhang Changhao Pan Xueyi Pu Wenxiang Guo Ruiqi Li Zhou Zhao http://arxiv.org/abs/2605.30899v1 A Unified and Reproducible Experimentation Framework for Speech Understanding 2026-05-29T06:33:36Z

Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.

2026-05-29T06:33:36Z This paper is submitted to INTERSPEECH 2026 Jing Peng Junhao Du Chenghao Wang Hanqi Li Yi Yang Yixuan Wang Xiaoyu Gu Guanyu Chen Yucheng Wang Jiang Li Zhangjie Zhao Haoran Wang Wenming Tu Haoyu Li Duo Ma Lirong Qian Yu Xi Wen Wen Jiaqi Guo Hui Zhang Shuai Fan Wenbin Jiang Shuai Wang Kai Yu http://arxiv.org/abs/2605.30792v1 OpenSTBench: Beyond Semantic Evaluation for Speech Translation 2026-05-29T03:31:04Z

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.

2026-05-29T03:31:04Z Submitted to EMNLP 2026 Yanjie An Yuxiang Zhao Yichi Zhang Qixi Zheng Yujie Tu Keqi Deng Kai Yu Xie Chen http://arxiv.org/abs/2605.24863v2 Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems 2026-05-29T01:13:17Z

Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.

2026-05-24T04:46:30Z 4 pages, 1 figure, work in progress Yang Xiao Siyi Wang Eun-Jung Holden Ting Dang http://arxiv.org/abs/2601.13704v3 Performance and Complexity Trade-off Optimization of Speech Models During Training 2026-05-28T23:47:31Z

In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.

2026-01-20T08:00:05Z This work has been submitted to the IEEE for possible publication Esteban Gómez Tom Backström http://arxiv.org/abs/2605.30594v1 FiPA-SR -- FiLM-Conditioned Perceptually Informed Audio Super-Resolution 2026-05-28T21:40:06Z

Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMamba}_\textrm{P}$ framework, the proposed model incorporates FiLM layers to adapt the reconstruction process according to the respective bandwidth. Experiments on the MUSDB dataset show that FiPA-SR outperforms the state-of-the-art AudioSR model across 8, 20, and 32 kHz input sampling rates. Moreover, the proposed architecture uses approximately 3$\times$ less GPU memory and performs inference more than 60$\times$ faster than the diffusion-based baseline.
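For readers unfamiliar with FiLM (feature-wise linear modulation), the following NumPy sketch shows the general mechanism: a conditioning vector produces a per-channel scale and shift that are applied to the feature map. It is a generic illustration, not the FiPA-SR code; the class name, the linear conditioning maps, the initialization, and the one-hot bandwidth encoding in the usage note are assumptions.

```python
import numpy as np

class FiLM:
    """Feature-wise linear modulation: y = gamma(c) * x + beta(c)."""

    def __init__(self, cond_dim, n_channels, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        # Linear maps from the conditioning vector to per-channel scale
        # and shift (hypothetical initialization, for illustration only).
        self.w_gamma = rng.normal(0.0, 0.1, (cond_dim, n_channels))
        self.w_beta = rng.normal(0.0, 0.1, (cond_dim, n_channels))
        self.b_gamma = np.ones(n_channels)   # zero conditioning -> identity
        self.b_beta = np.zeros(n_channels)

    def __call__(self, x, cond):
        """x: (batch, channels, time); cond: (batch, cond_dim)."""
        gamma = cond @ self.w_gamma + self.b_gamma
        beta = cond @ self.w_beta + self.b_beta
        # Broadcast the per-channel parameters over the time axis.
        return gamma[:, :, None] * x + beta[:, :, None]
```

One way to condition on bandwidth would be a one-hot vector over the three input sampling rates (8, 20, and 32 kHz), e.g. `FiLM(3, 4)(x, np.eye(3)[[0, 2]])` for a batch of two signals with four feature channels; how the paper encodes the bandwidth is not stated in the abstract.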

2026-05-28T21:40:06Z Submitted to the XLIV BRAZILIAN SYMPOSIUM ON TELECOMMUNICATIONS AND SIGNAL PROCESSING - SBrT 2026 Wallace Abreu Luiz W. P. Biscainho http://arxiv.org/abs/2605.30339v1 Benchmarking Single-Factor Physical Video-to-Audio Generation 2026-05-28T17:59:09Z

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/

2026-05-28T17:59:09Z CVPR 2026 Tingle Li Siddharth Gururani Kevin J. Shih Gantavya Bhatt Sang-gil Lee Zhifeng Kong Arushi Goel Gopala Anumanchipalli Ming-Yu Liu http://arxiv.org/abs/2508.12001v3 FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis 2026-05-28T14:34:05Z

Current non-autoregressive (NAR) text-to-speech (TTS) systems still struggle to model diverse and speaker-dependent duration variation. We further observe that richer duration variation can increase the synthesis difficulty of existing HiFi-GAN-based vocoders, leading to spectral artifacts and unstable time-frequency structures. To address these issues, we propose FNH-TTS, a VITS-based end-to-end TTS system with Mixture-of-Experts duration modeling and robust vocoder-side synthesis. Specifically, we introduce a Mixture-of-Experts Duration Predictor (MoE-DP) to capture diverse phoneme duration patterns and speaker-dependent speaking-rate characteristics. To convert richer duration variation into stable waveform generation, we further integrate a VOCOS-style vocoder with Collaborative Multi-Band and Sub-Band Discriminators. Experiments on LJSpeech, VCTK, and LibriTTS show that FNH-TTS achieves improved synthesis quality, duration-category accuracy, vocoder reconstruction quality, and inference efficiency. Further analysis shows that MoE-DP is the main source of improved duration modeling, while stronger vocoder-side components are necessary for robust synthesis under richer duration variation.
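The basic idea of a mixture-of-experts duration predictor can be sketched as a gate-weighted combination of several expert predictions per phoneme. The NumPy example below uses dense softmax gating and linear experts purely as stand-ins; the routing scheme, expert architecture, and all names (`moe_duration`, `gate_w`, `expert_w`) are assumptions and not details of MoE-DP.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_duration(h, gate_w, expert_w):
    """Gate-weighted mixture of linear duration experts (illustrative).

    h:        (phonemes, dim) phoneme-level hidden features
    gate_w:   (dim, experts) gating weights
    expert_w: (experts, dim) one linear expert per row
    Returns (log_durations, gates) with shapes (phonemes,) and
    (phonemes, experts).
    """
    gates = softmax(h @ gate_w)          # how much each expert contributes
    expert_out = h @ expert_w.T          # each expert's log-duration estimate
    return (gates * expert_out).sum(axis=-1), gates
```

Because the gates depend on the input features, different phonemes (or speakers, if speaker information is part of `h`) can rely on different experts, which is the property that lets a mixture capture more varied duration patterns than a single predictor.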

2025-08-16T10:04:21Z Qingliang Meng Yuqing Deng Wei Liang Limei Yu Huizhi Liang Tian Li http://arxiv.org/abs/2605.29950v1 Frequency-Modulated and Single-Tone Excitation to Reveal Vibro-Acoustic Nonlinearities in Loosened Bolted Joints 2026-05-28T13:57:38Z

Preload loss in bolted joints results in alterations of the stiffness, damping, and nonlinearity of the structure, but existing monitoring techniques for rail-vehicle systems are often not capable of combining controlled shaker tests and sensing of nonlinear features. This paper proposes a method for detecting bolt loosening using a vibro-acoustic technique, where the structure is subjected to controlled shaker tests to sense the nonlinear features. A triaxial accelerometer was attached to the demonstrator, a microphone was placed in close proximity, and one of the bolts was tested under 0%, 20%, 40%, and 80% preload conditions. Single-tone and frequency-modulated (FM) signals close to the main natural frequency of 130 Hz, which was identified using sine sweep and narrow-band excitation, were applied to the demonstrator. When the structure was subjected to 130 Hz single-tone excitation, the loose state of the bolt exhibited several additional high-frequency spectral peaks. FM excitation between 125 and 135 Hz further distinguished between the states. Harmonic band power ratios, normalized to the carrier, distinguished between the loose state and the 80% preload state, where the difference between the loose and 80% preload states was 17.5 dB for l = 2 and 36.5 dB for l = 6.
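A harmonic band power ratio normalized to the carrier can be computed along the following lines. This is a sketch under stated assumptions rather than the authors' procedure: it assumes that l denotes the harmonic order, that each band is centered on l times the carrier frequency, and that a band half-width of 5 Hz is adequate; the function names are hypothetical.

```python
import numpy as np

def band_power(x, fs, f_center, half_width):
    """Sum of Hann-windowed periodogram bins within f_center +/- half_width."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    mask = np.abs(freqs - f_center) <= half_width
    return spec[mask].sum()

def harmonic_ratio_db(x, fs, f_carrier, order, half_width=5.0):
    """Power around order * f_carrier relative to the carrier band, in dB."""
    p_harmonic = band_power(x, fs, order * f_carrier, half_width)
    p_carrier = band_power(x, fs, f_carrier, half_width)
    return 10.0 * np.log10(p_harmonic / p_carrier)
```

For a test signal made of a 130 Hz tone plus a second harmonic at one tenth of its amplitude, `harmonic_ratio_db(x, fs, 130.0, 2)` gives approximately -20 dB, as expected from the amplitude ratio. Comparing such ratios between preload states yields differences in dB of the kind the abstract reports.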

2026-05-28T13:57:38Z Berkay Kullukcu Robin Pianowski Dina Hannebauer http://arxiv.org/abs/2605.29931v1 It's All About Speed: AI's Impact on Workflow in Music Production 2026-05-28T13:43:26Z

In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.

2026-05-28T13:43:26Z Audio Engineering Society Conference Paper - Presented at the AES International Conference on Machine Learning and Artificial Intelligence for Audio 2025 - September 8-10, London, UK Finn McClellan Fabio Morreale http://arxiv.org/abs/2605.29862v1 Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions 2026-05-28T12:43:03Z

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.

2026-05-28T12:43:03Z 2 figures, 4 tables, and 5 pages Heejoon Koo Yoon Tae Kim Miika Toikkanen June-Woo Kim