http://arxiv.org/abs/2509.14789v2
Acoustic Simulation Framework for Multi-channel Replay Speech Detection
2026-05-29T10:37:14Z
Replay speech attacks pose a significant threat to voice-controlled systems, especially in smart environments where voice assistants are widely deployed. While multi-channel audio offers spatial cues that can enhance replay detection robustness, existing datasets and methods predominantly rely on single-channel recordings. Moreover, previous studies highlighted that generalization of this attack to new environments is challenging, requiring new methods for generating data encompassing various acoustic conditions. Hence, in this work we introduce an acoustic simulation framework designed to simulate multi-channel replay speech configurations using publicly available resources. Using the framework, we train the state-of-the-art multi-channel replay detector M-ALRAD and evaluate its generalisation on the ReMASC real-recording corpus without any real training data. To improve the exploitation of spatial information, we extend M-ALRAD with inter-channel phase difference features computed for adjacent microphone pairs, augmenting the beamformed representation with directional cues. Synthetic datasets will be available upon acceptance of the paper.
2025-09-18T09:38:58Z
Submitted to IEEE MMSP 2026
Michael Neri, Tuomas Virtanen

http://arxiv.org/abs/2605.31101v1
On the Use of Dereverberation for Acoustic Feedback Cancellation
2026-05-29T10:13:24Z
In public address systems and hearing aids, the maximally achievable amplification or gain is limited by acoustic feedback. Therefore, in order to be able to apply a higher gain, feedback cancellation methods are required. In addition, it is oftentimes also desirable to dereverberate a recorded signal, that is, remove the late reverberation component of the signal, before playing it back.
In this paper, it is shown that under two mild conditions, the acoustic feedback signal can be written as a reverberant version of the source signal. Therefore, it is possible to treat the joint dereverberation and acoustic feedback cancellation problem as a dereverberation-only problem, meaning that dereverberation algorithms can be applied to the joint problem. Simulations corroborate this finding.
2026-05-29T10:13:24Z
Accepted for publication in proceedings of EUSIPCO 2026
Basil Liekens, Arnout Roebben, Toon van Waterschoot, Marc Moonen

http://arxiv.org/abs/2605.30993v1
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
2026-05-29T08:27:57Z
Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1--4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards.
On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.
2026-05-29T08:27:57Z
Technical Report
Ruiqi Li, Yu Zhang, Changhao Pan, Ke Lei, Xiang Yin, Cheng Yang

http://arxiv.org/abs/2605.30965v1
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
2026-05-29T07:58:54Z
Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.
2026-05-29T07:58:54Z
Accepted to ACL 2026 main conference.
Code is available at https://github.com/jjunak-yun/ImmersiveTTS
Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

http://arxiv.org/abs/2605.30940v1
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
2026-05-29T07:28:24Z
Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks.
Demos can be found at: https://swanaigc.github.io.
2026-05-29T07:28:24Z
Accepted by ICML 2026
Ke Lei, Yu Zhang, Changhao Pan, Xueyi Pu, Wenxiang Guo, Ruiqi Li, Zhou Zhao

http://arxiv.org/abs/2605.30899v1
A Unified and Reproducible Experimentation Framework for Speech Understanding
2026-05-29T06:33:36Z
Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.
2026-05-29T06:33:36Z
This paper is submitted to INTERSPEECH 2026
Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li, Yi Yang, Yixuan Wang, Xiaoyu Gu, Guanyu Chen, Yucheng Wang, Jiang Li, Zhangjie Zhao, Haoran Wang, Wenming Tu, Haoyu Li, Duo Ma, Lirong Qian, Yu Xi, Wen Wen, Jiaqi Guo, Hui Zhang, Shuai Fan, Wenbin Jiang, Shuai Wang, Kai Yu

http://arxiv.org/abs/2605.30792v1
OpenSTBench: Beyond Semantic Evaluation for Speech Translation
2026-05-29T03:31:04Z
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior.
Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.
2026-05-29T03:31:04Z
Submitted to EMNLP 2026
Yanjie An, Yuxiang Zhao, Yichi Zhang, Qixi Zheng, Yujie Tu, Keqi Deng, Kai Yu, Xie Chen

http://arxiv.org/abs/2605.24863v2
Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems
2026-05-29T01:13:17Z
Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space.
CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.
2026-05-24T04:46:30Z
4 pages, 1 figure, work in progress
Yang Xiao, Siyi Wang, Eun-Jung Holden, Ting Dang

http://arxiv.org/abs/2601.13704v3
Performance and Complexity Trade-off Optimization of Speech Models During Training
2026-05-28T23:47:31Z
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods.
Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.
2026-01-20T08:00:05Z
This work has been submitted to the IEEE for possible publication
Esteban Gómez, Tom Backström

http://arxiv.org/abs/2605.30594v1
FiPA-SR -- FiLM-Conditioned Perceptually Informed Audio Super-Resolution
2026-05-28T21:40:06Z
Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMamba}_\textrm{P}$ framework, the proposed model incorporates FiLM layers to adapt the reconstruction process according to the respective bandwidth. Experiments on the MUSDB dataset show that FiPA-SR outperforms the state-of-the-art AudioSR model across 8, 20, and 32 kHz input sampling rates. Moreover, the proposed architecture uses approximately 3$\times$ less GPU memory and performs inference more than 60$\times$ faster than the diffusion-based baseline.
2026-05-28T21:40:06Z
Submitted to the XLIV BRAZILIAN SYMPOSIUM ON TELECOMMUNICATIONS AND SIGNAL PROCESSING - SBrT 2026
Wallace Abreu, Luiz W. P. Biscainho

http://arxiv.org/abs/2605.30339v1
Benchmarking Single-Factor Physical Video-to-Audio Generation
2026-05-28T17:59:09Z
Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes.
Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/
2026-05-28T17:59:09Z
CVPR 2026
Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu

http://arxiv.org/abs/2508.12001v3
FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis
2026-05-28T14:34:05Z
Current non-autoregressive (NAR) text-to-speech (TTS) systems still struggle to model diverse and speaker-dependent duration variation. We further observe that richer duration variation can increase the synthesis difficulty of existing HiFi-GAN-based vocoders, leading to spectral artifacts and unstable time-frequency structures. To address these issues, we propose FNH-TTS, a VITS-based end-to-end TTS system with Mixture-of-Experts duration modeling and robust vocoder-side synthesis.
Specifically, we introduce a Mixture-of-Experts Duration Predictor (MoE-DP) to capture diverse phoneme duration patterns and speaker-dependent speaking-rate characteristics. To convert richer duration variation into stable waveform generation, we further integrate a VOCOS-style vocoder with Collaborative Multi-Band and Sub-Band Discriminators. Experiments on LJSpeech, VCTK, and LibriTTS show that FNH-TTS achieves improved synthesis quality, duration-category accuracy, vocoder reconstruction quality, and inference efficiency. Further analysis shows that MoE-DP is the main source of improved duration modeling, while stronger vocoder-side components are necessary for robust synthesis under richer duration variation.
2025-08-16T10:04:21Z
Qingliang Meng, Yuqing Deng, Wei Liang, Limei Yu, Huizhi Liang, Tian Li

http://arxiv.org/abs/2605.29950v1
Frequency-Modulated and Single-Tone Excitation to Reveal Vibro-Acoustic Nonlinearities in Loosened Bolted Joints
2026-05-28T13:57:38Z
Preload loss in bolted joints results in alterations of the stiffness, damping, and nonlinearity of the structure, but existing monitoring techniques for rail-vehicle systems are often not capable of combining controlled shaker tests and sensing of nonlinear features. This paper proposes a method for detecting bolt loosening using a vibro-acoustic technique, where the structure is subjected to controlled shaker tests to sense the nonlinear features. A triaxial accelerometer was attached to the demonstrator, a microphone was placed in close proximity, and one of the bolts was tested under 0%, 20%, 40%, and 80% preload conditions. Single-tone and frequency-modulated (FM) signals close to the main natural frequency of 130 Hz, which was identified using sine sweep and narrow-band excitation, were applied to the demonstrator. When the structure was subjected to 130 Hz single-tone excitation, the loose state of the bolt exhibited several additional high-frequency spectral peaks.
FM excitation between 125 and 135 Hz further distinguished between the states. Harmonic band power ratios, normalized to the carrier, separated the loose state from the 80% preload state by 17.5 dB for l = 2 and 36.5 dB for l = 6.
2026-05-28T13:57:38Z
Berkay Kullukcu, Robin Pianowski, Dina Hannebauer

http://arxiv.org/abs/2605.29931v1
It's All About Speed: AI's Impact on Workflow in Music Production
2026-05-28T13:43:26Z
In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.
2026-05-28T13:43:26Z
Audio Engineering Society Conference Paper - Presented at the AES International Conference on Machine Learning and Artificial Intelligence for Audio 2025 - September 8-10, London, UK
Finn McClellan, Fabio Morreale

http://arxiv.org/abs/2605.29862v1
Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions
2026-05-28T12:43:03Z
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices.
Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.
2026-05-28T12:43:03Z
2 figures, 4 tables, and 5 pages
Heejoon Koo, Yoon Tae Kim, Miika Toikkanen, June-Woo Kim