http://arxiv.org/abs/2603.07584v2 Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations 2026-06-02T20:33:26Z

Computational engine sound modeling is central to the automotive audio industry, particularly for active sound design applications and virtual prototyping. Emerging data-driven engine sound synthesis methods require large volumes of standardized, clean audio recordings with precisely time-aligned operating-state annotations: data that is difficult to obtain due to high costs, specialized measurement equipment requirements, and inevitable noise contamination. We present an analysis-driven framework for generating engine audio with sample-accurate control annotations. The method extracts harmonic structures from real recordings through pitch-adaptive spectral analysis, which then drive an extended parametric harmonic-plus-noise synthesizer. With this framework, we augment 5-10 min of source audio per engine 15-30x via diverse control trajectories and parametric variation, producing the Procedural Engine Sounds Dataset (19.0 h, 5,935 files): a set of engine audio signals with sample-accurate RPM and torque annotations spanning a wide range of operating conditions, signal complexities, and harmonic profiles. Comparison against real recordings validates that the synthesized data preserves characteristic harmonic structures, and a baseline differentiable synthesis network trained on the dataset confirms its suitability for data-driven engine sound modeling. The dataset is released publicly to support research on engine timbre analysis, control parameter estimation, and neural generative synthesis.
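To make the idea of sample-accurate control annotations concrete, the following is a minimal sketch of a harmonic-plus-noise renderer driven by a per-sample RPM trajectory. It is not the authors' extended synthesizer: the function name `synth_engine`, the four-stroke firing-frequency relation, the 1/k harmonic amplitudes, and the uniform noise term are all illustrative assumptions, whereas the paper derives harmonic profiles from real recordings.

```python
import math
import random


def synth_engine(rpm, sample_rate=16000, n_harmonics=8, cylinders=4,
                 noise_gain=0.02, seed=0):
    """Render one audio sample per RPM value (hypothetical helper).

    `rpm` is a per-sample list, so it doubles as the annotation track.
    """
    rng = random.Random(seed)
    phase = 0.0
    out = []
    for r in rpm:
        # Assumed four-stroke engine: each cylinder fires once per two revolutions.
        f0 = (r / 60.0) * (cylinders / 2.0)
        # Accumulate phase so that frequency changes stay click-free.
        phase += 2.0 * math.pi * f0 / sample_rate
        s = 0.0
        for k in range(1, n_harmonics + 1):
            # Skip harmonics at or above the Nyquist frequency.
            if k * f0 < sample_rate / 2:
                s += math.sin(k * phase) / k  # illustrative 1/k amplitude profile
        # Noise component: plain uniform noise, a placeholder for a shaped noise model.
        s += noise_gain * rng.uniform(-1.0, 1.0)
        out.append(s)
    return out


sr = 16000
# One second of a linear ramp from 1000 RPM towards 3000 RPM.
rpm = [1000.0 + 2000.0 * n / sr for n in range(sr)]
audio = synth_engine(rpm, sample_rate=sr)
```

Because `audio[n]` is computed directly from `rpm[n]`, the control annotation is aligned to the audio by construction, which is the property the abstract describes as sample-accurate.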

2026-03-08T11:05:10Z To appear in the Proceedings of the 34th European Signal Processing Conference (EUSIPCO 2026) Robin Doerfler Lonce Wyse http://arxiv.org/abs/2603.09391v2 Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis 2026-06-02T20:22:55Z

Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal structure. We present the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that generates engine audio as parameterized pulse trains aligned to engine firing patterns and propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics. The architecture integrates physics-informed inductive biases including harmonic decay, thermodynamic pitch modulation, valve-dynamics envelopes, exhaust system resonances and derived engine operating modes such as throttle operation and Deceleration Fuel Cutoff (DFCO). Validated on three diverse engine types totaling 7.5 hours of audio, PTR achieves a 21% improvement in harmonic reconstruction and a 5.7% reduction in total loss over a harmonic-plus-noise baseline model, while providing interpretable parameters corresponding to physical phenomena. Complete code, model weights, and audio examples are openly available.
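The core signal path described above, a pulse train fed through a recursive Karplus-Strong resonator, can be sketched in a few lines. This is only a non-differentiable toy under stated assumptions: unit impulses stand in for the parameterized pulse shapes, the two-point averaging filter in the feedback path is the classic Karplus-Strong choice rather than anything confirmed for PTR, and the names `pulse_train` and `karplus_strong_resonator` as well as all numeric values are hypothetical.

```python
def pulse_train(n_samples, period):
    """Unit impulses every `period` samples (stand-in for exhaust pulses)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def karplus_strong_resonator(x, delay, feedback=0.9):
    """Recursive comb with a two-point average in the loop:

    y[n] = x[n] + feedback * 0.5 * (y[n - delay] + y[n - delay - 1])
    """
    y = [0.0] * len(x)
    for n in range(len(x)):
        a = y[n - delay] if n >= delay else 0.0
        b = y[n - delay - 1] if n >= delay + 1 else 0.0
        y[n] = x[n] + feedback * 0.5 * (a + b)
    return y


# Assumed values: 16 kHz sampling, 100 Hz firing rate, 40-sample resonator delay.
excitation = pulse_train(1600, period=160)
y = karplus_strong_resonator(excitation, delay=40, feedback=0.9)
```

With `feedback` below 1 the loop is stable, and each pulse excites a decaying, progressively low-passed echo pattern whose spacing is set by `delay`.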

2026-03-10T09:03:35Z Revised version; to appear in the Proceedings of the 34th European Signal Processing Conference (EUSIPCO 2026) Robin Doerfler Lonce Wyse http://arxiv.org/abs/2605.30457v2 Extracting accent features in spoken Brazilian Portuguese without sociolinguistic labels 2026-06-02T18:37:25Z

Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.

2026-05-28T18:27:56Z This work was submitted to the XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026) Pedro H. L. Leite Pedro Benevenuto Valadares Luiz W. P. Biscainho http://arxiv.org/abs/2606.04103v1 The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids 2026-06-02T18:09:51Z

Conventional hearing aids rely on fixed, frequency-dependent amplification and compression to manage reduced sensitivity, which often fails to provide sufficient listening support in complex environments, such as situations with multiple speakers (the ``cocktail party'' problem). To more comprehensively address the underlying encoding dysfunctions of hearing loss, we introduce the Differentiable Auditory Loop (DAL), a new open-source framework for personalized hearing aid design and fitting. Our first implementation of DAL incorporates CARFAC, a differentiable model of human cochlear function, which we ported to JAX, to optimize a deep neural network to match impaired auditory neural activity patterns with a normal-hearing reference. To build a hearing aid with the fine-grained spectro-temporal signal processing required, we adopt SEANet, a waveform-to-waveform fully convolutional UNet generator. We fine-tune the network by comparing the outputs of a CARFAC model fitted to normal hearing with those of a CARFAC model fitted to match each subject's individual hearing impairment. The comparison is done using loss functions derived from the respective CARFAC neural activity pattern (NAP) outputs and stabilized auditory images (SAIs), the latter providing a 2D representation that captures phase-insensitive temporal structure in the auditory nerve output. Through gradient descent, the SEANet model learns to both denoise the input and compensate for the hearing loss modelled by the impaired CARFAC model. Across neural-representation and signal-fidelity metrics, the DAL-optimized SEANet model outperformed the tested master hearing aid (MHA) baselines. The DAL framework provides a practical path toward model-based, machine-learning-driven personalization of hearing aid signal processing. Next steps include hardware deployment to enable real-world clinical testing.

2026-06-02T18:09:51Z Alejandro Ballesta Rosen Jason Mikiel-Hunter Julian Maclaren Jack Collins Richard F. Lyon Simon Carlile http://arxiv.org/abs/2606.03957v1 Efficient ASR Training with Conversations that Never Happened 2026-06-02T17:46:12Z

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2026-06-02T17:46:12Z Máté Gedeon Péter Mihajlik http://arxiv.org/abs/2606.02400v2 SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription 2026-06-02T17:20:11Z

Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.

2026-06-01T15:47:01Z 10 pages, 4 figures, 3 tables Yuhang Dai Haopeng Lin Zhennan Lin Jiale Qian Jun Wu Hanke Xie Hao Meng Hanlin Wen Chuang Ding Shunshun Yin Ming Tao Lei Xie Xinsheng Wang http://arxiv.org/abs/2606.02173v2 Domain-Agnostic Incremental Learning for Sound Classification. A DCASE 2026 Challenge task 2026-06-02T16:47:01Z

This paper presents the Domain-Agnostic Incremental Learning for Audio Classification Task of the DCASE 2026 Challenge. Incremental learning refers to sequentially learning new tasks with the same system while maintaining its knowledge and performance on the previously learned task. Domain-incremental learning for sound classification refers to learning the same sound classes but in different acoustic domains, and was formalized as a data challenge for the first time in DCASE 2026. Participants will train a system to learn ten sound classes in three different domains, with learning at each incremental task not having access to previous task data. Submitted systems will be ranked by the overall average accuracy calculated over the three domains. During the development stage, the provided baseline system obtains a modest performance of 52.5\% accuracy over the last two domains, mostly due to erroneous inference of the domain for the test sample.
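As a small illustration of the ranking criterion, the sketch below computes per-domain accuracy and then an unweighted mean over domains. The equal weighting of domains, the helper names, and the toy labels are assumptions made for illustration; the official challenge definition should be taken from the task description.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_accuracy(per_domain):
    """Per-domain accuracies plus their unweighted mean (assumed weighting)."""
    accs = {name: accuracy(t, p) for name, (t, p) in per_domain.items()}
    return accs, sum(accs.values()) / len(accs)


# Toy labels for three hypothetical domains, evaluated after the last task.
results = {
    "D1": (["dog", "car", "rain", "siren"], ["dog", "car", "rain", "siren"]),
    "D2": (["dog", "car", "rain", "siren"], ["dog", "rain", "rain", "dog"]),
    "D3": (["dog", "car", "rain", "siren"], ["dog", "car", "rain", "dog"]),
}
per_domain_acc, overall = average_accuracy(results)
```

Reporting the per-domain values alongside the mean makes forgetting visible: a system can keep a reasonable average while dropping sharply on an earlier domain.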

2026-06-01T12:31:52Z White paper. To be completed after the challenge deadline and submitted for the DCASE 2026 Workshop. Revision: Table 1 corrected to provide macro-average accuracy Riccardo Casciotti Manjunath Mulimani Manu Harju Jesper Rindom Jensen Annamaria Mesaros http://arxiv.org/abs/2606.03832v1 In-the-Loop Training of Deep Feedback Cancellation for Hearing Aids 2026-06-02T16:17:09Z

Acoustic feedback limits the maximum gain in hearing aids. In addition to several approaches based on adaptive filtering, recently a deep-neural-network-based feedback cancellation (DFC) approach has been proposed, which is trained via an open-loop framework. Since open-loop-trained DFC (DFC-OL) can become unstable during inference at high gains, in this paper we propose an in-the-loop-trained DFC (DFC-IL) that integrates the DFC directly into the optimisation loop. This allows the model to be exposed to unstable conditions during training. A two-stage training strategy involving pre-training on stable systems and fine-tuning on a wider gain range enables DFC-IL to learn robust howling reduction. Experimental results on measured feedback paths demonstrate that in scenarios with small gains, the proposed DFC-IL performs similarly to DFC-OL, and both exceed the performance of adaptive filters. In scenarios with high amplification gains, DFC-IL clearly outperforms DFC-OL by maintaining system stability.
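Why feedback caps the usable gain can be seen in a deliberately simplified closed-loop simulation. The sketch assumes a single-tap feedback path (one delay and one attenuation coefficient) and a pure broadband gain in the forward path; real feedback paths are frequency dependent, and no cancellation model (adaptive or neural) is included. The function `closed_loop` and all numeric values are hypothetical.

```python
def closed_loop(source, gain, fb_coeff, fb_delay):
    """Loudspeaker signal of a toy hearing aid with acoustic feedback:

    u[n] = gain * (source[n] + fb_coeff * u[n - fb_delay])
    """
    u = [0.0] * len(source)
    for n in range(len(source)):
        leak = fb_coeff * u[n - fb_delay] if n >= fb_delay else 0.0
        u[n] = gain * (source[n] + leak)
    return u


impulse = [1.0] + [0.0] * 999
# Loop gain = gain * fb_coeff: 0.5 decays, 2.0 grows without bound (howling).
stable = closed_loop(impulse, gain=5.0, fb_coeff=0.1, fb_delay=10)
unstable = closed_loop(impulse, gain=20.0, fb_coeff=0.1, fb_delay=10)
```

In this toy model the loop is stable only while `gain * fb_coeff` stays below 1, so reducing the effective feedback coefficient is what allows a higher gain.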

2026-06-02T16:17:09Z Svantje Voit Simon Doclo http://arxiv.org/abs/2509.09685v5 TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation 2026-06-02T15:04:53Z

We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at https://talkpl-ai.github.io.

2025-08-18T05:06:58Z Keunwoo Choi Seungheon Doh Juhan Nam http://arxiv.org/abs/2606.03747v1 Stable Hybrid Cross-Attention Fusion for Audio-Visual Event Recognition 2026-06-02T15:01:06Z

Audio-Visual Event Recognition (AVER) is essential for intelligent urban monitoring systems, where robust multimodal understanding of complex environments is required. This paper proposes a stable hybrid cross-attention fusion framework for audio-visual event recognition in smart urban environments. The proposed architecture combines pretrained Video Masked Autoencoder (VideoMAE) and Audio Spectrogram Transformer (AST) representations with FiLM-based audio conditioning, bidirectional cross-attention fusion, multimodal Transformer encoding, and modality-temporal attention. To improve computational efficiency and training stability, frozen pretrained backbones and cached feature extraction are employed. Extensive experiments on the AVE dataset show that the proposed framework achieves the highest average performance among the evaluated unimodal and multimodal baselines across multiple evaluation metrics, obtaining a best validation accuracy of 91.74% and a test accuracy of 83.85 ± 1.40% over five independent runs. The results indicate that the proposed hybrid fusion strategy effectively captures complementary audio-visual information and provides robust multimodal representation learning for challenging real-world urban monitoring scenarios.

2026-06-02T15:01:06Z 6 pages, 4 Figures Parinaz Binandeh Dehaghani Danilo Pena A. Pedro Aguiar http://arxiv.org/abs/2605.31530v2 UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion 2026-06-02T14:24:03Z

We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.

2026-05-29T16:43:07Z Zhaoqing Li Haoning Xu Jingran Su Yaofang Liu Zhefan Rao Huimeng Wang Jiajun Deng Tianzi Wang Zengrui Jin Rui Liu Haoxuan Che Xunying Liu http://arxiv.org/abs/2606.03455v1 WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling 2026-06-02T10:33:20Z

Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon flow matching with a Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
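The patchification step mentioned above can be illustrated with a minimal sketch: a long 1-D waveform is cut into fixed-size, non-overlapping patches so that a Transformer sees a much shorter token sequence. The patch size, the zero-padding of the tail, and the helper names `patchify` and `unpatchify` are assumptions for illustration; the abstract does not specify these details.

```python
def patchify(waveform, patch_size):
    """Split a 1-D waveform into non-overlapping patches, zero-padding the tail."""
    pad = (-len(waveform)) % patch_size
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches, length):
    """Flatten patches back to a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


wave = [0.1 * n for n in range(10)]
patches = patchify(wave, patch_size=4)        # 3 patches of 4 samples each
restored = unpatchify(patches, len(wave))     # identical to the input
```

With a patch size of P, a waveform of N samples becomes roughly N / P tokens, which is how patching addresses the sequence-length problem raised in the abstract.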

2026-06-02T10:33:20Z Wenxi Chen Dongya Jia Yushen Chen Zhikang Niu Yuzhe Liang Xiquan Li Ruiqi Yan Ziyang Ma Guanrou Yang Sanyuan Chen Yue Wang Zhuo Chen Kai Yu Xie Chen http://arxiv.org/abs/2510.01698v4 TalkPlay-Tools: Conversational Music Recommendation with LLM Tool Calling 2026-06-02T09:47:42Z

While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.

2025-10-02T06:08:54Z Accepted for publication at The Workshop on AI for Music, Neural Information Processing Systems (NeurIPS-AI4Music) Seungheon Doh Keunwoo Choi Juhan Nam http://arxiv.org/abs/2502.13713v5 TALKPLAY: Multimodal Music Recommendation with Large Language Models 2026-06-02T09:14:11Z

We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.

2025-02-19T13:28:20Z Seungheon Doh Keunwoo Choi Juhan Nam http://arxiv.org/abs/2606.03241v1 Benchmarking Speech-to-Speech Translation Models 2026-06-02T07:01:33Z

Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30\% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X$\to$EN and EN$\to$X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's $\rho>0.80$) while cutting evaluation time by $\approx 2.5\times$. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment ($\rho\geq 0.90$). We release COMPASS as a foundation for domain-aware S2ST evaluation.
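The ranking-preservation check can be sketched with a plain Spearman rank correlation between system scores under a full metric set and under a reduced subset. This is not the COMPASS implementation: the helper names and the toy scores are hypothetical, and how per-metric scores are aggregated into one value per system is left out entirely.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(a), ranks(b))


# Toy scores for five hypothetical systems; the subset swaps the top two.
full_scores = [0.61, 0.64, 0.70, 0.75, 0.82]
subset_scores = [0.58, 0.66, 0.69, 0.81, 0.79]
rho = spearman(full_scores, subset_scores)
```

A single swap of adjacent systems among five gives a rho of 0.9, so a threshold such as 0.80 tolerates small local reorderings while still rejecting subsets that reshuffle the ranking substantially.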

2026-06-02T07:01:33Z Paper under submission Alkis Koudounas Hayato Futami Quentin Jodelet Osamu Take Shinji Watanabe Emiru Tsunoo