https://arxiv.org/api/HEVKSfJonPye7rbjbtq7rmKV2Bg2026-06-09T21:33:06Z209311515http://arxiv.org/abs/2510.04593v3UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models2026-06-08T05:49:30ZLarge language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.2025-10-06T08:47:38Zaccepted at interspeech2026Wenhao GuanZhikang NiuZiyue JiangKaidi WangPeijie ChenQingyang HongLin LiXie Chenhttp://arxiv.org/abs/2606.09050v1MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion2026-06-08T05:39:23ZStreaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.2026-06-08T05:39:23ZAccepted by Interspeech 2026Guobin MaYuxuan XiaYuepeng JiangDake GuoHanke XieJingbin HuYanbo WangLei XiePengcheng Zhuhttp://arxiv.org/abs/2606.09048v1BareWave: Waveform-Native Flow-Matching Text-to-Speech2026-06-08T05:36:42ZRemoving intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.2026-06-08T05:36:42ZUnder ReviewWei FanChao-Hong TanQian ChenWen WangXiangang LiKejiang ChenWeiming ZhangNenghai Yuhttp://arxiv.org/abs/2606.09019v1TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech2026-06-08T04:32:08ZCodec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.2026-06-08T04:32:08ZYejin LeeJunwon MoonHyoeun KimHyunjin ChoiHeeseung KimKyuhong Shimhttp://arxiv.org/abs/2606.08843v1From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data2026-06-07T21:25:14ZWe present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.2026-06-07T21:25:14ZMoshe MandelShlomo E. Chazanhttp://arxiv.org/abs/2605.06582v2PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization2026-06-07T17:38:26ZMany operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly.
We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse.
PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control.
On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.2026-05-07T17:11:22Z57 pages main content, 109 total pages, 9 Figures, pre-print, Under ReviewAdhiraj BanerjeeVipul Arorahttp://arxiv.org/abs/2606.08722v1Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding2026-06-07T16:32:59ZSymbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench2026-06-07T16:32:59ZAccepted at Ital-IA 2026Matteo SpanioMohammad TorabiAndrea PoltronieriAntonio Rodàhttp://arxiv.org/abs/2606.08678v1Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck2026-06-07T15:31:00ZSophisticated generative speech technology can undermined the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity with the preservation of those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction to the EER compared to the MHFA baseline.2026-06-07T15:31:00ZAnh-Tuan DaoDriss MatroufMickael RouvierNicholas Evanshttp://arxiv.org/abs/2606.08669v1A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis2026-06-07T15:20:38ZVoice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.2026-06-07T15:20:38ZAnh-Tuan DaoDriss MatroufMickael RouvierNicholas Evanshttp://arxiv.org/abs/2606.08663v1Probing Token Spaces under Generator Shift in AI-Generated Music Detection2026-06-07T15:08:19ZAI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on \textsc{MoM-open}, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce \textsc{CoMoE}, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.2026-06-07T15:08:19ZAccepted to ICML 2026 ML4Audio workshopJoonyong ParkJungwoo KimJunyoung KohYuki Saitohttp://arxiv.org/abs/2502.16584v2Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound2026-06-07T13:34:59ZRecent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.2025-02-23T14:24:15ZLiumeng XueZiya ZhouJiahao PanZixuan LiShuai FanYinghao MaSitong ChengDongchao YangHaohan GuoYujia XiaoXinsheng WangZixuan ShenChuanbo ZhuXinshen ZhangTianchi LiuRuibin YuanZeyue TianHaohe LiuXingjian DuEmmanouil BenetosGe ZhangYike GuoWei Xuehttp://arxiv.org/abs/2606.08580v1G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching2026-06-07T11:28:32ZUsing speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.2026-06-07T11:28:32ZAccepted to Interspeech 2026Yike ZhuZiqian WangZikai LiuXingchen LiZhuangqi ChenXianjun XiaChuanzeng HuangLei Xiehttp://arxiv.org/abs/2601.09239v6DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion2026-06-07T10:32:48ZSpeech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrograms restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice clone. To speed up inference, we distill the dit decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are avaialble at https://anonymous.4open.science/w/DSA_Tokenizer_demo/2026-01-14T07:22:24ZSubmit to ACL ARR 2026 MayHanlin ZhangDaxin TanDehua TaoXiao ChenHaochen TanYunhe LiYuchen CaoLinqi Songhttp://arxiv.org/abs/2604.10628v2BMdataset: A Musicologically Curated LilyPond Dataset2026-06-07T10:28:59ZSymbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.2026-04-12T13:09:20ZSubmitted to SMC2026Matteo SpanioIlay GulerAntonio Rodàhttp://arxiv.org/abs/2603.12046v2Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition2026-06-07T09:33:32ZAudio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.2026-03-12T15:22:27ZAccepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AVUmberto CappellazzoStavros PetridisMaja Pantic