https://arxiv.org/api/z2iquAwRaikxRk5dBEfSiL354tE2026-06-13T13:47:29Z216834515http://arxiv.org/abs/2606.10675v1Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming2026-06-09T10:27:59ZWe present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.2026-06-09T10:27:59ZInterspeech 2026Roy WeberMeidan ZehaviRotem RoussoJoseph Keshethttp://arxiv.org/abs/2606.10581v1ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models2026-06-09T08:45:52ZSpeech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose \textbf{ParaBridge}, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from $14.6\%$ to $40.3\%$ and improves EchoMind average rating from $3.27$ to $3.92$. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within $0.4$ points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.2026-06-09T08:45:52ZYuxiang WangQinke NiShengbo CaiWan LinLiqiang ZhangZhizheng Wuhttp://arxiv.org/abs/2606.10565v1A Lightweight Dual-Factor Acoustic Authentication System via Cascaded GMM-DTW Architecture for Edge Computing2026-06-09T08:29:16ZThis paper presents a lightweight, cascaded GMM-DTW dual-factor voice lock system for resource-constrained edge environments. By utilizing a shared MFCC feature space, the framework implements a sequential defense mechanism combining GMM speaker screening and DTW passphrase verification. To counter presentation threats without extra hardware, a dynamic joint absolute-relative margin constraint is integrated into the GMM classification space, limiting the physical imposter and high-fidelity replay attack False Acceptance Rates (FAR) to 2.73% and 6.67%, respectively, with a legitimate False Rejection Rate (FRR) of 16.67%. Due to Sakoe-Chiba window optimization, the global end-to-end processing latency under temporal stress is rigidly bounded at 9.82ms on a single-core CPU, comprising 1.51ms for feature extraction, 0.54ms for GMM scoring, and 7.77ms for worst-case DTW matching. These empirical benchmarks demonstrate the viability of white-box acoustic cascades for secure, deterministic real-time deployment on low-power edge nodes.2026-06-09T08:29:16ZYutong Zhanghttp://arxiv.org/abs/2606.09677v2MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation2026-06-09T08:08:09ZWhile discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.2026-06-08T15:58:31Z5 pages, accepted to Interspeech 2026Dohwan KimJung-Woo Choihttp://arxiv.org/abs/2508.07048v2Whisfusion: Parallel ASR Decoding with Masked Diffusion2026-06-09T07:22:42ZAutoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. A natural alternative, CTC-style non-autoregressive (NAR) systems avoid this bottleneck but their conditional independence assumption sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR text-generation approach. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings, denoising masked transcripts in just a few steps. We train on ~68k hours of 11-language speech with high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks, while running 4-5x faster, additionally surpassing Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.2025-08-09T17:20:54Z16 pages, 3 figuresTaeyoun KwonJunhyuk AhnTaegeun YunHeeju JwaYoonchae ChoiSiwon ParkJongchan KimHyungon RyuHyuk-Jae LeeNam-Joon Kimhttp://arxiv.org/abs/2509.22153v2Towards Paradigm-General Suicide Risk Detection via Speech LLM2026-06-09T06:36:14ZSuicide risk among adolescents remains a critical public health concern, and speech provides a non-invasive and scalable approach for its detection. Speech-based suicide risk assessment commonly relies on carefully designed speech elicitation paradigms (\textit{e.g.,} verbal fluency, reading, or question answering) to probe cognitive and affective states. Existing approaches, however, typically focus on one single paradigm at a time. This paper, for the first time, investigates cross-paradigm approaches that unify diverse speech elicitation paradigms within a single model. Specifically, we use a speech LLM as backbone with a mixture of DoRA experts (MoDE) to capture complementary cues across assessments dynamically, tested on 1,223 participants across ten speech elicitation paradigms. Results show that MoDE outperforms both paradigm-specific and conventional joint-learning models. Moreover, it can generalise to unseen paradigms and provide better confidence calibration.2025-09-26T10:11:54ZJialun LiWeitao JiangZiyun CuiYinan DuanDiyang QuChao ZhangRunsen ChenChang LeiWen Wuhttp://arxiv.org/abs/2606.10464v1GC-LoRA: Gated Convolutional LoRA for Parameter-Efficient Acoustic Adaptation2026-06-09T06:29:13ZTransformer-based Speech Foundation Models excel in most Automatic Speech Recognition tasks but often suffer performance degradation when applied to domains with mismatched acoustic characteristics. While Parameter Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), adjust global attention, they lack the local context modeling crucial for capturing domain-specific variations. We propose GC-LoRA, a novel adapter architecture that injects Conformer-style local convolutional processing into pretrained Transformer encoders. By integrating a lightweight adapter to encoder attention output projections, our method efficiently captures local acoustic dependencies without disrupting pretrained global representations. Experiments across diverse datasets (acoustically-degraded, bandlimited, dialectal, child) demonstrate the efficacy of our approach, achieving Word Error Rate (WER) reductions of up to 10.9% compared to baselines while adding minimal trainable parameters.2026-06-09T06:29:13ZAccepted for publication at Interspeech 2026Natarajan Balaji ShankarZilai WangKaiyuan ZhangMohan ShiAbeer Alwanhttp://arxiv.org/abs/2603.11482v2AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style2026-06-09T06:22:57ZEvaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.2026-03-12T03:07:42ZAccepted to INTERSPEECH 2026Joonyong ParkJerry Lihttp://arxiv.org/abs/2606.10454v1Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR2026-06-09T06:02:31ZWhile Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.2026-06-09T06:02:31ZAccepted to Interspeech 2026Mohan ShiKaiyuan ZhangZilai WangNatarajan Balaji ShankarEray ErenAbeer Alwanhttp://arxiv.org/abs/2606.10439v1Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling2026-06-09T05:35:31ZThe rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.2026-06-09T05:35:31ZAccepted by ICASSP 2026ICASSP (2026),18807-18811Guodong LinZiqi ChenYuxiang FuKe LiWei-Qiang Zhang10.1109/ICASSP55912.2026.11464266http://arxiv.org/abs/2412.11449v2Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music2026-06-09T04:55:57ZWe propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account for all the audio contents at various frequencies for the next token prediction. By combining continuous audio representation like the spectrogram and discrete acoustic tokens, we retain the best of both worlds: Have all the information needed from the audio at a specific time instance in a single token, yet allow LLM to predict the future token to allow for sampling and other benefits discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.2024-12-16T05:03:48Z6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, IndiaPrateek Vermahttp://arxiv.org/abs/2606.09141v2FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation2026-06-09T03:52:24ZRecent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.2026-06-08T07:39:26ZAccepted to Interspeech 2026Hanke XieXiaming RenDake GuoRuonan YouWenhao LiJingbin HuGuobin MaHuakang ChenKejie XuRui HuangWeiguo TanXianrong WangLei Xiehttp://arxiv.org/abs/2606.10317v1SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space2026-06-09T02:14:11ZWe introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.2026-06-09T02:14:11ZAccepted to Interspeech2026Tomoya TanabuHiroshi NishijimaDaisuke SaitoNobuaki Minematsuhttp://arxiv.org/abs/2606.10233v1ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling2026-06-08T22:46:30ZWhile speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.2026-06-08T22:46:30ZAccepted at Interspeech 2026Zhuoyan TaoJiatong ShiHye-jin ShimShinji Watanabehttp://arxiv.org/abs/2603.07238v2Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster2026-06-08T19:27:21ZSimilarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling an S3M-based language identification system from 126 to 4,017 languages reshapes this topology, and find a non-linear effect: phylogenetic recovery stays flat up to the 1K scale, but the 4K model undergoes a qualitative shift, resolving both clear lineages and long-term linguistic contact. Most strikingly, a robust Pacific macro-cluster emerges, grouping genealogically unrelated Papuan, Oceanic, and Australian languages, and we trace its driver to a concentrated encoding that captures shared acoustic signatures such as global energy dynamics. These results suggest that massive S3Ms internalize multiple layers of language history, offering a promising perspective for computational phylogenetics and the study of language contact.2026-03-07T14:48:48ZAccepted to Interspeech 2026Minu KimHoirin KimDavid R. Mortensen