https://arxiv.org/api/i1OQLLqgywHJFptuqF+k6h2spjw
2026-06-13T17:22:20Z
216839015

http://arxiv.org/abs/2606.08425v1
TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints
2026-06-07T02:50:24Z
Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.
2026-06-07T02:50:24Z
Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app
Vinh-Thuan Ly

http://arxiv.org/abs/2606.08393v1
SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation
2026-06-07T01:10:11Z
Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem.
We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.
2026-06-07T01:10:11Z
6 pages, 4 figures
Haoyu Zhang, Yuta Oshima, Xingjian Du, Chunfeng Wang, Irene Li, Yusuke Iwasawa, Yutaka Matsuo

http://arxiv.org/abs/2606.08247v1
AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals
2026-06-06T16:11:45Z
Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting.
We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.
2026-06-06T16:11:45Z
10 pages, 8 figures, 5 tables, 14 equations
Aueaphum Aueawatthanaphisut

http://arxiv.org/abs/2606.08210v1
Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion
2026-06-06T14:54:44Z
Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes).
Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.
2026-06-06T14:54:44Z
Accepted at INTERSPEECH 2026 (Main)
Rashini Liyanarachchi, Rachael Mackay, Alison Short, Aditya Joshi, Erik Meijering

http://arxiv.org/abs/2606.08171v1
Predictive Fixed-Filter Active Noise Control (PFANC) Using Convolutional Recurrent Neural Networks for Dynamic Noises
2026-06-06T13:35:34Z
The existing Generative Fixed-Filter Active Noise Control (GFANC) method generates a suitable control filter based on the current noise frame. This reactive design aims to estimate a control filter that is optimal for the present frame rather than the upcoming one. Consequently, it suffers from an inherent tracking lag and lacks the predictive capability to handle rapidly varying noises. To address this limitation, we propose the Predictive Fixed-Filter Active Noise Control (PFANC) method with a proactive control paradigm in this paper. In the PFANC method, multiple consecutive noise frames are processed by a Convolutional Recurrent Neural Network (CRNN) to predict the next-frame control filter. By utilizing temporal correlations across noise frames to anticipate the control filter in advance, the PFANC method can effectively track dynamic noise changes. Furthermore, the theoretical analysis based on a high-order Markov chain shows that incorporating multiple noise frames enhances the prediction of the control filter. Numerical simulations with linear and logarithmic chirp signals, as well as real-world dynamic noises, validate the effectiveness of the PFANC method and its superiority over GFANC and its variations.
The PFANC method also exhibits good transferability across different acoustic paths.
2026-06-06T13:35:34Z
Zhengding Luo, Haowen Li, Haozhe Ma, Dongyuan Shi, Wen Zhang, Woon-Seng Gan

http://arxiv.org/abs/2512.20978v2
GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model
2026-06-06T11:42:51Z
Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further apply DPO to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
2025-12-24T06:13:02Z
Accepted to Interspeech 2026
Haoyang Li, Xuyi Zhuang, Azmat Adnan, Ye Ni, Wei Rao, Shreyas Gopal, Eng Siong Chng, Boon Siew Han, Yuanjin Zheng

http://arxiv.org/abs/2601.04178v2
Sound Event Detection with Boundary-Aware Optimization and Inference
2026-06-06T11:02:56Z
Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection.
The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.
2026-01-07T18:45:29Z
Accepted for publication in IEEE Signal Processing Letters, 2026
Florian Schmid, Chi Ian Tang, Sanjeel Parekh, Vamsi Krishna Ithapu, Juan Azcarreta Ortiz, Giacomo Ferroni, Yijun Qian, Arnoldas Jasonas, Cosmin Frateanu, Camilla Clark, Gerhard Widmer, Çağdaş Bilen

http://arxiv.org/abs/2312.15946v3
EnchantDance: Unveiling the Potential of Music-Driven Dance Movement
2026-06-06T10:48:53Z
The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; and 3) the protracted nature of dance movements, which makes it challenging to maintain a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation.
Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.
2023-12-26T08:19:10Z
Project Page: https://fluide1022.github.io/EnchantDance/
Bo Han, Teng Zhang, Zeyu Ling, Feilin Han

http://arxiv.org/abs/2602.20967v2
Training-Free Intelligibility-Guided Observation Addition for Noisy ASR
2026-06-06T10:41:57Z
Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines.
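The fusion step that observation addition relies on can be illustrated with a minimal sketch. It assumes OA is a sample-wise convex combination of the noisy and the enhanced waveform with a single weight in [0, 1]; the abstract does not state how the ASR-derived intelligibility estimates are turned into that weight, so `weight_from_scores` is a hypothetical placeholder rule, not the paper's method.

```python
import numpy as np


def observation_addition(noisy, enhanced, weight):
    """Fuse two equal-length waveforms as weight*enhanced + (1-weight)*noisy."""
    noisy = np.asarray(noisy, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    if noisy.shape != enhanced.shape:
        raise ValueError("noisy and enhanced signals must have the same shape")
    # Keep the combination convex even if the caller passes an out-of-range weight.
    weight = float(np.clip(weight, 0.0, 1.0))
    return weight * enhanced + (1.0 - weight) * noisy


def weight_from_scores(score_noisy, score_enhanced, eps=1e-8):
    """Hypothetical rule (an assumption): the enhanced signal's share of the
    total intelligibility score. Scores are assumed to be non-negative."""
    return score_enhanced / (score_noisy + score_enhanced + eps)
```

With this placeholder rule, an enhanced signal that scores three times higher than the noisy one receives a weight of about 0.75, so the fused waveform leans on the enhanced speech while retaining a quarter of the unprocessed observation.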
Additional analyses of intelligibility-guided switching-based alternatives and frame- versus utterance-level OA further validate the proposed design.
2026-02-24T14:46:54Z
Accepted to Interspeech 2026
Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng

http://arxiv.org/abs/2602.23958v2
An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance
2026-06-06T08:55:46Z
Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
2026-02-27T12:01:16Z
Accepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-bias
Wonwoo Jeong

http://arxiv.org/abs/2604.24199v4
Speech Enhancement Based on Drifting Models
2026-06-06T06:47:13Z
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution.
This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.
2026-04-27T09:00:51Z
6 pages, 2 figures
Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

http://arxiv.org/abs/2606.07494v1
Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech
2026-06-05T17:48:46Z
Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap.
This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
2026-06-05T17:48:46Z
Work in progress
Xuanjun Chen, Yun-Shing Wu, Wei-Chung Lu, Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

http://arxiv.org/abs/2501.08238v3
CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech
2026-06-05T17:32:46Z
With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder.
Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.
2025-01-14T16:26:14Z
Accepted by TASLP 2026
Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

http://arxiv.org/abs/2603.08683v2
Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio
2026-06-05T15:11:26Z
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression.
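The vocabulary arithmetic behind the move from sample-level to byte-level tokens can be checked with a short sketch. This is a generic illustration under stated assumptions (signed two's-complement PCM, big-endian byte order, bit depths that are multiples of 8) and is not claimed to be the Trilobyte schema itself; all function names are hypothetical.

```python
def sample_vocab_size(bit_depth):
    """Vocabulary size when every distinct sample value is its own token."""
    return 2 ** bit_depth


def to_byte_tokens(sample, bit_depth):
    """Split one signed PCM sample into byte tokens, each in the range 0..255."""
    n_bytes = bit_depth // 8
    # Masking maps a negative value onto its two's-complement bit pattern.
    unsigned = sample & ((1 << bit_depth) - 1)
    return list(unsigned.to_bytes(n_bytes, "big"))


def from_byte_tokens(tokens, bit_depth):
    """Invert to_byte_tokens so the round trip is lossless."""
    unsigned = int.from_bytes(bytes(tokens), "big")
    sign_bit = 1 << (bit_depth - 1)
    return unsigned - (1 << bit_depth) if unsigned & sign_bit else unsigned


# Sample-level vocabularies grow as 2**b; byte tokens always fit in 256 symbols.
for depth in (8, 16, 24):
    print(depth, sample_vocab_size(depth), len(to_byte_tokens(0, depth)))
```

The trade-off in this sketch is that the vocabulary stays fixed at 256 symbols while the token sequence grows by a factor of b/8, which matches the 65K (2^16 = 65,536) and 16.7M (2^24 = 16,777,216) figures quoted above for sample-level tokenization.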
While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
2026-03-09T17:52:02Z
Accepted at Interspeech 2026, 7 pages, 5 figures
Phillip Long, Zachary Novack, Chris Donahue

http://arxiv.org/abs/2606.07264v1
VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track
2026-06-05T13:39:39Z
Audio reasoning requires multi-step, evidence-grounded inference over temporally dynamic and acoustically mixed signals, exceeding conventional perception tasks such as ASR or captioning. We present VISA, our submission to the Interspeech 2026 Audio Reasoning Challenge (Agent Track), evaluated via the MMAR Rubrics for correctness and reasoning quality. Under a "LALM as a Tool" paradigm, VISA strengthens large audio language models with auxiliary multi-modal evidence while avoiding heavy orchestration. The system integrates three components: multi-modal feature extraction for complementary audio and acoustic-visual clues, model-voting inference with consistency checking for stable predictions, and fine-grained category-aware routing to resolve disagreements and select rubric-aligned reasoning chains. On the official Agent Track leaderboard, VISA ranks 2nd overall with a 66.23% Rubrics score. It also achieves 77.40% Accuracy, the highest among all systems listed across both the Single Model and Agent tracks.
2026-06-05T13:39:39Z
Submitted to INTERSPEECH 2026
Wenming Tu, Jian Gao, Yanru Huo, Yixuan Wang, Jing Peng, Bohan Li, Ziyang Ma, Tao Liu, Shuai Fan, Kai Yu, Xie Chen, Zilong Zheng