https://arxiv.org/api/i1OQLLqgywHJFptuqF+k6h2spjw 2026-06-13T17:22:20Z 21683 90 15 http://arxiv.org/abs/2606.08425v1 TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints 2026-06-07T02:50:24Z

Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.

2026-06-07T02:50:24Z Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app Vinh-Thuan Ly http://arxiv.org/abs/2606.08393v1 SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation 2026-06-07T01:10:11Z

Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.
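
The abstract singles out systematic resampling as a practical default. As a minimal sketch of that generic resampling step (not the authors' implementation; the function name and the use of plain reward-derived weights are assumptions made for illustration), one uniform offset is shared by N evenly spaced pointers into the cumulative weight distribution:

```python
import random

def systematic_resample(weights, rng=random):
    """Return particle indices drawn by systematic resampling."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    # A single uniform offset is shared by all n evenly spaced pointers.
    u0 = rng.random() / n
    indices = []
    cumulative = weights[0] / total
    i = 0
    for k in range(n):
        pointer = u0 + k / n
        # Advance to the first particle whose cumulative weight exceeds the pointer.
        while pointer >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices
```

Particles with larger weights are copied more often, while low-weight particles tend to be dropped, which is how computation is reallocated toward promising trajectories.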

2026-06-07T01:10:11Z 6 pages, 4 figures Haoyu Zhang Yuta Oshima Xingjian Du Chunfeng Wang Irene Li Yusuke Iwasawa Yutaka Matsuo http://arxiv.org/abs/2606.08247v1 AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals 2026-06-06T16:11:45Z

Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.

2026-06-06T16:11:45Z 10 pages, 8 figures, 5 tables, 14 equations Aueaphum Aueawatthanaphisut http://arxiv.org/abs/2606.08210v1 Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion 2026-06-06T14:54:44Z

Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.

2026-06-06T14:54:44Z Accepted at INTERSPEECH 2026 (Main) Rashini Liyanarachchi Rachael Mackay Alison Short Aditya Joshi Erik Meijering http://arxiv.org/abs/2606.08171v1 Predictive Fixed-Filter Active Noise Control (PFANC) Using Convolutional Recurrent Neural Networks for Dynamic Noises 2026-06-06T13:35:34Z

The existing Generative Fixed-Filter Active Noise Control (GFANC) method generates a suitable control filter based on the current noise frame. This reactive design aims to estimate a control filter that is optimal for the present frame rather than the upcoming one. Consequently, it suffers from an inherent tracking lag and lacks the predictive capability to handle rapidly varying noises. To address this limitation, this paper proposes the Predictive Fixed-Filter Active Noise Control (PFANC) method, which adopts a proactive control paradigm. In the PFANC method, multiple consecutive noise frames are processed by a Convolutional Recurrent Neural Network (CRNN) to predict the next-frame control filter. By utilizing temporal correlations across noise frames to anticipate the control filter in advance, the PFANC method can effectively track dynamic noise changes. Furthermore, the theoretical analysis based on a high-order Markov chain shows that incorporating multiple noise frames enhances the prediction of the control filter. Numerical simulations with linear and logarithmic chirp signals, as well as real-world dynamic noises, validate the effectiveness of the PFANC method and its superiority over GFANC and its variants. The PFANC method also exhibits good transferability across different acoustic paths.

2026-06-06T13:35:34Z Zhengding Luo Haowen Li Haozhe Ma Dongyuan Shi Wen Zhang Woon-Seng Gan http://arxiv.org/abs/2512.20978v2 GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model 2026-06-06T11:42:51Z

Language Model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further apply Direct Preference Optimization (DPO) to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.

2025-12-24T06:13:02Z Accepted to Interspeech 2026 Haoyang Li Xuyi Zhuang Azmat Adnan Ye Ni Wei Rao Shreyas Gopal Eng Siong Chng Boon Siew Han Yuanjin Zheng http://arxiv.org/abs/2601.04178v2 Sound Event Detection with Boundary-Aware Optimization and Inference 2026-06-06T11:02:56Z

Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers, Recurrent Event Detection (RED) and Event Proposal Network (EPN), which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.
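
For context on the frame-wise baseline that the abstract contrasts against, the sketch below shows the kind of threshold-based post-processing that turns per-frame probabilities into onset/offset pairs. It is not the RED or EPN layers; the function name, the threshold, and the hop size are hypothetical values chosen for illustration.

```python
def frames_to_events(probs, threshold=0.5, hop_seconds=0.02):
    """Convert frame-wise activity probabilities into (onset, offset) times in seconds."""
    events = []
    onset = None  # frame index where the current event started, if any
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            onset = i  # rising edge: event begins
        elif not active and onset is not None:
            events.append((onset * hop_seconds, i * hop_seconds))  # falling edge
            onset = None
    if onset is not None:
        # The event is still active at the end of the clip.
        events.append((onset * hop_seconds, len(probs) * hop_seconds))
    return events
```

The detected boundaries shift with `threshold`, which is one of the post-processing hyperparameters that a frame-wise pipeline has to tune.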

2026-01-07T18:45:29Z Accepted for publication in IEEE Signal Processing Letters, 2026 Florian Schmid Chi Ian Tang Sanjeel Parekh Vamsi Krishna Ithapu Juan Azcarreta Ortiz Giacomo Ferroni Yijun Qian Arnoldas Jasonas Cosmin Frateanu Camilla Clark Gerhard Widmer Çağdaş Bilen http://arxiv.org/abs/2312.15946v3 EnchantDance: Unveiling the Potential of Music-Driven Dance Movement 2026-06-06T10:48:53Z

The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; and 3) the protracted nature of dance movements, which makes it difficult to maintain a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, the ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.

2023-12-26T08:19:10Z Project Page: https://fluide1022.github.io/EnchantDance/ Bo Han Teng Zhang Zeyu Ling Feilin Han http://arxiv.org/abs/2602.20967v2 Training-Free Intelligibility-Guided Observation Addition for Noisy ASR 2026-06-06T10:41:57Z

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
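
The fusion step itself can be sketched as a convex combination of the two waveforms. This is only an illustration under stated assumptions: the function name is hypothetical, the weight is applied per utterance, and how the weight is derived from the ASR backend's intelligibility estimate is not specified in the abstract, so it is passed in as a plain argument here.

```python
def observation_addition(noisy, enhanced, enhanced_weight):
    """Fuse noisy and enhanced waveforms sample by sample with one utterance-level weight."""
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    if not 0.0 <= enhanced_weight <= 1.0:
        raise ValueError("enhanced_weight must lie in [0, 1]")
    w = enhanced_weight
    # w = 1 keeps only the enhanced signal; w = 0 falls back to the noisy input.
    return [w * e + (1.0 - w) * n for n, e in zip(noisy, enhanced)]
```

Because only the input waveform changes, neither the SE model nor the ASR model needs to be retrained, which matches the training-free setting described above.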

2026-02-24T14:46:54Z Accepted to Interspeech 2026 Haoyang Li Changsong Liu Wei Rao Hao Shi Sakriani Sakti Eng Siong Chng http://arxiv.org/abs/2602.23958v2 An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance 2026-06-06T08:55:46Z

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
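
To make the dependence on the embedding space concrete, the sketch below computes a Fréchet distance between two sets of encoder embeddings. It is a simplification, not the paper's pipeline: it assumes diagonal covariances (standard FAD uses full covariance matrices and a matrix square root), uses population variance, and the function name is hypothetical. Under that assumption the distance reduces to a per-dimension sum of (mu_r - mu_g)^2 + var_r + var_g - 2*sqrt(var_r * var_g).

```python
import math
import statistics

def diagonal_frechet_distance(reference, generated):
    """Frechet distance between two embedding sets under a diagonal-Gaussian fit."""
    distance = 0.0
    # zip(*rows) iterates over embedding dimensions (columns).
    for ref_dim, gen_dim in zip(zip(*reference), zip(*generated)):
        mu_r, mu_g = statistics.fmean(ref_dim), statistics.fmean(gen_dim)
        var_r, var_g = statistics.pvariance(ref_dim), statistics.pvariance(gen_dim)
        distance += (mu_r - mu_g) ** 2  # mean-shift term
        distance += var_r + var_g - 2.0 * math.sqrt(var_r * var_g)  # spread term
    return distance
```

The same pair of audio sets yields different scores when `reference` and `generated` come from different encoders, since each encoder decides which acoustic differences show up in the means and variances.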

2026-02-27T12:01:16Z Accepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-bias Wonwoo Jeong http://arxiv.org/abs/2604.24199v4 Speech Enhancement Based on Drifting Models 2026-06-06T06:47:13Z

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2026-04-27T09:00:51Z 6 pages, 2 figures Liang Xu Diego Caviedes-Nozal W. Bastiaan Kleijn Longfei Felix Yan Rasmus Kongsgaard Olsson http://arxiv.org/abs/2606.07494v1 Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech 2026-06-05T17:48:46Z

Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.

2026-06-05T17:48:46Z Work in progress Xuanjun Chen Yun-Shing Wu Wei-Chung Lu Claire Lin Haibin Wu Hung-yi Lee Jyh-Shing Roger Jang http://arxiv.org/abs/2501.08238v3 CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech 2026-06-05T17:32:46Z

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.

2025-01-14T16:26:14Z Accepted by TASLP 2026 Xuanjun Chen Jiawei Du Haibin Wu Lin Zhang I-Ming Lin I-Hsiang Chiu Wenze Ren Yuan Tseng Yu Tsao Jyh-Shing Roger Jang Hung-yi Lee http://arxiv.org/abs/2603.08683v2 Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio 2026-06-05T15:11:26Z

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16 kHz to 48 kHz), and bit depths (8-, 16-, and 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full-resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
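
The vocabulary figures follow from the bit depth b: sample-level tokens need 2^b entries (2^16 = 65,536 and 2^24 = 16,777,216), whereas byte-level tokens always come from 256 values. The sketch below illustrates the general idea of splitting a sample into bytes; it is not Trilobyte's actual schema, and the function names, big-endian ordering, and signed encoding are assumptions made for illustration.

```python
def sample_to_byte_tokens(sample, bit_depth):
    """Split one signed PCM sample into big-endian byte tokens, each in 0..255."""
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a multiple of 8")
    n_bytes = bit_depth // 8
    # int.to_bytes raises OverflowError if the sample does not fit the bit depth.
    return list(sample.to_bytes(n_bytes, "big", signed=True))

def byte_tokens_to_sample(tokens):
    """Invert sample_to_byte_tokens, recovering the original sample exactly."""
    return int.from_bytes(bytes(tokens), "big", signed=True)

# Vocabulary size: sample-level grows as 2**b, byte-level stays at 256.
SAMPLE_LEVEL_VOCAB = {b: 2 ** b for b in (8, 16, 24)}
BYTE_LEVEL_VOCAB = 256
```

The trade-off is sequence length: a 24-bit sample becomes three tokens instead of one, in exchange for a vocabulary that no longer depends on bit depth. The round trip is exact, which is what lossless compression requires.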

2026-03-09T17:52:02Z Accepted at Interspeech 2026, 7 pages, 5 figures Phillip Long Zachary Novack Chris Donahue http://arxiv.org/abs/2606.07264v1 VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track 2026-06-05T13:39:39Z

Audio reasoning requires multi-step, evidence-grounded inference over temporally dynamic and acoustically mixed signals, exceeding conventional perception tasks such as ASR or captioning. We present VISA, our submission to the Interspeech 2026 Audio Reasoning Challenge (Agent Track), evaluated via the MMAR Rubrics for correctness and reasoning quality. Under a "LALM as a Tool" paradigm, VISA strengthens large audio language models with auxiliary multi-modal evidence while avoiding heavy orchestration. The system integrates three components: multi-modal feature extraction for complementary audio and acoustic-visual clues, model-voting inference with consistency checking for stable predictions, and fine-grained category-aware routing to resolve disagreements and select rubric-aligned reasoning chains. On the official Agent Track leaderboard, VISA ranks 2nd overall with a 66.23% Rubrics score. It also achieves 77.40% Accuracy, the highest among all systems listed across both the Single Model and Agent tracks.
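
A generic form of voting with a consistency check can be sketched as a majority vote that also reports whether the agreement is strong enough to trust. This is not VISA's implementation: the function name and the `min_agreement` threshold are hypothetical, and the category-aware routing that handles disagreements is represented only by the returned flag.

```python
from collections import Counter

def vote_with_consistency(answers, min_agreement=0.5):
    """Return (majority_answer, is_consistent) for a list of model answers."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    # Consistent only when strictly more than min_agreement of the models agree.
    is_consistent = top_count / len(answers) > min_agreement
    return top_answer, is_consistent
```

When the flag is false, a caller would hand the question to a separate disagreement-resolution step rather than accept the plurality answer.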

2026-06-05T13:39:39Z Submitted to INTERSPEECH 2026 Wenming Tu Jian Gao Yanru Huo Yixuan Wang Jing Peng Bohan Li Ziyang Ma Tao Liu Shuai Fan Kai Yu Xie Chen Zilong Zheng