https://arxiv.org/api/L2cpnWWmzAV6Jj8/q2HWyiSKujo 2026-03-24T13:25:42Z 9246 75 15 http://arxiv.org/abs/2603.10314v1 PRoADS: Provably Secure and Robust Audio Diffusion Steganography with latent optimization and backward Euler Inversion 2026-03-11T01:29:30Z This paper proposes PRoADS, a provably secure and robust audio steganographic framework based on audio diffusion models. As a generative steganography scheme, PRoADS embeds secret messages into the initial noise of diffusion models via orthogonal matrix projection. To address the reconstruction errors in diffusion inversion that cause high bit error rates (BER), we introduce Latent Optimization and Backward Euler Inversion to minimize the latent reconstruction and diffusion inversion errors. Comprehensive experiments demonstrate that our scheme sustains a remarkably low BER of 0.15\% under 64 kbps MP3 compression, significantly outperforming existing methods and exhibiting strong robustness. 2026-03-11T01:29:30Z This paper has been accepted for presentation at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026) YongPeng Yan Yanan Li Qiyang Xiao Yanzhen Ren http://arxiv.org/abs/2508.11872v2 Singing Syllabi with Virtual Avatars: Enhancing Student Engagement Through AI-Generated Music and Digital Embodiment 2026-03-10T13:14:42Z In practical teaching, we observe that few students thoroughly read or fully comprehend the information provided in traditional, text-based course syllabi. As a result, essential details, such as course policies and learning outcomes, are frequently overlooked. To address this challenge, in this paper, we propose a novel approach leveraging AI-generated singing and virtual avatars to present syllabi in a format that is more visually appealing, engaging, and memorable. Specifically, we leveraged the open-source tool HeyGem to transform textual syllabi into audiovisual presentations in which digital avatars perform the syllabus content as songs. 
The proposed approach aims to stimulate students' curiosity, foster emotional connection, and enhance retention of critical course information. Student feedback indicated that AI-sung syllabi significantly improved awareness and recall of key course information. 2025-08-16T02:12:39Z 19 pages, 3 figures, 2 tables Xinxing Wu http://arxiv.org/abs/2505.11237v2 Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification 2026-03-10T13:05:59Z Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. 
Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT. 2025-05-16T13:27:57Z ICMR'25, June 30-July 3, 2025, Chicago, IL, USA Wenhao Qian Zhenzhen Hu Zijie Song Jia Li 10.1145/3731715.3733296 http://arxiv.org/abs/2512.00883v2 Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound 2026-03-10T13:04:08Z World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. 
Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent's performance. 2025-11-30T13:11:56Z Jiahua Wang Leqi Zheng Jialong Wu Yaoxin Mao http://arxiv.org/abs/2603.09541v1 Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA 2026-03-10T11:51:54Z Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency. 
2026-03-10T11:51:54Z Xin Lu Rui Li Xun Huang Weixin Li Chuanqing Zhuang Jiayuan Li Zhengda Lu Jun Xiao Yunhong Wang http://arxiv.org/abs/2603.09536v1 Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective 2026-03-10T11:48:37Z In virtual reality (VR) educational scenarios, Pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures. This limitation reduces their ability to dynamically adapt to the semantic context of instructional content. As a result, interactions often lack naturalness and effectiveness in the teaching process. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through user experience-oriented subjective experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners' perceived learning effectiveness, engagement, and intention to use, while effectively alleviating feelings of fatigue and boredom during the learning process. Furthermore, the combined dynamic expression of speech and gestures notably enhances learners' perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs. 
2026-03-10T11:48:37Z Ninghao Wan Jiarun Song Fuzheng Yang http://arxiv.org/abs/2603.09478v1 MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning 2026-03-10T10:30:59Z Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and cross-modal reasoning abilities. Existing methods, mainly classification-based or generation-based without reasoning, struggle to handle complex extraction scenarios in the MORE task and suffer from limited scalability and intermediate reasoning transparency. To address these challenges, we propose MORE-R1, a novel model that introduces explicit stepwise reasoning with Reinforcement Learning (RL) to enable a Large Vision-Language Model (LVLM) to address the MORE task effectively. MORE-R1 integrates a two-stage training process, including an initial cold-start training stage with Supervised Fine-Tuning (SFT) and a subsequent RL stage for reasoning ability optimization. In the initial stage, we design an efficient way to automatically construct a high-quality SFT dataset containing fine-grained stepwise reasoning tailored to the MORE task, enabling the model to learn an effective reasoning paradigm. In the subsequent stage, we employ the Group Relative Policy Optimization (GRPO) RL algorithm with a Progressive Sample-Mixing Strategy to stabilize training and further enhance the model's reasoning ability on hard samples. Comprehensive experiments on the MORE benchmark demonstrate that MORE-R1 achieves state-of-the-art performance with significant improvement over baselines. 2026-03-10T10:30:59Z Accepted by the 31st International Conference on Database Systems for Advanced Applications. 
This is the Accepted Manuscript (AM) version Xiang Yuan Xu Chu Xinrong Chen Haochen Li Zonghong Dai Hongcheng Fan Xiaoyue Yuan Weiping Li Tong Mo http://arxiv.org/abs/2603.09294v1 Latency Effects on Multi-Dimensional QoE in Networked VR Whiteboards 2026-03-10T07:23:16Z Networked virtual reality (NVR) whiteboards are increasingly important for enabling geographically dispersed users to engage in real-time idea sharing, collaborative design, and discussion. However, latency caused by network limitations, rendering delays, or synchronization issues can significantly degrade the Quality of Experience (QoE) in whiteboard collaboration. To systematically investigate the impact of latency, this study classified QoE into pragmatic and hedonic aspects, each comprising multiple sub-dimensions. Controlled experiments were conducted to identify the sub-dimensions most affected by latency, which were then adopted as the primary QoE indicators, with the aim of uncovering the processes and mechanisms through which latency shapes QoE. Building on this, we further examined how these impacts vary across different collaboration modes, namely sequential collaboration (SC) for structured design workflows and free collaboration (FC) for open discussion. We also compared two VR whiteboard types, one with avatars (VR+) and the other without avatars (VR), and included a traditional PC-based whiteboard as a baseline. This multi-dimensional design enables a comprehensive evaluation of latency's impact on QoE across collaboration modes and platforms, providing practical guidance for optimizing NVR whiteboard systems under real-world network and system constraints. 
2026-03-10T07:23:16Z Jiarun Song Yongkang Hou Fuzheng Yang http://arxiv.org/abs/2603.09264v1 TPIFM: A Task-Aware Model for Evaluating Perceptual Interaction Fluency in Remote AR Collaboration 2026-03-10T06:53:58Z Remote Collaborative Augmented Reality (RCAR) enables geographically distributed users to collaborate by integrating virtual and physical environments. However, because RCAR relies on real-time transmission, it is susceptible to delay and stalling impairments under constrained network conditions. Perceptual interaction fluency (PIF), defined as the perceived pace and responsiveness of collaboration, is influenced not only by physical network impairments but also by intrinsic task characteristics. These characteristics can be interpreted as the task-specific just-noticeable difference (JND), i.e., the maximal tolerable temporal responsiveness before PIF degrades. When the average response time (ART), measured as the mean time per operation from receiving collaborator feedback to initiating the next action, falls within the JND, PIF is generally sustained, whereas values exceeding it indicate disruption. Tasks differ in their JNDs, reflecting distinct temporal responsiveness demands and sensitivities to impairments. From the perspective of the Free Energy Principle (FEP), tasks with lower JNDs impose stricter temporal prediction demands, making PIF more vulnerable to impairments, whereas higher JNDs allow greater tolerance. On this basis, we classify RCAR tasks by JND and evaluate their PIF through controlled subjective experiments under delay, stalling, and hybrid conditions. Building on these findings, we propose the Task-Aware Perceptual Interaction Fluency Model (TPIFM). Experimental results show that TPIFM accurately assesses PIF under network impairments, providing guidance for adaptive RCAR design and user experience optimization under network constraints. 
2026-03-10T06:53:58Z Jiarun Song Ninghao Wan Fuzheng Yang Weisi Lin http://arxiv.org/abs/2603.09261v1 From Perception to Cognition: How Latency Affects Interaction Fluency and Social Presence in VR Conferencing 2026-03-10T06:49:40Z Virtual reality (VR) conferencing has the potential to provide geographically dispersed users with an immersive environment, enabling rich social interactions and user experience using avatars. However, remote communication in VR inevitably introduces end-to-end (E2E) latency, which can significantly impact user experience. To clarify the impact of latency, we conducted subjective experiments to analyze how it influences interaction fluency from the perspective of quality perception and social presence from the perspective of social cognition, comparing VR conferencing with traditional video conferencing (VC). Specifically, interaction fluency emphasizes user perception of interaction pace and responsiveness and is assessed using the Absolute Category Rating (ACR) method. In contrast, social presence focuses on the cognitive understanding of interaction, specifically whether individuals can comprehend the intentions, emotions, and behaviors expressed by others. It is primarily measured using the Networked Minds Social Presence Inventory (NMSPI). Building on this analysis, we further investigate the relationship between interaction fluency and social presence under different latency conditions to clarify the underlying perceptual and cognitive mechanisms. The findings from these subjective tests provide meaningful insights for optimizing the related systems, helping to improve interaction fluency and enhancing social presence in immersive virtual environments. 2026-03-10T06:49:40Z Jiarun Song Ninghao Wan FuZheng Yang Weisi Lin http://arxiv.org/abs/2510.18533v3 Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification 2026-03-10T02:53:18Z Robust speaker verification under noisy conditions remains an open challenge. 
Conventional deep learning methods learn a robust unified speaker representation space against diverse background noise and achieve significant improvement. In contrast, this paper presents a noise-conditioned mixture-of-experts framework that decomposes the feature space into specialized noise-aware subspaces for speaker verification. Specifically, we propose a noise-conditioned expert routing mechanism, a universal model based expert specialization strategy, and an SNR-decaying curriculum learning protocol, collectively improving model robustness and generalization under diverse noise conditions. The proposed method can automatically route inputs to expert networks based on noise information derived from the inputs, where each expert targets distinct noise characteristics while preserving speaker identity information. Comprehensive experiments demonstrate consistent superiority over baselines. 2025-10-21T11:23:32Z Accepted by Signal Processing Letters Bin Gu Haitao Zhao Jibo Wei http://arxiv.org/abs/2603.08936v1 VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs 2026-03-09T21:10:34Z Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. 
Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions. 2026-03-09T21:10:34Z submitted to Interspeech 2026 Hezhao Zhang Huang-Cheng Chou Shrikanth Narayanan Thomas Hain http://arxiv.org/abs/2603.08927v1 MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering 2026-03-09T20:53:51Z Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io. 2026-03-09T20:53:51Z MEGC 2026 at IEEE FG 2026 Xinqi Fan Jingting Li John See Moi Hoon Yap Su-Jing Wang Adrian K. 
Davison http://arxiv.org/abs/2603.08417v1 Scalable On-the-fly Transcoding for Adaptive Streaming of Dynamic Point Clouds 2026-03-09T14:14:57Z On-the-fly transcoding of dynamic point cloud sequences reduces storage requirements and virtually increases the number of available representations for on-demand streaming scenarios. On-the-fly transcoding, however, introduces additional workload to media providers' infrastructure. While V-PCC encoded content can be efficiently transcoded by re-encoding the underlying video bitstreams, which greatly benefits from hardware-accelerated video codec implementations, the scalability of such a system remains unclear. In this work, we introduce and evaluate a dynamic point cloud streaming system that utilizes on-the-fly transcoding. We explore the limits of scalability of this system in terms of request fulfillment times, specifically evaluating the perceived user Quality of Experience. We empirically show how caching and speculative transcoding significantly reduce transcoding loads, allowing the system to scale to a higher number of simultaneous clients. 2026-03-09T14:14:57Z 7 pages, 6 figures Michael Rudolph Matthias De Fré Finn Schnier Tim Wauters Amr Rizk http://arxiv.org/abs/2603.08154v1 Soundscapes in Spectrograms: Pioneering Multilabel Classification for South Asian Sounds 2026-03-09T09:36:12Z Environmental sound classification is a field of growing importance for urban monitoring and cultural soundscape analysis, especially within the acoustically rich environments of South Asia. These regions present a unique challenge as multiple natural, human, and cultural sounds often overlap, straining traditional methods that frequently rely on Mel Frequency Cepstral Coefficients (MFCC). This study introduces a novel spectrogram-based methodology with a superior ability to capture these complex auditory patterns. 
A Convolutional Neural Network (CNN) architecture is implemented to solve a demanding multilabel, multiclass classification problem on the SAS-KIIT dataset. To demonstrate robustness and comparability, the approach is also validated using the renowned UrbanSound8K dataset. The results confirm that the proposed spectrogram-based method significantly outperforms existing MFCC-based techniques, achieving higher classification accuracy across both datasets. This improvement lays the groundwork for more robust and accurate audio classification systems in real-world applications. 2026-03-09T09:36:12Z Sudip Chakrabarty Pappu Bishwas Rajdeep Chatterjee Tathagata Bandyopadhyay Digonto Biswas Bibek Howlader