https://arxiv.org/api/L2cpnWWmzAV6Jj8/q2HWyiSKujo
2026-03-24T13:25:42Z
9246
75
15
http://arxiv.org/abs/2603.10314v1
PRoADS: Provably Secure and Robust Audio Diffusion Steganography with latent optimization and backward Euler Inversion
2026-03-11T01:29:30Z
This paper proposes PRoADS, a provably secure and robust audio steganographic framework based on audio diffusion models. As a generative steganography scheme, PRoADS embeds secret messages into the initial noise of diffusion models via orthogonal matrix projection. To address the reconstruction errors in diffusion inversion that cause high bit error rates (BER), we introduce Latent Optimization and Backward Euler Inversion to minimize the latent reconstruction and diffusion inversion errors. Comprehensive experiments demonstrate that our scheme sustains a remarkably low BER of 0.15\% under 64 kbps MP3 compression, significantly outperforming existing methods and exhibiting strong robustness.
2026-03-11T01:29:30Z
This paper has been accepted for presentation at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
YongPeng Yan
Yanan Li
Qiyang Xiao
Yanzhen Ren
http://arxiv.org/abs/2508.11872v2
Singing Syllabi with Virtual Avatars: Enhancing Student Engagement Through AI-Generated Music and Digital Embodiment
2026-03-10T13:14:42Z
In practical teaching, we observe that few students thoroughly read or fully comprehend the information provided in traditional, text-based course syllabi. As a result, essential details, such as course policies and learning outcomes, are frequently overlooked. To address this challenge, in this paper, we propose a novel approach leveraging AI-generated singing and virtual avatars to present syllabi in a format that is more visually appealing, engaging, and memorable. Especially, we leveraged the open-source tool, HeyGem, to transform textual syllabi into audiovisual presentations, in which digital avatars perform the syllabus content as songs. The proposed approach aims to stimulate students' curiosity, foster emotional connection, and enhance retention of critical course information. Student feedback indicated that AI-sung syllabi significantly improved awareness and recall of key course information.
2025-08-16T02:12:39Z
19 pages, 3 figures, 2 tables
Xinxing Wu
http://arxiv.org/abs/2505.11237v2
Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
2026-03-10T13:05:59Z
Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces \textbf{C}oncept \textbf{D}rift \textbf{G}uided \textbf{L}ayerNorm \textbf{T}uning (\textbf{CDGLT}), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy, that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available: \href{https://github.com/Qianvenh/CDGLT}{https://github.com/Qianvenh/CDGLT}.
2025-05-16T13:27:57Z
ICMR'25, June 30-July 3, 2025, Chicago, IL, USA
Wenhao Qian
Zhenzhen Hu
Zijie Song
Jia Li
10.1145/3731715.3733296
http://arxiv.org/abs/2512.00883v2
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
2026-03-10T13:04:08Z
World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent's performance.
2025-11-30T13:11:56Z
Jiahua Wang
Leqi Zheng
Jialong Wu
Yaoxin Mao
http://arxiv.org/abs/2603.09541v1
Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
2026-03-10T11:51:54Z
Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
2026-03-10T11:51:54Z
Xin Lu
Rui Li
Xun Huang
Weixin Li
Chuanqing Zhuang
Jiayuan Li
Zhengda Lu
Jun Xiao
Yunhong Wang
http://arxiv.org/abs/2603.09536v1
Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective
2026-03-10T11:48:37Z
In virtual reality (VR) educational scenarios, Pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures. This limitation reduces their ability to dynamically adapt to the semantic context of instructional content. As a result, interactions often lack naturalness and effectiveness in the teaching process. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through user experience-oriented subjective experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners' perceived learning effectiveness, engagement, and intention to use, while effectively alleviating feelings of fatigue and boredom during the learning process. Furthermore, the combined dynamic expression of speech and gestures notably enhances learners' perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs.
2026-03-10T11:48:37Z
Ninghao Wan
Jiarun Song
Fuzheng Yang
http://arxiv.org/abs/2603.09478v1
MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning
2026-03-10T10:30:59Z
Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and cross-modal reasoning abilities. Existing methods, mainly classification-based or generation-based without reasoning, struggle to handle complex extraction scenarios in the MORE task and suffer from limited scalability and intermediate reasoning transparency. To address these challenges, we propose MORE-R1, a novel model that introduces explicit stepwise reasoning with Reinforcement Learning (RL) to enable Large Vision-Language Model (LVLM) to address the MORE task effectively. MORE-R1 integrates a two-stage training process, including an initial cold-start training stage with Supervised Fine-Tuning (SFT) and a subsequent RL stage for reasoning ability optimization. In the initial stage, we design an efficient way to automatically construct a high-quality SFT dataset containing fine-grained stepwise reasoning tailored to the MORE task, enabling the model to learn an effective reasoning paradigm. In the subsequent stage, we employ the Group Relative Policy Optimization (GRPO) RL algorithm with a Progressive Sample-Mixing Strategy to stabilize training and further enhance model's reasoning ability on hard samples. Comprehensive experiments on the MORE benchmark demonstrate that MORE-R1 achieves state-of-the-art performance with significant improvement over baselines.
2026-03-10T10:30:59Z
Accepted by the 31st International Conference on Database Systems for Advanced Applications. This is the Accepted Manuscript (AM) version
Xiang Yuan
Xu Chu
Xinrong Chen
Haochen Li
Zonghong Dai
Hongcheng Fan
Xiaoyue Yuan
Weiping Li
Tong Mo
http://arxiv.org/abs/2603.09294v1
Latency Effects on Multi-Dimensional QoE in Networked VR Whiteboards
2026-03-10T07:23:16Z
Networked virtual reality (NVR) whiteboards are increasingly important for enabling geographically dispersed users to engage in real-time idea sharing, collaborative design, and discussion. However, latency caused by network limitations, rendering delays, or synchronization issues can significantly degrade the Quality of Experience (QoE) in whiteboard collaboration. To systematically investigate the impact of latency, this study classified QoE into pragmatic and hedonic aspects, each comprising multiple sub-dimensions. Controlled experiments were conducted to identify the sub-dimensions most affected by latency, which were then adopted as the primary QoE indicators, with the aim of uncovering the processes and mechanisms through which latency shapes QoE. Building on this, we further examined how these impacts vary across different collaboration modes, namely sequential collaboration (SC) for structured design workflows and free collaboration (FC) for open discussion. We also compared two VR whiteboard types, one with avatars (VR+) and the other without avatars (VR), and included a traditional PC-based whiteboard as a baseline. This multi-dimensional design enables a comprehensive evaluation of latency's impact on QoE across collaboration modes and platforms, providing practical guidance for optimizing NVR whiteboard systems under real-world network and system constraints.
2026-03-10T07:23:16Z
Jiarun Song
Yongkang Hou
Fuzheng Yang
http://arxiv.org/abs/2603.09264v1
TPIFM: A Task-Aware Model for Evaluating Perceptual Interaction Fluency in Remote AR Collaboration
2026-03-10T06:53:58Z
Remote Collaborative Augmented Reality (RCAR) enables geographically distributed users to collaborate by integrating virtual and physical environments. However, because RCAR relies on real-time transmission, it is susceptible to delay and stalling impairments under constrained network conditions. Perceptual interaction fluency (PIF), defined as the perceived pace and responsiveness of collaboration, is influenced not only by physical network impairments but also by intrinsic task characteristics. These characteristics can be interpreted as the task-specific just-noticeable difference (JND), i.e., the maximal tolerable temporal responsiveness before PIF degrades. When the average response time (ART), measured as the mean time per operation from receiving collaborator feedback to initiating the next action, falls within the JND, PIF is generally sustained, whereas values exceeding it indicate disruption. Tasks differ in their JNDs, reflecting distinct temporal responsiveness demands and sensitivities to impairments. From the perspective of the Free Energy Principle (FEP), tasks with lower JNDs impose stricter temporal prediction demands, making PIF more vulnerable to impairments, whereas higher JNDs allow greater tolerance. On this basis, we classify RCAR tasks by JND and evaluate their PIF through controlled subjective experiments under delay, stalling, and hybrid conditions. Building on these findings, we propose the Task-Aware Perceptual Interaction Fluency Model (TPIFM). Experimental results show that TPIFM accurately assesses PIF under network impairments, providing guidance for adaptive RCAR design and user experience optimization under network constraints.
2026-03-10T06:53:58Z
Jiarun Song
Ninghao Wan
Fuzheng Yang
Weisi Lin
http://arxiv.org/abs/2603.09261v1
From Perception to Cognition: How Latency Affects Interaction Fluency and Social Presence in VR Conferencing
2026-03-10T06:49:40Z
Virtual reality (VR) conferencing has the potential to provide geographically dispersed users with an immersive environment, enabling rich social interactions and user experience using avatars. However, remote communication in VR inevitably introduces end-to-end (E2E) latency, which can significantly impact user experience. To clarify the impact of latency, we conducted subjective experiments to analyze how it influences interaction fluency from the perspective of quality perception and social presence from the perspective of social cognition, comparing VR conferencing with traditional video conferencing (VC). Specifically, interaction fluency emphasizes user perception of interaction pace and responsiveness and is assessed using Absolute Category Rating (ACR) method. In contrast, social presence focuses on the cognitive understanding of interaction, specifically whether individuals can comprehend the intentions, emotions, and behaviors expressed by others. It is primarily measured using the Networked Minds Social Presence Inventory (NMSPI). Building on this analysis, we further investigate the relationship between interaction fluency and social presence under different latency conditions to clarify the underlying perceptual and cognitive mechanisms. The findings from these subjective tests provide meaningful insights for optimizing the related systems, helping to improve interaction fluency and enhancing social presence in immersive virtual environments.
2026-03-10T06:49:40Z
Jiarun Song
Ninghao Wan
FuZheng Yang
Weisi Lin
http://arxiv.org/abs/2510.18533v3
Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
2026-03-10T02:53:18Z
Robust speaker verification under noisy conditions remains an open challenge. Conventional deep learning methods learn a robust unified speaker representation space against diverse background noise and achieve significant improvement. In contrast, this paper presents a noise-conditioned mixture-ofexperts framework that decomposes the feature space into specialized noise-aware subspaces for speaker verification. Specifically, we propose a noise-conditioned expert routing mechanism, a universal model based expert specialization strategy, and an SNR-decaying curriculum learning protocol, collectively improving model robustness and generalization under diverse noise conditions. The proposed method can automatically route inputs to expert networks based on noise information derived from the inputs, where each expert targets distinct noise characteristics while preserving speaker identity information. Comprehensive experiments demonstrate consistent superiority over baselines
2025-10-21T11:23:32Z
Accepted by Signal Processing Letters
Bin Gu
Haitao Zhao
Jibo Wei
http://arxiv.org/abs/2603.08936v1
VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
2026-03-09T21:10:34Z
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLMs benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
2026-03-09T21:10:34Z
submitted to Interspeech 2026
Hezhao Zhang
Huang-Cheng Chou
Shrikanth Narayanan
Thomas Hain
http://arxiv.org/abs/2603.08927v1
MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering
2026-03-09T20:53:51Z
Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.
2026-03-09T20:53:51Z
MEGC 2026 at IEEE FG 2026
Xinqi Fan
Jingting Li
John See
Moi Hoon Yap
Su-Jing Wang
Adrian K. Davison
http://arxiv.org/abs/2603.08417v1
Scalable On-the-fly Transcoding for Adaptive Streaming of Dynamic Point Clouds
2026-03-09T14:14:57Z
On-the-fly transcoding of dynamic point cloud sequences reduces storage requirements and virtually increases the number of available representations for on demand streaming scenarios. On-the-fly transcoding introduces, however, additional workload to media providers' infrastructure. While V-PCC encoded content can be efficiently transcoded by re-encoding the underlying video bitstreams, which greatly benefits from hardware-accelerated video codec implementations, the scalability of such a system remains unclear. In this work, we introduce and evaluate a dynamic point cloud streaming system that utilizes on-the-fly transcoding. We explore the limits of scalability of this system in terms of request fulfillment times, specifically evaluating the perceived user Quality of Experience. We empirically show how caching and speculative transcoding allow to significantly reduce transcoding loads, allowing to scale to a higher number of simultaneous clients.
2026-03-09T14:14:57Z
7 pages, 6 figures
Michael Rudolph
Matthias De Fré
Finn Schnier
Tim Wauters
Amr Rizk
http://arxiv.org/abs/2603.08154v1
Soundscapes in Spectrograms: Pioneering Multilabel Classification for South Asian Sounds
2026-03-09T09:36:12Z
Environmental sound classification is a field of growing importance for urban monitoring and cultural soundscape analysis, especially within the acoustically rich environments of South Asia. These regions present a unique challenge as multiple natural, human, and cultural sounds often overlap, straining traditional methods that frequently rely on Mel Frequency Cepstral Coefficients (MFCC). This study introduces a novel spectrogram-based methodology with a superior ability to capture these complex auditory patterns. A Convolutional Neural Network (CNN) architecture is implemented to solve a demanding multilabel, multiclass classification problem on the SAS-KIIT dataset. To demonstrate robustness and comparability, the approach is also validated using the renowned UrbanSound8K dataset. The results confirm that the proposed spectrogram-based method significantly outperforms existing MFCC-based techniques, achieving higher classification accuracy across both datasets. This improvement lays the groundwork for more robust and accurate audio classification systems in real-world applications.
2026-03-09T09:36:12Z
Sudip Chakrabarty
Pappu Bishwas
Rajdeep Chatterjee
Tathagata Bandyopadhyay
Digonto Biswas
Bibek Howlader