https://arxiv.org/api/ha6BEra0dFKW+QeFTyJSoWCSum0 2026-03-20T10:38:37Z 9230 15 15 http://arxiv.org/abs/2603.16966v1 CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization 2026-03-17T09:00:32Z Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings. 
2026-03-17T09:00:32Z Accepted to CVPR 2026 Liangbin Huang Xiaohua Liao Chaoqun Cui Shijing Wang Zhaolong Huang Yanlong Du Wenji Mao http://arxiv.org/abs/2603.16259v1 Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction 2026-03-17T08:46:32Z Multimodal information extraction (MIE) constitutes a set of essential tasks aimed at extracting structural information from Web texts with integrated images, to facilitate the structural construction of Web-based semantic knowledge. To address the expanding category set including newly emerging entity types or relations on websites, prior research proposed the zero-shot MIE (ZS-MIE) task, which aims to extract unseen structural knowledge with textual and visual modalities. However, the ZS-MIE models are limited to recognizing the samples that fall within the unseen category set, and they struggle to deal with real-world scenarios that encompass both seen and unseen categories. The shortcomings of existing methods can be ascribed to two main aspects. On one hand, these methods construct representations of samples and categories within Euclidean space, failing to capture the hierarchical semantic relationships between the two modalities within a sample and their corresponding category prototypes. On the other hand, there is a notable gap in the distribution of semantic similarity between seen and unseen category sets, which impacts the generative capability of the ZS-MIE models. To overcome these disadvantages, we delve into the generalized zero-shot MIE (GZS-MIE) task and propose the hyperbolic multimodal generative representation learning framework (HMGRL). The variational information bottleneck and autoencoder networks are reconstructed with hyperbolic space for modeling the multi-level hierarchical semantic correlations among samples and prototypes. 
Furthermore, the proposed model is trained with the unseen samples generated by the decoder, and we introduce the semantic similarity distribution alignment loss to enhance the model's generalization performance. Experimental evaluations on two benchmark datasets underscore the superiority of HMGRL compared to existing baseline methods. 2026-03-17T08:46:32Z Accepted by WWW 2026 Baohang Zhou Kehui Song Rize Jin Yu Zhao Xuhui Sui Xinying Qian Xingyue Guo Ying Zhang http://arxiv.org/abs/2603.14267v2 DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization 2026-03-17T05:01:44Z Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics. 2026-03-15T07:53:23Z Accepted at CVPR 2026 Findings Ngoc-Son Nguyen Thanh V. T. 
Tran Jeongsoo Choi Hieu-Nghia Huynh-Nguyen Truong-Son Hy Van Nguyen http://arxiv.org/abs/2603.16093v1 Diffusion Models for Joint Audio-Video Generation 2026-03-17T03:31:37Z Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consist of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity audio-video generations. 2026-03-17T03:31:37Z Alejandro Paredes La Torre http://arxiv.org/abs/2512.07209v2 Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits 2026-03-16T23:51:27Z We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. 
To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity. 2025-12-08T06:45:11Z Source code: https://github.com/SonyResearch/CoherentAVEdit Masato Ishii Akio Hayakawa Takashi Shibuya Yuki Mitsufuji http://arxiv.org/abs/2603.15997v1 Visual Set Program Synthesizer 2026-03-16T23:15:54Z A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference. 
2026-03-16T23:15:54Z 10 pages, IEEE International Conference on Multimedia and Expo 2026 Zehua Cheng Wei Dai Wenhu Zhang Thomas Lukasiewicz Jiahao Sun http://arxiv.org/abs/2603.15597v1 AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer 2026-03-16T17:53:07Z Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning. 2026-03-16T17:53:07Z Accepted at ICLR 2026. 15 pages, 5 figures Pengjun Fang Yingqing He Yazhou Xing Qifeng Chen Ser-Nam Lim Harry Yang http://arxiv.org/abs/2603.15392v1 Multimodal Cyber-physical Interaction in XR: Hybrid Doctoral Thesis Defense 2026-03-16T15:08:11Z Academic events, such as a doctoral thesis defense, are typically limited to either physical co-location or flat video conferencing, resulting in rigid participation formats and fragmented presence. 
We present a multimodal framework that breaks this binary by supporting a spectrum of participation - from in-person attendance to immersive virtual reality (VR) or browser access - and report our findings from using it to organize the first ever hybrid doctoral thesis defense using extended reality (XR). The framework integrates full-body motion tracking to synchronize the user's avatar motions and gestures, enabling natural interaction with onsite participants as well as body language and gestures with remote attendees in the virtual world. It leverages WebXR to provide cross-platform and instant accessibility with easy setup. User feedback analysis reveals positive VR experiences and demonstrates the framework's effectiveness in supporting various hybrid event activities. 2026-03-16T15:08:11Z 10 pages, 3 figures, magazine paper Ahmad Alhilal Kit Yung Lam Lik-Hang Lee Xuetong Wang Sijia Li Matti Siekkinen Tristan Braud Pan Hui http://arxiv.org/abs/2510.17512v2 AWARE: Audio Watermarking with Adversarial Resistance to Edits 2026-03-16T10:48:28Z Prevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance to Edits), an alternative approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained through adversarial optimization in the time-frequency domain under a level-proportional perceptual budget. Detection employs a time-order-agnostic detector with a Bitwise Readout Head (BRH) that aggregates temporal evidence into one score per watermark bit, enabling reliable watermark decoding even under desynchronization and temporal cuts. 
Empirically, AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits, often surpassing representative state-of-the-art learning-based systems. 2025-10-20T13:10:52Z Kosta Pavlović Lazar Stanarević Petar Nedić Elena Nešović Slavko Kovačević Igor Djurović http://arxiv.org/abs/2510.12720v2 Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception 2026-03-16T10:45:28Z Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. 
Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions. 2025-10-14T17:00:09Z Accepted by ICLR 2026. Open Source at https://github.com/ddlBoJack/Omni-Captioner Ziyang Ma Ruiyang Xu Zhenghao Xing Yunfei Chu Yuxuan Wang Jinzheng He Jin Xu Pheng-Ann Heng Kai Yu Junyang Lin Eng Siong Chng Xie Chen http://arxiv.org/abs/2603.15083v1 ReactMotion: Generating Reactive Listener Motions from Speaker Utterance 2026-03-16T10:37:42Z In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, which conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. 
Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions. 2026-03-16T10:37:42Z 42 pages, 11 tables, 8 figures Cheng Luo Bizhu Wu Bing Li Jianfeng Ren Ruibin Bai Rong Qu Linlin Shen Bernard Ghanem http://arxiv.org/abs/2603.14992v1 Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos 2026-03-16T08:58:02Z Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff. 
2026-03-16T08:58:02Z 16 pages, 7 figures, 11 tables Chong Tian Yu Wang Chenxu Yang Junyi Guan Zheng Lin Yuhan Liu Xiuying Chen Qirong Ho http://arxiv.org/abs/2603.14976v1 Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation 2026-03-16T08:37:37Z Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcripts, which inherently encode a stable, time-independent semantic prior, as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods. 
2026-03-16T08:37:37Z Lingsi Zhu Yuefeng Zou Yunxiang Zhang Naixiang Zheng Guoyuan Wang Jun Yu Jiaen Liang Wei Huang Shengping Liu Ximin Zheng http://arxiv.org/abs/2603.14916v1 EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing 2026-03-16T07:24:56Z Recent text-guided image editing (TIE) models have achieved remarkable progress, yet many edited images still suffer from issues such as artifacts, unexpected edits, and unaesthetic content. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address these challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated along three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback for image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model for scaling up image editing. Both the dataset and code will be released in our GitHub repository: https://github.com/IntMeGroup/EditHF. 
2026-03-16T07:24:56Z Zitong Xu Huiyu Duan Zhongpeng Ji Xinyun Zhang Yutao Liu Xiongkuo Min Ke Gu Jian Zhang Shusong Xu Jinwei Chen Bo Li Guangtao Zhai http://arxiv.org/abs/2603.13099v2 Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation 2026-03-16T03:28:24Z We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning in which no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves a 32% improvement in Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation. 2026-03-13T15:48:15Z Wayner Barrios SouYoung Jin
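As an illustration of the step-level metrics described in the CRYSTAL abstract above, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the similarity model, the matching procedure, the ordering penalty, or the exact form of the reward, so everything here is an assumption: `token_jaccard` is a toy stand-in for semantic similarity, matching is greedy one-to-one above a hypothetical threshold, the ordered variant counts only matches that form a longest increasing subsequence of reference positions, and `cpr_reward` is a plain product of answer correctness and step score.

```python
from typing import Callable, List, Sequence, Tuple


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_steps(
    pred: Sequence[str],
    ref: Sequence[str],
    sim: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching of predicted steps to reference steps."""
    scored = sorted(
        ((sim(p, r), i, j) for i, p in enumerate(pred) for j, r in enumerate(ref)),
        reverse=True,
    )
    used_pred, used_ref, matches = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs are all below the threshold
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        matches.append((i, j))
    return sorted(matches)  # ordered by predicted-step index


def f1(n_matched: int, n_pred: int, n_ref: int) -> float:
    """F1 from a match count; precision = matched/pred, recall = matched/ref."""
    if n_matched == 0:
        return 0.0
    precision, recall = n_matched / n_pred, n_matched / n_ref
    return 2 * precision * recall / (precision + recall)


def longest_increasing(values: Sequence[int]) -> int:
    """Length of the longest strictly increasing subsequence (O(n^2))."""
    best: List[int] = []
    for k, v in enumerate(values):
        best.append(1 + max((best[m] for m in range(k) if values[m] < v), default=0))
    return max(best, default=0)


def match_f1(pred: Sequence[str], ref: Sequence[str]) -> float:
    return f1(len(match_steps(pred, ref)), len(pred), len(ref))


def ordered_match_f1(pred: Sequence[str], ref: Sequence[str]) -> float:
    """Like match_f1, but only matches in reference order are counted."""
    ref_positions = [j for _, j in match_steps(pred, ref)]
    return f1(longest_increasing(ref_positions), len(pred), len(ref))


def cpr_reward(answer_correct: bool, step_score: float) -> float:
    """Multiplicative coupling (assumed form): a wrong answer zeroes the reward."""
    return float(answer_correct) * step_score


if __name__ == "__main__":
    ref = ["read radius from figure", "compute area of circle", "compare with square area"]
    pred = ["compute area of circle", "read radius from figure"]  # right steps, wrong order
    print(match_f1(pred, ref), ordered_match_f1(pred, ref))
    print(cpr_reward(True, match_f1(pred, ref)), cpr_reward(False, match_f1(pred, ref)))
```

In this toy case both predicted steps match a reference step, but in swapped order, so the ordered score falls below the unordered one. The product form makes the reward zero whenever the final answer is wrong, which is one simple way to couple the two signals rather than adding them.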