https://arxiv.org/api/ha6BEra0dFKW+QeFTyJSoWCSum02026-03-20T10:38:37Z92301515http://arxiv.org/abs/2603.16966v1CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization2026-03-17T09:00:32ZTraditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.2026-03-17T09:00:32ZAccepted to CVPR 2026Liangbin HuangXiaohua LiaoChaoqun CuiShijing WangZhaolong HuangYanlong DuWenji Maohttp://arxiv.org/abs/2603.16259v1Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction2026-03-17T08:46:32ZMultimodal information extraction (MIE) constitutes a set of essential tasks aimed at extracting structural information from Web texts with integrating images, to facilitate the structural construction of Web-based semantic knowledge. To address the expanding category set including newly emerging entity types or relations on websites, prior research proposed the zero-shot MIE (ZS-MIE) task which aims to extract unseen structural knowledge with textual and visual modalities. However, the ZS-MIE models are limited to recognizing the samples that fall within the unseen category set, and they struggle to deal with real-world scenarios that encompass both seen and unseen categories. The shortcomings of existing methods can be ascribed to two main aspects. On one hand, these methods construct representations of samples and categories within Euclidean space, failing to capture the hierarchical semantic relationships between the two modalities within a sample and their corresponding category prototypes. On the other hand, there is a notable gap in the distribution of semantic similarity between seen and unseen category sets, which impacts the generative capability of the ZS-MIE models. To overcome the disadvantages, we delve into the generalized zero-shot MIE (GZS-MIE) task and propose the hyperbolic multimodal generative representation learning framework (HMGRL). The variational information bottleneck and autoencoder networks are reconstructed with hyperbolic space for modeling the multi-level hierarchical semantic correlations among samples and prototypes. Furthermore, the proposed model is trained with the unseen samples generated by the decoder, and we introduce the semantic similarity distribution alignment loss to enhance the model's generalization performance. Experimental evaluations on two benchmark datasets underscore the superiority of HMGRL compared to existing baseline methods.2026-03-17T08:46:32ZAccepted by WWW 2026Baohang ZhouKehui SongRize JinYu ZhaoXuhui SuiXinying QianXingyue GuoYing Zhanghttp://arxiv.org/abs/2603.14267v2DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization2026-03-17T05:01:44ZVideo dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.2026-03-15T07:53:23ZAccepted at CVPR 2026 FindingsNgoc-Son NguyenThanh V. T. TranJeongsoo ChoiHieu-Nghia Huynh-NguyenTruong-Son HyVan Nguyenhttp://arxiv.org/abs/2603.16093v1Diffusion Models for Joint Audio-Video Generation2026-03-17T03:31:37ZMultimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consisting on 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity generations of audio video generation.2026-03-17T03:31:37ZAlejandro Paredes La Torrehttp://arxiv.org/abs/2512.07209v2Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits2026-03-16T23:51:27ZWe introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.2025-12-08T06:45:11ZSource code: https://github.com/SonyResearch/CoherentAVEditMasato IshiiAkio HayakawaTakashi ShibuyaYuki Mitsufujihttp://arxiv.org/abs/2603.15997v1Visual Set Program Synthesizer2026-03-16T23:15:54ZA user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual Al assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard endto-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.2026-03-16T23:15:54Z10 pages, IEEE International Conference on Multimedia and Expo 2026IEEE International Conference on Multimedia and Expo 2026Zehua ChengWei DaiWenhu ZhangThomas LukasiewiczJiahao Sunhttp://arxiv.org/abs/2603.15597v1AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer2026-03-16T17:53:07ZExisting video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.2026-03-16T17:53:07ZAccepted at ICLR 2026. 15 pages, 5 figuresPengjun FangYingqing HeYazhou XingQifeng ChenSer-Nam LimHarry Yanghttp://arxiv.org/abs/2603.15392v1Multimodal Cyber-physical Interaction in XR: Hybrid Doctoral Thesis Defense2026-03-16T15:08:11ZAcademic events, such as a doctoral thesis defense, are typically limited to either physical co-location or flat video conferencing, resulting in rigid participation formats and fragmented presence. We present a multimodal framework that breaks this binary by supporting a spectrum of participation - from in-person attendance to immersive virtual reality (VR) or browser access - and report our findings from using it to organize the first ever hybrid doctoral thesis defense using extended reality (XR). The framework integrates full-body motion tracking to synchronize the user's avatar motions and gestures, enabling natural interaction with onsite participants as well as body language and gestures with remote attendees in the virtual world. It leverages WebXR to provide cross-platform and instant accessibility with easy setup. User feedback analysis reveals positive VR experiences and demonstrates the framework's effectiveness in supporting various hybrid event activities.2026-03-16T15:08:11Z10 pages, 3 figures, magazine paperAhmad AlhilalKit Yung LamLik-Hang LeeXuetong WangSijia LiMatti SiekkinenTristan BraudPan Huihttp://arxiv.org/abs/2510.17512v2AWARE: Audio Watermarking with Adversarial Resistance to Edits2026-03-16T10:48:28ZPrevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance to Edits), an alternative approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained through adversarial optimization in the time-frequency domain under a level-proportional perceptual budget. Detection employs a time-order-agnostic detector with a Bitwise Readout Head (BRH) that aggregates temporal evidence into one score per watermark bit, enabling reliable watermark decoding even under desynchronization and temporal cuts. Empirically, AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits, often surpassing representative state-of-the-art learning-based systems.2025-10-20T13:10:52ZKosta PavlovićLazar StanarevićPetar NedićElena Nešović Slavko KovačevićIgor Djurovićhttp://arxiv.org/abs/2510.12720v2Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception2026-03-16T10:45:28ZFine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains limited explored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.2025-10-14T17:00:09ZAccepted by ICLR2026. Open Source at https://github.com/ddlBoJack/Omni-CaptionerZiyang MaRuiyang XuZhenghao XingYunfei ChuYuxuan WangJinzheng HeJin XuPheng-Ann HengKai YuJunyang LinEng Siong ChngXie Chenhttp://arxiv.org/abs/2603.15083v1ReactMotion: Generating Reactive Listener Motions from Speaker Utterance2026-03-16T10:37:42ZIn this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, where conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.2026-03-16T10:37:42Z42 pages, 11 tables, 8 figuresCheng LuoBizhu WuBing LiJianfeng RenRuibin BaiRong QuLinlin ShenBernard Ghanemhttp://arxiv.org/abs/2603.14992v1Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos2026-03-16T08:58:02ZShort-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.2026-03-16T08:58:02Z16 pages, 7 figures, 11 tablesChong TianYu WangChenxu YangJunyi GuanZheng LinYuhan LiuXiuying ChenQirong Hohttp://arxiv.org/abs/2603.14976v1Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation2026-03-16T08:37:37ZEstimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcript--which inherently encode a stable, time-independent semantic prior--as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.2026-03-16T08:37:37ZLingsi ZhuYuefeng ZouYunxiang ZhangNaixiang ZhengGuoyuan WangJun YuJiaen LiangWei HuangShengping LiuXimin Zhenghttp://arxiv.org/abs/2603.14916v1EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing2026-03-16T07:24:56ZRecent text-guided image editing (TIE) models have achieved remarkable progress, while many edited images still suffer from issues such as artifacts, unexpected editings, unaesthetic contents. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address the challenges, we first introduce \textbf{EditHF-1M}, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, \textit{i.e.}, visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose \textbf{EditHF}, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback from image editing. Finally, we introduce \textbf{EditHF-Reward}, which utilizes EditHF as the reward signal to optimize the text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune the Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale-up the image editing. Both the dataset and code will be released in our GitHub repository: https://github.com/IntMeGroup/EditHF.2026-03-16T07:24:56ZZitong XuHuiyu DuanZhongpeng JiXinyun ZhangYutao LiuXiongkuo MinKe GuJian ZhangShusong XuJinwei ChenBo LiGuangtao Zhaihttp://arxiv.org/abs/2603.13099v2Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation2026-03-16T03:28:24ZWe introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning in which no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves a 32% improvement in Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.2026-03-13T15:48:15ZWayner BarriosSouYoung Jin