https://arxiv.org/api/lBKnlgLfsRT/wyVLcVtJxcGuTkE 2026-03-24T10:13:05Z 9246 45 15 http://arxiv.org/abs/2603.13099v2 Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation 2026-03-16T03:28:24Z We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning in which no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves a 32% improvement in Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation. 2026-03-13T15:48:15Z Wayner Barrios SouYoung Jin http://arxiv.org/abs/2602.14771v4 GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture 2026-03-15T17:21:04Z The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness. 2026-02-16T14:26:07Z Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking IEEE Transactions on Circuits and Systems for Video Technology 2026 Shih-Fang Chen Jun-Cheng Chen I-Hong Jhuo Yen-Yu Lin 10.1109/TCSVT.2026.3675005 http://arxiv.org/abs/2603.15685v1 DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression 2026-03-15T15:22:06Z Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH. 2026-03-15T15:22:06Z Bingzhou Li Tao Huang http://arxiv.org/abs/2603.14426v1 GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos 2026-03-15T15:09:57Z Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on huggingface.co. 2026-03-15T15:09:57Z Minghan Li Tongna Chen Tianrui Lv Yishuai Zhang Suchao An Guodong Zhou http://arxiv.org/abs/2602.11547v2 H.265/HEVC Video Steganalysis Based on CU Block Structure Gradients and IPM Mapping 2026-03-15T09:37:37Z Existing H.265/HEVC video steganalysis research mainly focuses on detecting the steganography based on motion vectors, intra prediction modes, and transform coefficients. However, there is currently no effective steganalysis method capable of detecting steganography based on Coding Unit (CU) block structure. To address this issue, we propose, for the first time, a H.265/HEVC video steganalysis algorithm based on CU block structure gradients and intra prediction mode mapping. The proposed method first constructs a new gradient map to explicitly describe changes in CU block structure, and combines it with a block level mapping representation of IPM. It can jointly model the structural perturbations introduced by steganography based on CU block structure. Then, we design a novel steganalysis network called GradIPMFormer, whose core innovation is an integrated architecture that combines convolutional local embedding with Transformer-based token modeling to jointly capture local CU boundary perturbations and long-range cross-CU structural dependencies, thereby effectively enhancing the capability to perceive CU block structure embedding. Experimental results show that under different quantization parameters and resolution settings, the proposed method consistently achieves superior detection performance across multiple steganography methods based on CU block structure. This study provides a new CU block structure steganalysis paradigm for H.265/HEVC and has significant research value for covert communication security detection. 2026-02-12T04:14:06Z Xiang Zhang Haiyang Xia Ziwen He Wenbin Huang Fei Peng Zhangjie Fu http://arxiv.org/abs/2507.08801v2 Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective 2026-03-15T06:33:57Z Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive (AR) video generation. Existing AR video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an LLM-based unified model for AR video generation with efficient discrete diffusion. Firstly, to fit videos with LLMs, we identify that 1D RoPE is ill-suited for visual spatiotemporal correlation modeling, and while demonstrated to be useful, naive 3D RoPE exhibits imbalanced frequency spectra. Therefore, we propose MM-RoPE, which preserves the original textual RoPE while seamlessly accommodating video data with comprehensive frequency spectra and scaled 3D positions. Secondly, to fit the video data's nature and overcome the inefficiency of next-token decoding, we adopt a parallel and mask-based discrete diffusion with the intra-frame bidirectional and inter-frame causal attention masks. Based on this attention mask, we uncover the frame-wise loss imbalance issue caused by spatial information redundancy and propose Autoregressive Discrete Diffusion Forcing, which introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. Despite using only 48 GPUs for pre-training and fine-tuning, limited data and a discrete tokenizer, Lumos-1 achieves results surpassing those of Show-o2 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos. 2025-07-11T17:59:42Z ICLR 2026 Camera Ready Version. Code and Models: https://github.com/alibaba-damo-academy/Lumos Hangjie Yuan Weihua Chen Jun Cen Hu Yu Jingyun Liang Shuning Chang Zhihui Lin Tao Feng Pengwei Liu Jiazheng Xing Hao Luo Jiasheng Tang Fan Wang Yi Yang http://arxiv.org/abs/2603.14238v1 Domain-Skewed Federated Learning with Feature Decoupling and Calibration 2026-03-15T06:20:22Z Federated learning (FL) allows distributed clients to collaboratively train a global model in a privacy-preserving manner. However, one major challenge is domain skew, where clients' data originating from diverse domains may hinder the aggregated global model from learning a consistent representation space, resulting in poor generalizable ability in multiple domains. In this paper, we argue that the domain skew is reflected in the domain-specific biased features of each client, causing the local model's representations to collapse into a narrow low-dimensional subspace. We then propose Federated Feature Decoupling and Calibration ($F^2$DC), which liberates valuable class-relevant information by calibrating the domain-specific biased features, enabling more consistent representations across domains. A novel component, Domain Feature Decoupler (DFD), is first introduced in $F^2$DC to determine the robustness of each feature unit, thereby separating the local features into domain-robust features and domain-related features. A Domain Feature Corrector (DFC) is further proposed to calibrate these domain-related features by explicitly linking discriminative signals, capturing additional class-relevant clues that complement the domain-robust features. Finally, a domain-aware aggregation of the local models is performed to promote consensus among clients. Empirical results on three popular multi-domain datasets demonstrate the effectiveness of the proposed $F^2$DC and the contributions of its two modules. Code is available at https://github.com/mala-lab/F2DC. 2026-03-15T06:20:22Z Accepted at CVPR 2026 Huan Wang Jun Shen Jun Yan Guansong Pang http://arxiv.org/abs/2512.09824v2 Composing Concepts from Images and Videos via Concept-prompt Binding 2026-03-14T08:16:04Z Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity. 2025-12-10T16:57:31Z CVPR 2026. Project page: https://refkxh.github.io/BiCo_Webpage/ Xianghao Kong Zeyu Zhang Yuwei Guo Zhuoran Zhao Songchun Zhang Anyi Rao http://arxiv.org/abs/2603.13739v1 UniVid: Pyramid Diffusion Model for High Quality Video Generation 2026-03-14T03:51:16Z Diffusion-based text-to-video generation (T2V) or image-to-video (I2V) generation have emerged as a prominent research focus. However, there exists a challenge in integrating the two generative paradigms into a unified model. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects' appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image clues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model for generating temporally coherent frames via introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism, whose attention scores can be freely re-weighted for interpolation of between single and two modalities controls during inference. Extensive experiments showcase that our UniVid achieves superior temporal coherence on T2V, I2V and (T+I)2V tasks. 2026-03-14T03:51:16Z Xinyu Xiao Binbin Yang Tingtian Li Yipeng Yu Sen Lei http://arxiv.org/abs/2603.13639v1 Adaptive Virtual Reality Museum: A Closed-Loop Framewor for Engagement-Aware Cultural Heritage 2026-03-13T22:46:19Z Static information presentation in VR cultural heritage often causes cognitive overload or under-stimulation. We introduce a closed-loop adaptive interface that tailors content depth to real-time visitor behavior through implicit multimodal sensing. Our approach continuously monitors gaze dwell, head kinematics, and locomotion to infer engagement via a transparent rule-based classifier, which drives a Large Language Model to dynamically modulate explanation complexity without interrupting exploration. We implemented a proof-of-concept in the Berat Ethnographic Museum and conducted a preliminary evaluation (N=16) comparing adaptive versus static content. Results indicate that adaptive participants demonstrated 2-3x increases in reading engagement and exploration time while maintaining high usability (SUS = 84.3). Technical validation confirmed sub-millisecond engagement inference latency on consumer VR hardware. These preliminary findings warrant larger-scale investigation and raise questions about engagement validation, AI transparency, and generative models in heritage contexts. We present this work-in-progress to spark discussion about implicit AI-driven adaptation in immersive cultural experiences. 2026-03-13T22:46:19Z 15 pages, 3 figures Joseph Damouni Wadia Tanus Naomi Unkelos-Shpigel http://arxiv.org/abs/2603.13597v1 DQ-Ladder: A Deep Reinforcement Learning-based Bitrate Ladder for Adaptive Video Streaming 2026-03-13T21:13:00Z Adaptive streaming of segmented video over HTTP typically relies on a predefined set of bitrate-resolution pairs, known as a bitrate ladder. However, fixed ladders often overlook variations in content and decoding complexities, leading to suboptimal trade-offs between encoding time, decoding efficiency, and video quality. This article introduces DQ-Ladder, a deep reinforcement learning (DRL)-based scheme for constructing time- and quality-aware bitrate ladders for adaptive video streaming applications. DQ-Ladder employs predicted decoding time, quality scores, and bitrate levels per segment as inputs to a Deep Q-Network (DQN) agent, guided by a weighted reward function of decoding time, video quality, and resolution smoothness. We leverage machine learning models to predict decoding time, bitrate level, and objective quality metrics (VMAF, XPSNR), eliminating the need for exhaustive encoding or quality metric computation. We evaluate DQ-Ladder using the Versatile Video Coding (VVC) toolchain (VVenC/VVdeC) on 750 video sequences across six Apple HLS-compliant resolutions and 41 quantization parameters. Experimental results against four baselines show that DQ-Ladder achieves BD-rate reductions of at least 10.3% for XPSNR compared to the HLS ladder, while reducing decoding time by 22%. DQ-Ladder shows significantly lower sensitivity to prediction errors than competing methods, remaining robust even with up to 20% noise. 2026-03-13T21:13:00Z Adaptive Video Streaming, Deep Reinforcement Learning, Q-Learning, Bitrate Ladder, Quality Prediction Reza Farahani Zoha Azimi Vignesh V Menon Hermann Hellwagner Radu Prodan Schahram Dustdar Christian Timmerer http://arxiv.org/abs/2603.12949v1 Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking 2026-03-13T12:46:27Z Robust invisible watermarks are widely used to support copyright protection, content provenance, and accountability by embedding hidden signals designed to survive common post-processing operations. However, diffusion-based image editing introduces a fundamentally different class of transformations: it injects noise and reconstructs images through a powerful generative prior, often altering semantic content while preserving photorealism. In this paper, we provide a unified theoretical and empirical analysis showing that non-adversarial diffusion editing can unintentionally degrade or remove robust watermarks. We model diffusion editing as a stochastic transformation that progressively contracts off-manifold perturbations, causing the low-amplitude signals used by many watermarking schemes to decay. Our analysis derives bounds on watermark signal-to-noise ratio and mutual information along diffusion trajectories, yielding conditions under which reliable recovery becomes information-theoretically impossible. We further evaluate representative watermarking systems under a range of diffusion-based editing scenarios and strengths. The results indicate that even routine semantic edits can significantly reduce watermark recoverability. Finally, we discuss the implications for content provenance and outline principles for designing watermarking approaches that remain robust under generative image editing. 2026-03-13T12:46:27Z Preprint Qian Qi Jiangyun Tang Jim Lee Emily Davis Finn Carter http://arxiv.org/abs/2510.27475v2 Referee: Reference-aware Audiovisual Deepfake Detection 2026-03-13T08:20:05Z Deepfakes generated by advanced generative models have rapidly posed serious threats, yet existing audiovisual deepfake detection approaches struggle to generalize to unseen manipulation methods. To address this, we propose a novel reference-aware audiovisual deepfake detection method, called Referee to capture fine-grained identity discrepancies. Unlike existing methods that overfit to transient spatiotemporal artifacts, Referee employs identity bottleneck and matching modules to model the relational consistency of speaker-specific cues captured by a single one-shot example as a biometric anchor. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art results on cross-dataset and cross-language evaluation protocols, including a 99.4% AUC on KoDF. These results highlight that explicitly correlating reference-based biometric priors is a key frontier for achieving generalized and reliable audiovisual forensics. The code is available at https://github.com/ewha-mmai/referee. 2025-10-31T13:54:33Z In Progress Hyemin Boo Eunsang Lee Jiyoung Lee http://arxiv.org/abs/2603.11647v2 OmniForcing: Unleashing Real-time Joint Audio-Visual Generation 2026-03-13T04:45:19Z Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at $\sim$25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher.\textbf{Project Page:} \href{https://omniforcing.com}{https://omniforcing.com} 2026-03-12T08:17:36Z 14 pages Yaofeng Su Yuming Li Zeyue Xue Jie Huang Siming Fu Haoran Li Ying Li Zezhong Qian Haoyang Huang Nan Duan http://arxiv.org/abs/2601.00150v3 FCMBench: The First Large-scale Financial Credit Multimodal Benchmark for Real-world Applications 2026-03-13T04:16:11Z FCMBench is the first large-scale and privacy-compliant multimodal benchmark for real-world financial credit applications, covering tasks and robustness challenges from domain specific workflows and constraints. The current version of FCMBench covers 26 certificate types, with 5198 privacy-compliant images and 13806 paired VQA samples. It evaluates models on Perception and Reasoning tasks under real-world Robustness interferences, including 3 foundational perception tasks, 4 credit-specific reasoning tasks demanding decision-oriented visual evidence interpretation, and 10 real-world challenges for rigorous robustness stress testing. Moreover, FCMBench offers privacy-compliant realism with minimal leakage risk through in-house scenario-aware captures of manually synthesized templates, without any publicly released images. We conduct extensive evaluations of 28 state-of-the-art vision-language models spanning 14 AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 score as a commercial model (65.16), Kimi-K2.5 achieves the best score as an open-source baseline (60.58). The mean and the std. of all tested models is 44.8 and 10.3 respectively, indicating that FCMBench is non-trivial and provides strong resolution for separating modern vision-language model capabilities. Robustness evaluations reveal that even top-performing models experience notable performance degradation under the designed challenges. We have open-sourced this benchmark to advance AI research in the credit domain and provide a domain-specific task for real-world AI applications. 2026-01-01T00:42:54Z Yehui Yang Dalu Yang Fangxin Shang Wenshuo Zhou Jie Ren Yifan Liu Haojun Fei Qing Yang Yanwu Xu Tao Chen