https://arxiv.org/api/9KRe7fpY1KEdOEwNc38lmb4Nnn8 2026-07-17T23:25:00Z 9772 60 15 http://arxiv.org/abs/2601.08828v2 Motion Attribution for Video Generation 2026-07-03T04:56:49Z

Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
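As a rough illustration of what a motion-weighted loss mask could look like, the sketch below up-weights pixels that change between consecutive frames. This is an assumption-laden stand-in, not the Motive implementation: the abstract does not say how the masks are built, so frame differencing, the flattened-pixel layout, and the names `motion_mask` and `motion_weighted_loss` are all hypothetical.

```python
def motion_mask(prev_frame, next_frame, eps=1e-8):
    """Per-pixel weights proportional to absolute frame difference (assumed proxy for motion).

    Frames are flat lists of pixel intensities. Weights are scaled to average 1.
    """
    diffs = [abs(b - a) for a, b in zip(prev_frame, next_frame)]
    total = sum(diffs)
    n = len(diffs)
    if total < eps:
        return [1.0] * n  # static clip: fall back to uniform weights
    return [n * d / total for d in diffs]


def motion_weighted_loss(pred, target, mask):
    """Mean squared error in which each pixel's error is scaled by its motion weight."""
    return sum(m * (p - t) ** 2 for m, p, t in zip(mask, pred, target)) / len(mask)


# Toy clip: the last two pixels move, the first two are static.
prev_frame = [0.0, 0.0, 0.0, 0.0]
next_frame = [0.0, 0.0, 1.0, 1.0]
mask = motion_mask(prev_frame, next_frame)

# Errors confined to static pixels are ignored; errors on moving pixels are counted.
err_static = motion_weighted_loss([0.5, 0.5, 1.0, 1.0], next_frame, mask)
err_moving = motion_weighted_loss([0.0, 0.0, 0.5, 0.5], next_frame, mask)
```

In a gradient-based attribution setting, influence scores would then be derived from gradients of a loss like this one rather than from the unweighted loss; that step is not shown here.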

2026-01-13T18:59:09Z See the project website at https://research.nvidia.com/labs/sil/projects/MOTIVE/ Xindi Wu Despoina Paschalidou Jun Gao Antonio Torralba Laura Leal-Taixé Olga Russakovsky Sanja Fidler Jonathan Lorraine http://arxiv.org/abs/2607.02912v1 See the Emotion: A Facial Emoji Proxy Modeling for EEG Emotion Recognition 2026-07-03T03:18:02Z

Despite the high accuracy of EEG-based emotion recognition, existing models remain opaque "black boxes", lacking semantic grounding between abstract neural features and human-interpretable states. In this paper, we reframe EEG explainability as a cross-modal generation task, shifting the paradigm from feature attribution to behavioral visualization. We introduce Facial Emoji Proxy Modeling, a novel framework that translates high-dimensional EEG signals into identity-anonymized facial emojis. Guided by the neuroscientific inspiration of neural-facial association, this approach grounds neural representations in the manifold of observable facial dynamics. Technically, our framework integrates FMENet, a specialized backbone modeling expression-relevant spatial synergies, and the Facial Emoji Learning Branch (FELB), which treats emoji reconstruction as a structured semantic regularizer. Extensive experiments on EAV and MMER benchmarks demonstrate that our method achieves state-of-the-art accuracy among EEG-only models. Crucially, it generates semantically faithful facial animations that provide a transparent, privacy-preserving window into the brain's emotional evolution, effectively allowing users to "see the emotion" directly from neural signals. Code is available at https://github.com/xian-sh/SeeEmotion

2026-07-03T03:18:02Z Accepted by ICML 2026 Jingjing Hu Guo Dan Haofan Cheng Ying Zeng Zhan Si Jinxing Zhou Meng Wang http://arxiv.org/abs/2606.12555v2 AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation 2026-07-02T19:10:03Z

Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.

2026-06-10T18:06:27Z Zeyue Tian Lei Ke Zhaoyang Liu Ruibin Yuan Liumeng Xue Yujiu Yang Weijia Chen Xu Tan Qifeng Chen Wei Xue Yike Guo http://arxiv.org/abs/2602.22897v3 OmniGAIA: Towards Native Omni-Modal AI Agents 2026-07-02T15:39:17Z

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.

2026-02-26T11:35:04Z Xiaoxi Li Wenxiang Jiao Jiarui Jin Haoxuan Li Hao Wang Shijian Wang Guanting Dong Jiajie Jin Yinuo Wang Yuan Lu Ji-Rong Wen Zhicheng Dou Zhouchen Lin http://arxiv.org/abs/2607.01901v1 SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs 2026-07-02T08:58:47Z

Effective brain disease diagnosis requires the synergy of brain connectivity patterns and high-level semantic knowledge. Existing methods, however, largely treat semantics from large language models (LLMs) as auxiliary features or supervision, limiting their direct role in decision-making and constraining classification stability and robustness. To overcome this, we propose a semantic-aligned brain network framework that actively integrates LLM-derived semantics into the prediction process. Specifically, ROI-level semantics are first incorporated via global self-attention to enrich node representations and provide whole-brain context. Multi-scale hypergraphs are then constructed to explicitly model functional subnetworks and multi-ROI interactions, addressing the locality limitations of traditional GNNs and capturing high-order dependencies. Finally, a decision-level semantic alignment mechanism selectively injects patient-specific textual embeddings into graph representations, enabling semantics to directly guide predictions without perturbing the underlying network structure. Experiments on public brain network datasets ABIDE and ADHD-200 demonstrate state-of-the-art performance, enhanced stability, and improved interpretability, particularly in small-sample settings.

2026-07-02T08:58:47Z Accepted to IEEE International Conference on Multimedia and Expo (ICME) 2026; Yidan Xu Xiangmin Han Rundong Xue Huihui Ye http://arxiv.org/abs/2511.01390v2 SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment 2026-07-02T07:36:28Z

Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Additionally, it leverages relevance-aware selection with mean value computation to highlight crucial patch-word correspondences, thereby improving cross-modal similarity assessment. Comprehensive experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance, surpassing existing approaches by 23%-86% in rSum across diverse model architectures, with notable enhancements in text-to-image retrieval scenarios. Our implementation is available at https://github.com/Sweet4tars/seps.git.

2025-11-03T09:41:32Z Xinyu Mao Junsi Li Haoji Zhang Yu Liang Ming Sun http://arxiv.org/abs/2607.01395v1 Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence 2026-07-01T18:54:00Z

At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model's generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.

2026-07-01T18:54:00Z Ph.D. dissertation, National Yang Ming Chiao Tung University, 2026. arXiv admin note: substantial text overlap with arXiv:2602.14771 Shih-Fang Chen http://arxiv.org/abs/2607.02089v1 ESC: Emotional Self-Correction for Reliable Vision-Language Models 2026-07-01T14:25:43Z

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. We find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning. Motivated by this finding, we propose ESC (Emotional Self-Correction), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage the model to reflect and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. We therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction. Our project is publicly available at https://genai4e.github.io/ESC/.

2026-07-01T14:25:43Z ECCV Main Track 2026 (113 pages, 15 tables, 65 figures). Project Page: https://genai4e.github.io/ESC/? Tien-Huy Nguyen Minh-Nhat Nguyen Nguyen Nhat Huy Hung Viet Nguyen Huy Nguyen Minh Nhat Thanh-Huy Nguyen Cuong Tuan Nguyen Hoang M. Le Dat Nguyen Phat Kim Huynh Min Xu Ulas Bagci http://arxiv.org/abs/2607.00802v1 CellPrior-Net: Prior-Guided Nuclei Detection and Classification for H&E Whole-Slide Images 2026-07-01T11:31:09Z

Accurate nuclei detection and classification in hematoxylin and eosin (H&E) whole-slide images (WSIs) is a key task in computational pathology, particularly for quantitative analysis of the tumor microenvironment. However, this task remains highly challenging due to variations in nuclei morphology, staining procedures, scanners, organs, magnifications, and WSI artifacts. In addition, many existing pipelines rely on computationally demanding architectures and post-processing procedures, making gigapixel WSI analysis time-consuming. This work proposes CellPrior-Net (CP-Net), an efficient nuclei detection and classification pipeline that utilizes a lightweight convolutional neural network architecture and the hematoxylin (H) channel as prior information to enhance nuclei-aware feature learning. Extensive benchmarking was conducted against state-of-the-art pipelines on 8 public and private datasets (total: 10.4M nuclei) obtained from different organs, scanners, magnifications, and clinical centers. Experimental results demonstrate that CP-Net achieves comparable performance while significantly reducing inference time. Furthermore, CellQuant Net is introduced, an end-to-end nuclei quantification pipeline that integrates a quality assessment (QA) model to exclude regions with artifacts, followed by CP-Net cell detection and classification. The pipeline is publicly available on GitHub and provides a potentially efficient and scalable framework for downstream computational pathology applications.

2026-07-01T11:31:09Z Submitted to Intelligence-Based Medicine Journal Falah Jabar Pasquale Lombardi Aria Torkpour Masoud Tafavvoghi Per Niklas Benzler Waaler Sigve Andersen Erna-Elise Paulsen Mette Pøhl Lill-Tove Rasmussen Busund Tom Donnem Elin Richardsen David J. Pinato Mehrdad Rakaee http://arxiv.org/abs/2607.00712v1 Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption 2026-07-01T09:59:28Z

Autoregressive (AR) streaming models have emerged as a powerful paradigm for long video generation. However, the linearly growing Key-Value (KV) cache poses a significant bottleneck, leading to memory overload and degraded inference throughput. A common compression method is to drop redundant KV tokens, which often breaks long-range dependencies, resulting in temporal flickering and identity loss. In this paper, we propose Instance-Specific Parametric Absorption (ISPA), a novel framework that shifts the KV cache compression from discarding to distilling. The core idea is to transition a subset of layers from Full-Attention (F-Layers) to memory-efficient Local-Attention (L-Layers) by "absorbing" historical context into the model's weights. Specifically, during a brief warmup phase, ISPA monitors the output discrepancy between global and local attention. At the transition point, we solve a closed-form least-squares problem to compute an instance-specific weight modulation that compensates for the missing history. Experiments across architectures (1.3B to 14B) demonstrate that ISPA can remove up to 50% of the KV cache with near-lossless visual quality. We hope this perspective encourages future work to explore parametric memory consolidation beyond external token-level cache management for streaming generative models.
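To make the closed-form least-squares step concrete, here is a minimal sketch that fits a per-channel scale mapping local-attention outputs onto full-attention outputs. It is an illustration under stated assumptions, not the ISPA method itself: the abstract does not specify the form of the weight modulation, so the diagonal (per-channel) restriction, the list-of-rows layout, and the name `fit_channel_scales` are hypothetical.

```python
def fit_channel_scales(local_out, global_out, eps=1e-12):
    """Per-channel least squares.

    For each channel j, choose s_j minimizing
        sum_i (s_j * local_out[i][j] - global_out[i][j]) ** 2,
    whose closed-form solution is s_j = <x_j, y_j> / <x_j, x_j>.
    `eps` guards against an all-zero channel.
    """
    n_channels = len(local_out[0])
    scales = []
    for j in range(n_channels):
        xy = sum(x[j] * y[j] for x, y in zip(local_out, global_out))
        xx = sum(x[j] * x[j] for x in local_out)
        scales.append(xy / (xx + eps))
    return scales


# Hypothetical warmup statistics: 4 tokens, 2 channels.
local_out = [[1.0, 2.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
true_scales = [1.5, 0.8]
# Synthetic "full attention" outputs that differ from local ones by an exact per-channel scale.
global_out = [[true_scales[j] * row[j] for j in range(2)] for row in local_out]

scales = fit_channel_scales(local_out, global_out)
```

Because the synthetic gap is exactly a per-channel scale, the fit recovers `true_scales` up to floating-point error; with real attention outputs the fit would only be the best approximation in the least-squares sense.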

2026-07-01T09:59:28Z ECCV 2026 Camera Ready Xiaomeng Fu Jia Li Yiming Hu Yong Wang Hayden Kwok-Hay So Jiao Dai Xiangxiang Chu Jizhong Han http://arxiv.org/abs/2607.00576v1 Safe Alone, Unsafe Together: Safeguarding Against Implicit Toxicity When Benign Images Combine 2026-07-01T07:59:45Z

Multi-image content has become an increasingly prevalent form of visual communication in social media, giving rise to a new safety issue, multi-image implicit toxicity (MIIT), where each image appears benign in isolation, but harmful semantics emerge when the images are interpreted jointly. MIIT is particularly challenging for existing commercial moderation APIs and models due to the lack of explicit risky cues in each image. This paper aims to study how to identify MIIT. We first provide a formal definition of MIIT and analyze three key challenges for its detection. To alleviate the scarcity of data in this area, we construct MIIT-dataset, an image-only multi-image safety dataset covering seven representative risk categories through an automatic generation pipeline. Finally, we train MiShield with progressively distilled reasoning supervision, enabling it to produce safety judgments accompanied by explicit analyses of the correlated entities that result in the hazards. Experiments show that MiShield-8B models outperform representative moderation services and even larger-scale models, revealing their effectiveness and practical value for this widely used visual format. Warning: This paper contains potentially sensitive content.

2026-07-01T07:59:45Z 15 pages, 8 figures Jiaxian Lv Shiyao Cui Yingkang Wang Guoxin Wu Qingling Zhang Minlie Huang http://arxiv.org/abs/2604.01654v2 Moiré Video Authentication: A Physical Signature Against AI Video Generation 2026-07-01T06:41:33Z

Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
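The verification idea (extract two signals, then test whether they are linearly coupled) can be sketched with a plain Pearson correlation check. This is a simplified illustration, not the authors' verifier: the signal extraction is omitted, the phase is assumed to be already unwrapped, and the threshold of 0.9 and the name `looks_physically_coupled` are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # a constant signal carries no evidence of coupling
    return cov / math.sqrt(vx * vy)


def looks_physically_coupled(phase, displacement, threshold=0.9):
    """Accept a clip when fringe phase and grating displacement are strongly linearly related."""
    return abs(pearson(phase, displacement)) >= threshold


rng = random.Random(0)
displacement = [0.1 * t for t in range(50)]
# Synthetic "real camera" case: phase is a linear function of displacement plus small noise.
coupled_phase = [2.0 * d + rng.gauss(0.0, 0.01) for d in displacement]
# Synthetic "generated video" case: phase varies independently of displacement.
uncoupled_phase = [rng.uniform(-1.0, 1.0) for _ in displacement]

real_ok = looks_physically_coupled(coupled_phase, displacement)
fake_ok = looks_physically_coupled(uncoupled_phase, displacement)
```

A fixed threshold is only a placeholder; a real verifier would presumably calibrate its decision rule on measured correlation signatures.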

2026-04-02T05:52:43Z Accepted to ECCV 2026. Project page and code: https://yuanqing-ai.github.io/physical_video_signature/ Yuan Qing Kunyu Zheng Lingxiao Li Boqing Gong Chang Xiao http://arxiv.org/abs/2606.01825v2 ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search 2026-07-01T05:50:06Z

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.

2026-06-01T07:41:44Z 12 pages, 5 figures Chaodong Jia Zequn Xie Xibei Jia Sihang Cai Shulei Wang Tao Jin http://arxiv.org/abs/2607.00374v1 Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval 2026-07-01T03:20:06Z

Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks trained on image-text pairs. However, existing proxy tasks primarily enhance visual and textual representations to accommodate a predefined composition mechanism such as pseudo-word injection into a frozen text encoder or linear feature arithmetic. As a result, the composition function itself remains unlearned, limiting the model's ability to express diverse and fine-grained semantic modifications. To address this, we propose FoCo, which models composition as two coordinated stages: focusing on modification-relevant visual content, and then completing the target semantics. We realize these through two proxy tasks: text-anchored visual aggregation to selectively gather visual content guided by localized textual semantics, and context-conditioned semantic completion to transform these aggregated visuals with the remaining scene context into a coherent composed representation. The tasks are trained jointly with a cross-instance contrastive objective, encouraging semantic diversity and discouraging shortcut composition strategies. Extensive experiments on four ZS-CIR benchmarks show FoCo's state-of-the-art performance and improved generalization.

2026-07-01T03:20:06Z Accepted by ECCV 2026 Jingjing Zhang Lei Zhang Zheren Fu Zhendong Mao http://arxiv.org/abs/2606.31225v2 A First Exploration of Neuromorphic OT-CFM for Multi-Speaker VSR 2026-07-01T02:10:40Z

Visual Speech Recognition (VSR) tasks in complex multi-speaker scenarios are severely hindered by rapid head motions, occlusions, and subtle lip articulations. Traditional RGB-based methods struggle here due to low frame rates and motion blur. To overcome these, we propose LipsFlow, a neuromorphic-inspired VSR framework that converts RGB videos into high-temporal-resolution event streams. For multi-speaker scenes, we employ ByteTrack tracking and TalkNet active speaker detection to temporally segment scenes into single-speaker clips, enabling focused per-speaker analysis. By explicitly capturing microsecond-level articulatory dynamics via learnable event-based representations, LipsFlow achieves inherent robustness against visual degradation. To efficiently model these dense event-based features and adapt to speaker-specific articulatory patterns, we introduce Optimal Transport Conditional Flow Matching (OT-CFM). It enforces deterministic, straight-line trajectory generation in a semantic latent space, slashing inference latency to just two Ordinary Differential Equation (ODE) steps. Furthermore, we design a Dual-Level Semantic Supervision mechanism combining token-level BERT weight tying and sentence-level priors to resolve homophene ambiguities. Validated on competitive benchmarks, LipsFlow achieves a state-of-the-art WER of 22.3% at 240 ms latency, establishing a highly robust and efficient paradigm for event-based VSR.
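The link between straight-line trajectories and few-step sampling can be shown with a small sketch: when the velocity along the path is constant, Euler integration reaches the endpoint exactly, even with two steps. This is only an illustration of that property, not LipsFlow itself: the optimal-transport pairing and the learned velocity network are omitted, and `ideal_field` is a hypothetical stand-in for a trained model.

```python
def straight_line_point(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for flow matching on a straight path: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, num_steps=2):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0 = [0.0, 1.0, -2.0]   # source sample (e.g. noise), values chosen for illustration
x1 = [4.0, -1.0, 2.0]   # paired target latent
v = target_velocity(x0, x1)


def ideal_field(x, t):
    # Stand-in for a perfectly trained network: constant velocity along the straight path.
    return v


midpoint = straight_line_point(x0, x1, 0.5)
x_end = euler_integrate(x0, ideal_field, num_steps=2)
```

With a curved trajectory the same two-step integrator would accumulate discretization error, which is why straightness matters for low-latency inference.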

2026-06-30T07:05:48Z Accepted to ECCV 2026 Lin Chen Jingping Fang Hairui Liu Chenyang Xu Junhao Chen Xiaorui Li Weidong Cai Xiaoming Chen