https://arxiv.org/api/9KRe7fpY1KEdOEwNc38lmb4Nnn8 2026-03-24T11:41:37Z 9246 60 15 http://arxiv.org/abs/2603.12146v1 FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance 2026-03-12T16:45:53Z Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency. 2026-03-12T16:45:53Z Accepted by CVPR2026 Quanhao Li Zhen Xing Rui Wang Haidong Cao Qi Dai Daoguo Dong Zuxuan Wu http://arxiv.org/abs/2603.11947v1 Resurfacing Paralinguistic Awareness in Large Audio Language Models 2026-03-12T13:56:42Z Large Audio Language Models (LALMs) have expanded the interaction with human to speech modality, which introduces great interactive potential, due to the paralinguistic cues implicitly indicating the user context. However, building on the current content-centred paradigm, LALMs usually neglect such paralinguistic cues and respond solely based on query content. In this work, to resurface the paralinguistic awareness in LALMs, we introduce five diverse layer-wise analyses to jointly identify paralinguistic layers and semantic understanding layers. Based on these insights, we propose a paralinguistic-enhanced fine-tuning (PE-FT) protocol accordingly to equip LALMs with paralinguistic-aware capabilities, including (1) selective-layer fine-tuning, and (2) an auxiliary dual-level classification head. Our experiments demonstrate that PE-FT protocol efficiently and effectively resurfaces the paralinguistic awareness, even surpassing the performance of the all-layer fine-tuning strategy. 2026-03-12T13:56:42Z Submitted to Interspeech 2026 Hao Yang Minghan Wang Tongtong Wu Lizhen Qu Ehsan Shareghi Gholamreza Haffari http://arxiv.org/abs/2603.11876v1 On the Possible Detectability of Image-in-Image Steganography 2026-03-12T12:51:56Z This paper investigates the detectability of popular imagein-image steganography schemes [1, 2, 3, 4, 5]. In this paradigm, the payload is usually an image of the same size as the Cover image, leading to very high embedding rates. We first show that the embedding yields a mixing process that is easily identifiable by independent component analysis. We then propose a simple, interpretable steganalysis method based on the first four moments of the independent components estimated from the wavelet decomposition of the images, which are used to distinguish between the distributions of Cover and Stego components. Experimental results demonstrate the efficiency of the proposed method, with eight-dimensional input vectors attaining up to 84.6% accuracy. This vulnerability analysis is supported by two other facts: the use of keyless extraction networks and the high detectability w.r.t. classical steganalysis methods, such as the SRM combined with support vector machines, which attains over 99% accuracy. 2026-03-12T12:51:56Z Antoine Mallet CRIStAL Patrick Bas CRIStAL http://arxiv.org/abs/2508.10745v2 Agentic Design Review System 2026-03-12T11:37:40Z Evaluating graphic designs involves assessing it from multiple facets like alignment, composition, aesthetics and color choices. Evaluating designs in a holistic way involves aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), where multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method plays central role towards making each agent design aware. Towards evaluating this framework, we propose DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed-up with critical ablation experiments brings out the efficacy of Agentic-DRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic, yet under-explored research direction. 2025-08-14T15:29:24Z Project Page: https://sayannag.github.io/AgenticDRS Sayan Nag K J Joseph Koustava Goswami Vlad I Morariu Balaji Vasan Srinivasan http://arxiv.org/abs/2501.15177v2 Audio-Language Models for Audio-Centric Tasks: A Systematic Survey 2026-03-12T05:37:23Z Audio-Language Models (ALMs), trained on paired audio-text data, are designed to process, understand, and reason about audio-centric multimodal content. Unlike traditional supervised approaches that use predefined labels, ALMs leverage natural language supervision to better handle complex real-world audio scenes with multiple overlapping events. While demonstrating impressive zero-shot and task generalization capabilities, there is still a notable lack of systematic surveys that comprehensively organize and analyze developments. In this paper, we present the first systematic review of ALMs with three main contributions: (1) comprehensive coverage of ALM works across speech, music, and sound from a general audio perspective; (2) a unified taxonomy of ALM foundations, including model architectures and training objectives; (3) establishment of a research landscape capturing mutual promotion and constraints among different research aspects, aiding in summarizing evaluations, limitations, concerns and promising directions. Our review contributes to helping researchers understand the development of existing technologies and future trends, while also providing valuable references for implementation in practical applications. 2025-01-25T11:15:06Z Under review Yi Su Jisheng Bai Qisheng Xu Kele Xu Yong Dou http://arxiv.org/abs/2603.11468v1 Stage-Adaptive Reliability Modeling for Continuous Valence-Arousal Estimation 2026-03-12T02:45:41Z Continuous valence-arousal estimation in real-world environments is challenging due to inconsistent modality reliability and interaction-dependent variability in audio-visual signals. Existing approaches primarily focus on modeling temporal dynamics, often overlooking the fact that modality reliability can vary substantially across interaction stages. To address this issue, we propose SAGE, a Stage-Adaptive reliability modeling framework that explicitly estimates and calibrates modality-wise confidence during multimodal integration. SAGE introduces a reliability-aware fusion mechanism that dynamically rebalances audio and visual representations according to their stage-dependent informativeness, preventing unreliable signals from dominating the prediction process. By separating reliability estimation from feature representation, the proposed framework enables more stable emotion estimation under cross-modal noise, occlusion, and varying interaction conditions. Extensive experiments on the Aff-Wild2 benchmark demonstrate that SAGE consistently improves concordance correlation coefficient scores compared with existing multimodal fusion approaches, highlighting the effectiveness of reliability-driven modeling for continuous affect prediction. 2026-03-12T02:45:41Z 8 pages, 3 figures, 2 pages Yubeen Lee Sangeun Lee Junyeop Cha Eunil Park http://arxiv.org/abs/2603.11042v1 V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation 2026-03-11T17:59:40Z Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/ 2026-03-11T17:59:40Z Project page: https://genjib.github.io/v2m_zero/ Yan-Bo Lin Jonah Casebeer Long Mai Aniruddha Mahapatra Gedas Bertasius Nicholas J. Bryan http://arxiv.org/abs/2603.11031v1 Chasing RATs: Tracing Reading for and as Creative Activity 2026-03-11T17:54:26Z Creativity research has privileged making over the interpretive labor that precedes and shapes it. We introduce Reading Activity Traces (RATs), a proposal that treats reading -- broadly defined to include navigating, interpreting, and curating media across interconnected sources -- as creative activity both for future artifacts and as a form of creation in its own right. By tracing trajectories of traversal, association, and reflection as inspectable artifacts, RATs render visible the creative work that algorithmic feeds and AI summarization increasingly compress and automate away. We illustrate this through WikiRAT, a speculative instantiation on Wikipedia, and open new ground for reflective practice, reader modeling, collective sensemaking, and understanding what is lost when human interpretation is automated -- towards designing intelligent tools that preserve it. 2026-03-11T17:54:26Z Sophia Liu Shm Garanganao Almeda http://arxiv.org/abs/2603.11147v1 Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints 2026-03-11T17:30:30Z Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing method for archiving requires extensive manual effort. We address this by automating the most labour intensive part of the workflow: catalogue style metadata curation for in gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains. 2026-03-11T17:30:30Z Minsak Nanang Adrian Hilton Armin Mustafa http://arxiv.org/abs/2601.13879v4 Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring 2026-03-11T15:28:13Z While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines over 30\% on the DocVQA. 2026-01-20T11:45:38Z Dongxu Zhang Yiding Sun Cheng Tan Wenbiao Yan Ning Yang Jihua Zhu Haijun Zhang http://arxiv.org/abs/2512.21944v3 Data relativistic uncertainty framework for low-illumination anime scenery image enhancement 2026-03-11T11:14:16Z By contrast with the prevailing works of low-light enhancement in natural images and videos, this study copes with the low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of uncertainty information inherent with the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea from Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate the model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of DRU framework by training several versions of EnlightenGANs, yielding superior perceptual and aesthetic qualities beyond the state-of-the-art methods that are incapable of learning from data uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available. 2025-12-26T09:43:24Z Add data Yiquan Gao John See http://arxiv.org/abs/2603.10551v1 P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video 2026-03-11T08:59:53Z Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy can gain up to 1.9 dB improvement in PSNR for video and 2.6 dB improvement in PSNR for image when compared to methods that perform sequential layer-wise training. Project page: https://longanwang-cs.github.io/PGSVC-webpage/ 2026-03-11T08:59:53Z MMSys 2026; Project Website: see https://longanwang-cs.github.io/PGSVC-webpage/ Longan Wang Yuang Shi Wei Tsang Ooi http://arxiv.org/abs/2603.11095v1 Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition 2026-03-11T07:46:25Z Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion. 2026-03-11T07:46:25Z 5 pages, 3 figures, accepted to ICASSP 2026 Inyong Koo yeeun Seong Minseok Son Jaehyuk Jang Changick Kim http://arxiv.org/abs/2603.10468v1 G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition 2026-03-11T06:40:01Z We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives. 2026-03-11T06:40:01Z submitted to Interspeech 2026 Jing Peng Ziyi Chen Haoyu Li Yucheng Wang Duo Ma Mengtian Li Yunfan Du Dezhu Xu Kai Yu Shuai Wang http://arxiv.org/abs/2603.11089v1 V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation 2026-03-11T05:56:14Z This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core innovations: (1) AudioScore-a comprehensive human preference-aligned scoring system for assessing semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference pair data for DPO optimization; (3) a curriculum learning-empowered DPO optimization strategy specifically tailored for flow-based generative models. Experiments on benchmark VGGSound dataset demonstrate that human-preference aligned Frieren and MMAudio using V2A-DPO outperform their counterparts optimized using Denoising Diffusion Policy Optimization (DDPO) as well as pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models. 2026-03-11T05:56:14Z Accepted at ICASSP2026 Nolan Chan Timmy Gang Yongqian Wang Yuzhe Liang Dingdong Wang