Local Multimodal Music Alignment from Global Supervision

2026-07-10T22:59:53Z

Understanding music requires understanding localized relationships across data modalities, e.g., how time in performance audio maps onto position in a score image. Yet supervision for such local correspondences is difficult to obtain: in practice, we often only have access to coarser global supervision like paired segments of audio and images. To address this gap, we propose FuSiLi (Fused Sinkhorn-Localized Similarity), a similarity score for multimodal contrastive learning operating directly on local image patch and audio frame features via Sinkhorn-based soft alignment. We show that FuSiLi (i) effectively learns local relationships, (ii) requires only global supervision, and (iii) retains the global alignment capabilities of conventional contrastive approaches. We fine-tune pretrained CLIP and CLAP encoders on pairs of raw sheet music images and audio using a hybrid contrastive objective combining FuSiLi with conventional global similarity. We evaluate on cross-modal retrieval and frame-level alignment tasks against a range of global and local baselines, showing that our approach outperforms them on local alignment while remaining competitive on retrieval.
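
As a rough illustration of Sinkhorn-based soft alignment between local features, the sketch below scores one image/audio pair from patch and frame vectors. It is not the FuSiLi definition from the paper: the cosine similarity, the uniform marginals, the regularization strength `eps`, and the use of the transport-weighted similarity as the score are all assumptions made for this example, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sinkhorn_plan(sim, eps=0.1, iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n, m = len(sim), len(sim[0])
    # Gibbs kernel; cosine values lie in [-1, 1], so exp(s / eps) stays finite.
    K = [[math.exp(s / eps) for s in row] for row in sim]
    r, c = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [r / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def soft_aligned_similarity(patches, frames, eps=0.1):
    """Return (score, plan): the plan softly matches patches to frames."""
    sim = [[cosine(p, f) for f in frames] for p in patches]
    plan = sinkhorn_plan(sim, eps)
    score = sum(plan[i][j] * sim[i][j]
                for i in range(len(patches)) for j in range(len(frames)))
    return score, plan

# Toy usage: two patches that each match exactly one frame.
score, plan = soft_aligned_similarity([[1.0, 0.0], [0.0, 1.0]],
                                      [[1.0, 0.0], [0.0, 1.0]])
```

Because the score is built only from a pair-level quantity, it can be plugged into a contrastive loss that needs only paired segments, while the plan itself can be read as a local patch-to-frame alignment.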

Scalable Visual Pretraining for Language Intelligence

2026-07-10T17:57:03Z

The rapid progress of large foundation models has been driven predominantly by pretraining on large-scale text corpora. However, many forms of knowledge are conveyed through visual representations, where figures, typeset equations, and page layouts carry rich information that cannot be faithfully or completely captured by text alone. Yet current pretraining approaches discard these visual cues by converting visually rich sources, such as documents and web pages, into plain text for learning language intelligence. This paper challenges the default assumption that language models must be trained on text-only representations and shows that Visual Pretraining is a scalable learner for foundation model intelligence. To this end, we conduct a systematic study of unsupervised visual pretraining paradigms that directly leverage visual documents without text extraction. Across multiple backbones and benchmarks, visual pretraining on the same underlying corpora consistently outperforms text-only pretraining, offering an efficient pathway to scalable language intelligence.

Event Stream based Multi-Modal Video Anomaly Detection: A Benchmark Dataset and Algorithms

2026-07-10T05:58:52Z

Video anomaly detection (VAD) is critical for automated surveillance but remains fragile under challenging conditions such as illumination variations, fast motion, and complex backgrounds when relying solely on visible-light videos. To address these limitations, we propose EVAD, an event-enhanced VAD framework that jointly exploits conventional video and event streams captured by bio-inspired event cameras. Event sensors asynchronously capture brightness changes with high temporal resolution, offering robustness to motion blur and extreme lighting, and providing motion-salient cues complementary to video-based visual information. To support multi-modal VAD research, we construct a large-scale visible-event benchmark comprising 6.3 billion events and 376,368 video frames collected under diverse illumination levels, motion patterns, and background complexities, filling the gap of realistic and scalable datasets for event-based anomaly detection. Building upon this dataset, we design a contrastive multi-modal pretraining framework to learn discriminative event representations by aligning semantic embeddings across event streams, visible videos, and textual descriptions. An adaptive fusion module then dynamically integrates event-based temporal cues with video-based spatial semantics, improving robustness to environmental disturbances. Experiments on existing benchmarks and the proposed TJUTCM Pha dataset demonstrate that EVAD consistently outperforms competing methods, validating the effectiveness of event-based sensing for VAD in real-world scenarios.

Beyond Metadata: CAPRA for Hidden Subgroup Analysis under Missing Metadata in Medical Imaging

2026-07-10T05:33:57Z

Medical imaging models are often deployed without the demographic, acquisition, and quality metadata needed for subgroup auditing. Once those metadata disappear, clinically critical failure modes can be masked by strong aggregate performance, and many robust-learning methods lose the group structure they rely on. We present CAPRA, a calibrated proxy-axis framework for hidden subgroup analysis under missing metadata. CAPRA predicts image-derived semantic axes, calibrates axis posteriors on a small metadata-labeled split via patient-level cross-fitting, and organizes those posteriors into a calibrated subgroup interface that supports both deployment-time failure analysis and downstream robust learning without requiring subgroup labels at deployment. Across fundus, dermoscopy, and chest radiography, CAPRA reveals disparity patterns missed by metadata-only slicing, remains informative under dataset shift, and produces subgroup partitions that align more closely with explicit failure axes than image-only or latent-slice baselines. The same interface can also be reused by downstream robust learners, although those gains are domain-dependent. Overall, CAPRA turns hidden subgroup analysis under missing metadata into a calibrated, interpretable, and reusable subgroup interface for deployment-time analysis and robust transfer.

Event-Based Token Sequences for Audio-Conditioned Music-Game Level Modeling

2026-07-10T04:42:20Z

Procedural generation of music game levels is an exciting yet challenging problem, as levels must translate musical structure into interactive sequences of timed gameplay events. Most existing approaches formulate this task with frame-based representations, dividing audio into uniform time grids and predicting events at each frame. This leaves gameplay events implicit across many frames, making it hard to describe the event-level timing relations and longer-range structure found in human-authored levels. We use procedural generation as a practical setting to study how musical cues map to interactive event sequences. Inspired by event-based symbolic music modeling, we propose a token-level sequence formulation that casts level generation as a multimodal sequence-to-sequence problem. Conditioned on an audio excerpt and level metadata, the model generates a token sequence alternating gameplay-event and beat-shift tokens. This explicitly represents actions and their relative timing in beat space. Based on this formulation, we build a Transformer model. It outperforms representative frame-level baselines under event-level evaluation. It also enables systematic analysis of how audio supports rhythm-aligned event prediction beyond metadata conditioning.
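
To make the alternating token layout concrete, here is a minimal sketch of how timed events could be serialized as beat-shift and gameplay-event tokens and then recovered. The `SHIFT_`/`EVENT_` token spellings, the action names, and the use of exact fractions for beat offsets are hypothetical choices for illustration; the abstract does not specify the vocabulary.

```python
from fractions import Fraction

def encode(events):
    """Turn (beat, action) pairs into alternating SHIFT_/EVENT_ tokens.

    Each SHIFT token stores the beat offset from the previous event, so
    timing is relative and expressed in beat space rather than in frames.
    """
    tokens = []
    prev = Fraction(0)
    for beat, action in sorted(events, key=lambda e: e[0]):
        beat = Fraction(beat)
        tokens.append(f"SHIFT_{beat - prev}")
        tokens.append(f"EVENT_{action}")
        prev = beat
    return tokens

def decode(tokens):
    """Invert encode(): accumulate shifts to recover absolute beat positions."""
    events = []
    pos = Fraction(0)
    for tok in tokens:
        kind, _, value = tok.partition("_")
        if kind == "SHIFT":
            pos += Fraction(value)
        elif kind == "EVENT":
            events.append((pos, value))
    return events

# Toy level: a tap on beat 0, a hold half a beat later, a tap on beat 2.
level = [(Fraction(0), "tap"), (Fraction(1, 2), "hold"), (Fraction(2), "tap")]
tokens = encode(level)
```

Unlike a frame grid, the sequence length here grows with the number of events rather than with the duration of the audio.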

Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks

2026-07-08T20:42:46Z

With the growing adoption of VLMs, DMs, LLMs, and AFMs, these multimodal foundation models can inadvertently encode sensitive, copyrighted, biased, or unsafe cross-modal associations that originate from their training data. Retraining after deletion requests or policy updates is often impractical, and targeted forgetting remains difficult because knowledge is distributed across shared representations. Multimodal unlearning addresses this challenge by enabling selective removal across modalities while retaining overall utility. This survey offers a unified, system-oriented view of multimodal unlearning across vision, language, audio, and video, grounded in recent advances, emerging applications, and open problems. Our taxonomy enables systematic comparison across model architectures and modalities, clarifying trade-offs among deletion strength, retention, efficiency, reversibility, and robustness. This survey highlights open problems and practical considerations to support future research and deployment of multimodal unlearning. We release a curated repository: https://smsnobin77.github.io/Awesome-Multimodal-Unlearning/

Towards Robust Semantic Video Transmission over Block Erasure Channels

2026-07-08T18:05:46Z

This paper investigates semantic-aware neural joint source-channel coding (JSCC) for robust video transmission over block erasure channels. We propose a neural video compression framework exploring both spatial-domain and feature-domain designs. In the spatial domain, video frames are partitioned into blocks, enabling localized erasure handling and fine-grained robustness control via uniform erasure and two-level, semantic-guided non-uniform erasure strategies. In the feature domain, latent features are partitioned, enabling missing features to be semantically recovered while maintaining overall spatial consistency. Comprehensive experiments quantify reconstruction quality under varying uniform and non-uniform erasure probabilities. Our results show that spatial-domain JSCC excels at handling random localized losses, whereas feature-domain JSCC provides superior robustness to distributed erasures and maintains fidelity under low-loss scenarios. The analysis highlights the trade-offs between spatial continuity and semantic redundancy, offering insights for designing robust, task-aware video communication systems.
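
The spatial-domain setting can be pictured with a small simulation of a uniform block erasure channel: the frame is tiled into blocks and each block is dropped independently with the same probability. This is a sketch under assumptions (a 2D list of pixel values, zero fill for erased blocks, hypothetical function name), not the paper's pipeline. A semantic-guided non-uniform variant would presumably replace the single probability `p` with a per-block value.

```python
import random

def erase_blocks(frame, block, p, rng, fill=0):
    """Erase each block x block tile of a 2D frame independently with probability p.

    Returns the corrupted frame and the (top, left) corners of erased tiles.
    """
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    erased = []
    for top in range(0, h, block):
        for left in range(0, w, block):
            if rng.random() < p:
                erased.append((top, left))
                # Tiles at the border may be smaller than block x block.
                for y in range(top, min(top + block, h)):
                    for x in range(left, min(left + block, w)):
                        out[y][x] = fill
    return out, erased

# Toy usage: a 4x6 frame of ones, 2x2 blocks, 30% erasure probability.
frame = [[1] * 6 for _ in range(4)]
corrupted, erased = erase_blocks(frame, block=2, p=0.3, rng=random.Random(0))
```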

-8 dB SNR + 90% Packet Loss: MamVSC -- CSI-Guided Semantic Mamba for Extreme-Robust Video Semantic Communication

2026-07-08T11:33:38Z

Semantic communication, leveraging joint source-channel coding, is designed to mitigate semantic distortion introduced by the channel. However, most current studies focus solely on semantic deviation distortion caused by physical wireless channels, while overlooking semantic erasure distortion due to packet loss. A CSI-Guided Mamba-based video semantic wireless digital communication system (MamVSC) employing semantic grouping is proposed to simultaneously address both semantic deviation and erasure distortions. In this system, a semantic Mamba module, guided by channel state information (CSI) feedback, is utilized to dynamically adjust the granularity of extracted semantic information, adapting to channel conditions. Furthermore, a Semantic Channel Codec based on dynamic semantic clustering centers is introduced, where the distance between semantic vectors within the same semantic class and their corresponding semantic clustering center is dynamically adjusted according to channel conditions, enhancing robustness against channel noise. Additionally, an adaptive packet loss recovery module, dynamically adaptive to the CSI, is proposed. The system achieves an MS-SSIM greater than 0.6 and a PSNR exceeding 21 dB at an SNR of -8 dB and a packet loss rate of 90% over an AWGN channel.
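
For readers less used to decibel figures, the following sketch converts the quoted operating point into linear terms using the standard definitions of SNR in dB and PSNR. The peak value of 255 (8-bit pixels) is an assumption; the abstract does not state the dynamic range used for PSNR.

```python
import math

def db_to_linear(db):
    """Convert a power ratio in decibels to a linear ratio."""
    return 10 ** (db / 10)

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10 * math.log10(peak ** 2 / mse)

def mse_from_psnr(psnr_db, peak=255.0):
    """Invert psnr(): the mean squared error implied by a PSNR value."""
    return peak ** 2 / db_to_linear(psnr_db)

snr_linear = db_to_linear(-8.0)      # signal power / noise power, about 0.158
mse_at_21db = mse_from_psnr(21.0)    # about 516.5 under the 8-bit assumption
```

An SNR of -8 dB therefore means the noise power is roughly 6.3 times the signal power, and under the 8-bit assumption a PSNR of 21 dB corresponds to a root-mean-square pixel error of roughly 23 gray levels.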

DYNA-PRUNER: Input-Adaptive Data-Model Co-Pruning for Efficient and Scalable Spatio-Temporal Media Prediction

2026-07-08T10:07:41Z

Spatio-temporal prediction supports radar/satellite nowcasting and city-scale traffic monitoring, but modern models are often too expensive for real-time deployment. This stems from a mismatch between dense computation and strong input-dependent redundancy (e.g., calm seas or clear skies). To enable automated, resource-aware architecture optimization in scalable media analysis, we propose Dyna-Pruner, an end-to-end framework for input-dependent co-pruning of data and model structure. A shared-importance synchronization mechanism generates coupled masks that prune redundant regions and their corresponding computational units (e.g., convolutional filters), yielding per-sample sparse sub-networks at inference time. Experiments on WeatherBench, SEVIR, and TaxiBJ show seamless integration with CNN, RNN, and Transformer backbones, reducing FLOPs by up to $70\%$ and achieving a $2.5\times$ speedup on NVIDIA Jetson AGX Orin with negligible accuracy loss ($<1\%$).

DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

2026-07-08T06:29:12Z

Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens as a soft temporally co-registered segmentation prior. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains competitive or superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.
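
The boundary-detection step can be sketched as follows: compare consecutive audio embeddings and start a new segment wherever their cosine similarity drops. The fixed threshold and the function names are assumptions for illustration; the abstract only says that boundary candidates come from cosine-similarity discontinuities, and the DASH repository may use a different criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def detect_boundaries(embeddings, threshold=0.5):
    """Indices i where embeddings[i] starts a new segment.

    A boundary is declared when similarity to the previous embedding
    falls below the (assumed) threshold.
    """
    return [i for i in range(1, len(embeddings))
            if cosine(embeddings[i - 1], embeddings[i]) < threshold]

def to_segments(n, boundaries):
    """Turn boundary indices into half-open (start, end) segments over n items."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [n]
    return list(zip(starts, ends))

# Toy audio track: two frames of one "sound", then three frames of another.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
segments = to_segments(len(audio), detect_boundaries(audio))
```

The resulting variable-length segments are what a per-segment token budget would then be allocated over, in place of fixed windows.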

Unveiling the Visual Counting Bottleneck in Vision-Language Models

2026-07-08T01:28:50Z

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on a state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

2026-07-07T14:14:24Z

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
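
The disruption probe described above can be illustrated on plain lists of token vectors: pick the channels with the largest typical magnitude and zero them. Ranking channels by mean absolute activation is an assumption made here; the abstract says only that massive channels respond "consistently much larger than the rest", and the function names are hypothetical.

```python
def massive_channels(hidden, k):
    """Indices of the k channels with the largest mean absolute activation.

    hidden is a list of token vectors, all of the same width.
    """
    width = len(hidden[0])
    mean_abs = [sum(abs(tok[c]) for tok in hidden) / len(hidden)
                for c in range(width)]
    return sorted(range(width), key=lambda c: mean_abs[c], reverse=True)[:k]

def zero_channels(hidden, channels):
    """Return a copy of hidden with the given channels set to zero in every token."""
    drop = set(channels)
    return [[0.0 if c in drop else x for c, x in enumerate(tok)]
            for tok in hidden]

# Toy hidden states: channel 1 dominates every token.
hidden = [[0.1, 50.0, -0.2, 0.3],
          [0.2, -48.0, 0.1, -0.1]]
probe = zero_channels(hidden, massive_channels(hidden, k=1))
```

The control condition in the abstract corresponds to calling `zero_channels` with an equally sized set of low-magnitude channels instead.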

Vertigo Vertigo: Reconstructing a Cinematic Ideal through its Predictive AI Double

2026-07-07T12:56:43Z

Vertigo Vertigo is a scene-for-scene AI reconstruction of Hitchcock's Vertigo (1958), generated from only 2.78% of the original film's frames. Using this sparse set of keyframe anchors, we perform first-last frame interpolation via a large video diffusion model to predict the intervening sequences. Vertigo is itself a film about the obsessive reconstruction of an artificial ideal; Vertigo Vertigo extends this logic to the material of the film, treating the canonical text as a probe for the normative conventions of classical cinema encoded within generative systems. Evaluated through computational analysis and critical feedback from media theorists (Lev Manovich, Shane Denson, Kevin L. Ferguson), the artifact demonstrates remarkable structural fidelity: 73.1% of frames are recognizable as plausible renditions of Vertigo and only 3.6% fail catastrophically. This fidelity suggests that cinematic norms are deeply compressed within the model's latent priors. Aesthetically, the reconstruction is rendered as an unstable overlay between the original film and its predictive shadow, fueling a persistent doubt in the viewer's perception of authenticity -- a 21st-century vertigo. The work argues that generative media is not a paradigm shift from cinema but an acceleration of its logic of desire and false authenticity, extending from classical Hollywood through to the predictive media environments now reshaping contemporary perception.

Tuning-Free Latent Diffusion Models for Ultrahigh-Resolution Image Editing

2026-07-07T10:52:10Z

Recent diffusion-based generative models have shown impressive performance in image generation and editing. However, due to memory limitations and the high cost of collecting high-resolution training images, existing methods are typically restricted to inputs with linear resolutions below 1K. In contrast, photos captured by modern mobile devices often reach linear resolutions up to 8K, revealing a significant gap between current capabilities and real-world demands. Simply upscaling low-resolution edited results often yields visually enlarged but blurry images that lack fine details. This paper introduces UltraDiffEdit, a novel, tuning-free image editing framework that extends off-the-shelf latent diffusion models (LDMs) to ultrahigh resolutions. UltraDiffEdit employs a multi-scale progressive editing strategy, iteratively blending high-resolution edited content with unedited areas in a coarse-to-fine manner. We employ multi-patch encoding to preserve both edited and unedited visual details within the latent space. To mitigate editing artifacts, our global-local consistency denoising technique consistently integrates edited and unedited latent features, ensuring smooth transitions at editing boundaries from the latent representation to the final image. We also introduce a patch-based hybrid sampling approach that captures local, intermediate, and global features, ensuring semantic coherence and enhancing fine detail during denoising. We conduct extensive experiments demonstrating UltraDiffEdit's superior editing quality and flexibility: it can handle image resolutions up to 8K using only a single NVIDIA GeForce RTX 3090 GPU. The source code is publicly available at https://github.com/LonglongaaaGo/UltraDiffEdit.

WebRetriever: A Large-Scale Comprehensive Benchmark for Efficient Web Agent Evaluation

2026-07-07T10:27:31Z

As web agents increasingly demonstrate capabilities in automated task execution, the development of robust evaluation frameworks for assessing their navigation and task completion performance has emerged as a critical research priority. However, existing benchmarks exhibit fundamental limitations. First, they suffer from insufficient scale and limited domain diversity, constraining comprehensive evaluation of cross-domain generalization. Second, prevailing LLM-as-Judge evaluation methodologies inadequately capture fine-grained interaction semantics, particularly regarding precise query formulation and filtering operations. Third, current benchmarks predominantly emphasize navigation success metrics while neglecting critical requirements for real-world deployment scenarios. To address these limitations, we introduce WebRetriever, a large-scale benchmark encompassing 800 websites and 1,550 tasks across diverse domains, including consumer, professional, and enterprise sectors, with comprehensive coverage of user intent patterns. We propose NavEval (Navigation Evaluation), a novel LLM-as-Judge framework that leverages rich interaction context beyond visual screenshots, achieving state-of-the-art alignment with human judgment across multiple evaluation datasets. Furthermore, we establish three complementary evaluation protocols that collectively provide holistic assessment of web agent capabilities: navigation proficiency, knowledge-assisted interaction, and end-to-end task completion with information extraction. Extensive experimental analysis reveals substantial performance disparities across evaluation protocols, demonstrating that navigation success alone is an insufficient predictor of real-world application effectiveness. WebRetriever delivers fine-grained diagnostic insights into agent capabilities and establishes a rigorous foundation for advancing web agent research and development.