https://arxiv.org/api/lBKnlgLfsRT/wyVLcVtJxcGuTkE 2026-07-17T22:31:20Z 9772 45 15 http://arxiv.org/abs/2511.19474v6 Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks 2026-07-07T10:11:01Z

Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

2025-11-22T07:37:21Z Accepted by ECCV 2026 Jie Li Hongyi Cai Mingkang Dong Muxin Pu Shan You Fei Wang Tao Huang http://arxiv.org/abs/2607.05971v1 Multimodal Video-to-Music Recommendation via Semantic Retrieval and Temporal Reranking 2026-07-07T08:04:56Z

We present VTMR, a two-stage framework for Video-To-Music Recommendation. In Stage 1, VTMR aligns comprehensive video and music signals in a joint audio-visual-text representation space and efficiently retrieves semantically compatible candidates using coarse global embeddings. In Stage 2, it reranks the retrieved candidates by attending to the temporal sequences of both video and music, thereby capturing fine-grained temporal correspondence. Evaluated on the video-to-music recommendation task, the multimodal retrieval stage improves R@10 from 14.2 to 15.9 and Median Rank from 75 to 58 over the strongest baseline; the temporal reranker further boosts R@10 to 18.3 and Median Rank to 46, demonstrating complementary gains from richer query encoding and temporal alignment. A human preference study confirms that VTMR is on par with a commercial baseline in overall preference, while outperforming a generative baseline in music quality.
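The two metrics quoted above can be computed from per-query ranks. The following is a minimal sketch, assuming R@10 is reported as a percentage of queries and that ties are broken in favour of the ground-truth item; the function names and the toy score matrix are illustrative and do not come from the paper.

```python
from statistics import median


def rank_of_target(scores, target_index):
    """1-based rank of the ground-truth item when candidates are sorted by descending score."""
    target_score = scores[target_index]
    # Count candidates that score strictly higher than the ground truth.
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median of the ground-truth ranks over all queries (lower is better)."""
    return median(ranks)


# Toy similarity scores (hypothetical): one row per video query, one column per music candidate.
toy_scores = [
    [0.9, 0.1, 0.3],  # ground truth is candidate 0
    [0.2, 0.4, 0.8],  # ground truth is candidate 1
    [0.5, 0.7, 0.6],  # ground truth is candidate 2
]
toy_targets = [0, 1, 2]
toy_ranks = [rank_of_target(s, t) for s, t in zip(toy_scores, toy_targets)]
```

Under this reading, a reranker helps when it moves ground-truth items upward within the retrieved candidate list, which raises R@K and lowers the median rank at the same time.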

2026-07-07T08:04:56Z Accepted for publication at The Machine Learning for Audio workshop at ICML 2026 Seungheon Doh Minhee Lee Sangmoon Lee Ben Sangbae Chon Juhan Nam http://arxiv.org/abs/2312.15320v3 GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Text 2026-07-06T13:55:01Z

Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests, and genetic tests over a prolonged period of time, a process commonly described as the diagnostic odyssey. Addressing this odyssey has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features that artificial intelligence algorithms can use to facilitate clinical diagnosis, to prioritize candidate diseases for further laboratory or genetic testing, and to support the phenotype-driven reinterpretation of genome or exome sequencing data. Existing methods that use frontal facial photographs were built on conventional convolutional neural networks, rely exclusively on facial images, and cannot capture non-facial phenotypic traits or demographic information that are essential for accurate diagnosis. Here we introduce GestaltMML, a multimodal machine learning approach based solely on the Transformer architecture. It integrates facial images, demographic information (age, sex, ethnicity), and clinical notes (optionally a list of Human Phenotype Ontology terms) to improve prediction accuracy. We evaluate GestaltMML on 528 diseases from the GestaltMatcher Database and on several in-house and published cohorts, including Beckwith-Wiedemann syndrome, Sotos syndrome, NAA10-related neurodevelopmental syndrome, Cornelia de Lange syndrome, and KBG syndrome. GestaltMML improves on the state-of-the-art image-only ensembled model, narrows the diagnostic accuracy gap for patients from under-represented ancestries, and clarifies when multimodal fusion is beneficial and when image-only inference is preferable. The results suggest that GestaltMML can greatly narrow the candidate diagnoses of rare diseases and may facilitate the reinterpretation of sequencing data.

2023-12-23T18:40:25Z Preprint updated Da Wu Zhanliang Wang Hongzhuo Chen Jingye Yang Cong Liu Tzung-Chien Hsieh Elaine Marchi Justin Blair Peter Krawitz Chunhua Weng Wendy Chung Gholson J. Lyon Ian D. Krantz Jennifer M. Kalish Kai Wang http://arxiv.org/abs/2607.04851v1 SleepBand: Single-Source Domain Generalization for Sleep Staging via Physiologically Structured Spectral Modeling 2026-07-06T09:21:54Z

Generalizing sleep staging models to unseen datasets is challenging, and typical domain generalization (DG) methods often rely on multiple source domains or domain labels that are rarely available in practice. We tackle the stricter and more practical setting of single-source domain generalization: training on a single labeled source dataset, without domain labels or access to target data. We present SleepBand, a physiology-guided framework that embeds oscillatory priors via a learnable Morlet filter bank and a structured integration-and-recalibration pipeline. This anchors representations to domain-invariant sleep rhythms (e.g., slow waves, spindles), reducing reliance on dataset-specific artefacts. On five public datasets, SleepBand achieves state-of-the-art SDG performance and remains competitive under leave-one-domain-out (multi-source) DG. Analyses show that the learned filters align with canonical neurophysiology and that robustness stems from focusing on narrowband, physiologically meaningful cues. Our results suggest that principled, physiology-aware inductive biases are a promising path for robust single-domain sleep staging. Code is available at https://github.com/lzcn/sleep-band
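To make the idea of a Morlet filter concrete, here is a minimal sketch of a single real-valued Morlet kernel and its frequency selectivity. It is not the SleepBand implementation: the parameter names, the 100 Hz sampling rate, and the 13 Hz centre frequency (chosen as a value inside the commonly cited spindle band) are assumptions for illustration. In a learnable filter bank, the centre frequency and width would be trainable parameters rather than constants.

```python
from math import cos, exp, pi


def morlet_kernel(center_hz, sigma_s, sample_rate_hz, half_width_s):
    """Real Morlet kernel: a cosine at center_hz under a Gaussian envelope of width sigma_s."""
    n = int(round(half_width_s * sample_rate_hz))
    kernel = []
    for k in range(-n, n + 1):
        t = k / sample_rate_hz
        envelope = exp(-t * t / (2.0 * sigma_s ** 2))
        kernel.append(envelope * cos(2.0 * pi * center_hz * t))
    return kernel


def response(kernel, signal_hz, sample_rate_hz):
    """Magnitude of the kernel's inner product with a zero-phase cosine at signal_hz."""
    n = len(kernel) // 2
    return abs(
        sum(
            w * cos(2.0 * pi * signal_hz * (k - n) / sample_rate_hz)
            for k, w in enumerate(kernel)
        )
    )


# Hypothetical settings: 100 Hz sampling, 13 Hz centre frequency, 0.1 s envelope width.
spindle_kernel = morlet_kernel(center_hz=13.0, sigma_s=0.1, sample_rate_hz=100.0, half_width_s=0.5)
in_band = response(spindle_kernel, 13.0, 100.0)
out_of_band = response(spindle_kernel, 2.0, 100.0)
```

The kernel responds strongly to an oscillation at its centre frequency and only weakly to one far outside its band, which is the narrowband behaviour the abstract attributes to the learned filters.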

2026-07-06T09:21:54Z Zhi Lu Yang Hu Yan Chen http://arxiv.org/abs/2607.04839v1 Discovering shared interpretable operations in image compression autoencoders 2026-07-06T09:10:58Z

With the increasing adoption of deep learning for applications such as image compression, improvements in the rate-distortion trade-off have been achieved at the cost of increasingly large and opaque "black-box" models. Autoencoders are among the most widely used architectures for this task; however, without a clear understanding of their internal behavior, these models tend to grow in complexity to achieve further performance gains. In this paper, we investigate whether universal behaviors can be detected from the internal operations of bias-free autoencoders through Jacobian analysis. If such behaviors exist, they may be extracted to design low-complexity image compression models inspired by high-complexity deep learning architectures.

2026-07-06T09:10:58Z Caroline Mazini Rodrigues (COMPACT) Nicolas Keriven (CNRS, IRISA, COMPACT) Thomas Maugey (Sirocco, Inria-EPFL, COMPACT) http://arxiv.org/abs/2607.04606v1 CompressedVQA-AEV: Full-Reference and No-Reference Quality Assessment Models for Asymmetric Encoded Videos 2026-07-06T02:23:39Z

This report presents our solutions to the QoMEX 2026 Grand Challenge on Video Quality Assessment for Asymmetric Encoded Videos, comprising a full-reference (FR) model, CompressedVQA-AEV-FR, and a no-reference (NR) model, CompressedVQA-AEV-NR. The FR approach leverages a Swin-B backbone to extract multi-stage similarity statistics between reference and distorted videos for quality prediction. For the NR setting, our model employs complementary frame-level encoders based on SigLIP2 and Swin-B, followed by temporal mean pooling and cross-fold ensembling to estimate perceptual quality without reference data. Our CompressedVQA-AEV-FR achieves first place in the FR track of the QoMEX 2026 Grand Challenge, while CompressedVQA-AEV-NR secures fourth place in the NR track, demonstrating the effectiveness of our proposed models. The code is available at https://github.com/sunwei925/CompressedVQA-AEV.
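The NR pipeline's last two steps, temporal mean pooling and cross-fold ensembling, can be sketched as follows. This is an illustration under stated assumptions, not the released code: it assumes the ensemble is a plain average of per-fold predictions, and the regression heads are hypothetical callables standing in for trained models.

```python
def temporal_mean_pool(frame_features):
    """Average per-frame feature vectors into a single video-level vector."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[i] for frame in frame_features) / n_frames for i in range(dim)]


def ensemble_folds(fold_predictions):
    """Average the quality scores predicted by models trained on different folds (assumed rule)."""
    return sum(fold_predictions) / len(fold_predictions)


def predict_quality(frame_features, fold_heads):
    """Pool frame features over time, score the pooled vector with each fold's head, then average."""
    pooled = temporal_mean_pool(frame_features)
    return ensemble_folds([head(pooled) for head in fold_heads])


# Hypothetical inputs: two frames with 2-D features and two toy "fold" heads.
toy_frames = [[1.0, 2.0], [3.0, 4.0]]
toy_heads = [lambda v: sum(v), lambda v: 2.0 * sum(v)]
toy_score = predict_quality(toy_frames, toy_heads)
```

Mean pooling discards frame order, so any temporal sensitivity in such a design would have to come from the frame-level features themselves.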

2026-07-06T02:23:39Z CompressedVQA-AEV-FR achieves first place in the FR track of QoMEX 2026 Grand Challenge Wei Sun Xingwei Liu Dandan Zhu Xiangyang Zhu Weixia Zhang Guangtao Zhai http://arxiv.org/abs/2607.04553v1 Lights, Camera, Carbon: Architectural Scaling Laws for Video Generation Energy Consumption 2026-07-05T23:58:29Z

We present a bidirectional framework for estimating the energy consumption of text-to-video (T2V) and text-to-video-audio (T2VA) models from architectural first principles and observable generation parameters such as resolution and duration, requiring no access to weights, model size, or implementation details. Forward, it predicts energy from generation parameters and architectural principles; backward, it recovers architectural scaling behavior from observed inference times, with accuracy serving as a criterion for architectural validity. Building on the established compute-bound nature of video diffusion models, we demonstrate that each model's energy profile obeys theoretically derived scaling laws, decomposing into quadratic and linear terms whose coefficients directly reflect the underlying architectural complexity. Validated across six open-source models spanning 8.3B-27B parameters and three GPU configurations, this decomposition achieves below 3% MAPE across all architectures. This approach offers a standardized, empirically and theoretically grounded framework for sustainability benchmarking across T2V models and architectures.
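A decomposition into quadratic and linear terms can be fitted and scored with a few lines of code. The sketch below assumes a model of the form E(x) = a*x^2 + b*x, where x is a hypothetical scalar workload proxy (for example, a token count derived from resolution and duration); the abstract does not state the exact variable or fitting procedure, so both are assumptions. MAPE is computed in the usual way as the mean absolute percentage error.

```python
def fit_quadratic_linear(xs, ys):
    """Least-squares fit of y ~ a*x**2 + b*x (no intercept) via the 2x2 normal equations."""
    s4 = sum(x ** 4 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s2 = sum(x ** 2 for x in xs)
    t2 = sum((x ** 2) * y for x, y in zip(xs, ys))
    t1 = sum(x * y for x, y in zip(xs, ys))
    det = s4 * s2 - s3 * s3
    if det == 0:
        raise ValueError("need at least two distinct non-zero x values")
    a = (t2 * s2 - t1 * s3) / det
    b = (s4 * t1 - s3 * t2) / det
    return a, b


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic measurements (hypothetical) generated from a = 0.5, b = 2.0.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_energy = [0.5 * x ** 2 + 2.0 * x for x in toy_x]
a_hat, b_hat = fit_quadratic_linear(toy_x, toy_energy)
toy_pred = [a_hat * x ** 2 + b_hat * x for x in toy_x]
toy_mape = mape(toy_energy, toy_pred)
```

On noise-free synthetic data the fit recovers the generating coefficients; on real measurements the relative size of the two coefficients would indicate which term dominates at a given workload.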

2026-07-05T23:58:29Z 17 pages Nidhal Jegham Boris Gamazaychikov Sasha Luccioni http://arxiv.org/abs/2607.04438v1 ResearchStudio-Reel: Automate the Last Mile of Research from Paper to Poster, Video, and Blog 2026-07-05T17:59:33Z

Research dissemination, turning a paper into a poster, a talk video, and a blog post, is still a manual last mile. Prior automation treats each artifact in isolation: each pipeline re-extracts the paper from scratch, usually ships one-way renders the author cannot reopen in PowerPoint or Word, and gates quality on soft VLM-preference scores that plateau while load-bearing sections still read as empty. We argue this last mile is best built as a composition of skills: thin agent-readable contracts that share one upstream extractor and wrap deterministic primitives in a measured-fill loop whose exits are hard pass/fail render gates. We instantiate this as ResearchStudio-Reel, five Claude Code and Codex skills organized into one shared extractor (Paper2Assets), three editable generators (Paper2Poster, Paper2Video, Paper2Blog), and one interactive convergence layer (Paper2Reel). Paper2Assets extracts each paper once into a shared bundle that can be reused by every downstream skill; the three generators produce a print-ready poster, a synchronized talk video, and a bilingual blog that stay factually consistent and round-trip through PowerPoint or Word; Paper2Reel then binds all three into a self-contained HTML viewer whose section-level clicks jump the video, slides, captions, and blog to matching content. On the Paper2Poster benchmark, our posters lead every aesthetic and information sub-criterion against both prior automated systems and single-shot frontier LLMs, surpassing the authors' own on aesthetics under two held-out VLM judges and winning overall on 84% to 93% of papers; capability audits further show that, by uniquely pairing narration-aligned on-slide highlights with a bilingual blog gated by layout-aware DOCX repair, ResearchStudio-Reel is the only pipeline to ship all three editable artifacts. The project is available at https://aka.ms/ResearchStudio

2026-07-05T17:59:33Z Lingao Xiao Yalun Dai Yangyu Huang Qihao Zhao Wenshan Wu Hugo He Ruishuo Chen Jin Jiang Qianli Ma Jiahuan Zhang Xin Zhang Ying Xin Yang Ou Yan Xia Scarlett Li Longbo Huang Zhipeng Zhang Yang He Yap Kim Hui Yan Lu http://arxiv.org/abs/2607.04425v1 UI-MOPD: Multi-Platform On-Policy Distillation for Continual GUI Agent Learning 2026-07-05T17:37:12Z

Recent advances in multimodal foundation models and agent systems have driven GUI agents from single-platform task execution toward cross-platform interaction. However, building multi-platform GUI agents remains challenging. On one hand, high-quality and executable cross-platform interaction trajectories are still scarce, and existing data often suffer from limited platform coverage. On the other hand, different platforms exhibit distinct interaction conventions, making joint or continual training prone to behavioral pattern mixing, platform-specific capability degradation, and catastrophic forgetting. To address these challenges, we construct Uni-GUI, a high-quality cross-platform GUI interaction dataset, and propose UI-MOPD, the first method that incorporates multi-teacher on-policy distillation into continual learning for GUI agents. UI-MOPD dynamically selects a platform-specific teacher according to the current environment and transfers platform-specific behavioral priors to a shared policy through platform-conditioned distillation, enabling adaptation to new platforms while preserving capabilities on existing ones. Experiments on OSWorld and MobileWorld show that UI-MOPD achieves task success rates of 38.2% and 12.0%, respectively, demonstrating its effectiveness in balancing cross-platform capability retention and new-platform adaptation. Project page: https://elispectre.github.io/UI-MOPD/.

2026-07-05T17:37:12Z Technical report. 25 pages, 5 figures, 7 tables Niu Lian Alan Chen Zhehao Yu Chengzhen Duan Fazhan Liu Hui Liu Pei Fu Jian Luan Yaowei Wang Shu-Tao Xia Jinpeng Wang http://arxiv.org/abs/2502.12096v5 Token Communications: A Large Model-Driven Framework for Cross-modal Context-aware Semantic Communications 2026-07-03T18:33:04Z

In this paper, we introduce token communications (TokCom), a large model-driven framework to leverage cross-modal context information in generative semantic communications (GenSC). TokCom is a new paradigm, motivated by the recent success of generative foundation models and multimodal large language models (GFM/MLLMs), where the communication units are tokens, enabling efficient transformer-based token processing at the transmitter and receiver. We discuss the potential opportunities and challenges of leveraging context in GenSC, explore how to integrate GFM/MLLM-based token processing into semantic communication systems to leverage cross-modal context effectively at affordable complexity, and present the key principles for efficient TokCom at various layers in future wireless networks. In a typical image semantic communication setup, we demonstrate a significant improvement in bandwidth efficiency, which TokCom achieves by leveraging the context information among tokens. Finally, potential research directions are identified to facilitate the adoption of TokCom in future wireless networks.

2025-02-17T18:14:18Z Accepted at IEEE Wireless Communications Magazine Li Qiao Mahdi Boloursaz Mashhadi Zhen Gao Rahim Tafazolli Mehdi Bennis Dusit Niyato http://arxiv.org/abs/2607.03494v1 Towards Standardized Light Field Quality Assessment: Hybrid Subjective Benchmarking and Objective Metric Evaluation 2026-07-03T17:03:45Z

Benchmarking immersive media coding solutions, especially in the standardization context, requires reliable and reproducible subjective quality assessment (QA) procedures, along with objective quality metrics that remain accurate across different distortion types. This paper presents a standardized workflow for light field QA, developed and deployed in the context of JPEG Pleno standardization activities, which integrates benchmark generation, a hybrid subjective evaluation, and objective metric analysis into a common workflow. The benchmark is designed to encompass not only traditional coding-only artifacts but also distortions that arise in processing pipelines in which light field encoding is accompanied by view synthesis and reconstruction techniques. A hybrid subjective method is proposed, enabling fine-grained assessment by combining reference-anchored quality rating with targeted pairwise refinement in perceptually ambiguous regions. The reliability of subjective scores is verified using statistical consistency analyses between observers of two cohorts. Finally, a large set of objective metrics is systematically evaluated in terms of global prediction accuracy, local agreement in ambiguous quality regions, and robustness across distortion families. The results show that several metrics achieve strong agreement for coding-only stimuli, but their performance consistently drops when view synthesis distortions are included. The analysis further highlights the importance of the view-pooling strategy in the design of future light field quality metrics. The work provides a reproducible and standardization-ready framework for fine-grained light field QA, while identifying key limitations of current objective metrics under emerging coding pipelines. The subjectively annotated dataset is publicly available at https://plenodb.jpeg.org/lfqa/objectivecfp.

2026-07-03T17:03:45Z Saeed Mahmoudpour Mylene C. Q. Farias Gi-Mun Um Myllena A. Prado Ismael Seidel Leonardo de Sousa Marques Leonardo Andrade Shengyang Zhao Carla L Pagliari http://arxiv.org/abs/2501.15177v3 Audio-Language Models for Audio-Centric Tasks: A Systematic Survey 2026-07-03T16:46:09Z

Audio-Language Models (ALMs), trained on paired audio-text data, are designed to process, understand, and reason about audio-centric multimodal content. Unlike traditional supervised approaches that use predefined labels, ALMs leverage natural language supervision to better handle complex real-world audio scenes with multiple overlapping events. While ALMs demonstrate impressive zero-shot and task generalization capabilities, there is still a notable lack of systematic surveys that comprehensively organize and analyze these developments. In this paper, we present the first systematic review of ALMs with three main contributions: (1) comprehensive coverage of ALM works across speech, music, and sound from a general audio perspective; (2) a unified taxonomy of ALM foundations, including model architectures and training objectives; (3) establishment of a research landscape capturing mutual promotion and constraints among different research aspects, aiding in summarizing evaluations, limitations, concerns and promising directions. Our review helps researchers understand the development of existing technologies and future trends, while also providing valuable references for implementation in practical applications.

2025-01-25T11:15:06Z Under review Yi Su Jisheng Bai Qisheng Xu Kele Xu Yong Dou http://arxiv.org/abs/2607.03296v1 Taste-aware music retrieval from audio embeddings 2026-07-03T13:06:40Z

Crossmodal correspondences between sound and taste are well established in psychology and neuroscience, but largely absent from content-based multimedia retrieval. We formalise taste-from-audio prediction as a content-based music information retrieval benchmark over a perceptually validated multi-source corpus, comparing ten frozen audio encoders from the four HEAR families under a shared multi-task regression head, with gated late-fusion as a configurable variant. To assess the effectiveness of the models, we compute absolute error and rank correlation. The strongest systems predict the five tastes within a macro RMSE of 0.134; on held-out real music their error is less than half a single rater's deviation from the consensus (RMSE 0.13 vs. 0.28), so the model tracks the group consensus more closely than an average human rater; it is also well below the previous state-of-the-art baseline (0.219). On absolute error the encoders are statistically flat, with a single VGGish matching the best fusion, so gated late-fusion's advantage is confined to rank correlation (macro Pearson r 0.724 vs. 0.666). Operationalised as a content-based retrieval index, the predicted taste space ranks a 309-item pool far more faithfully than a CLAP-text baseline, which sits at chance; ridge probes and an audio-bandstop knockout read the strongest representations against documented sound-taste correspondences.
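The macro-averaged figures above can be read as per-taste metrics averaged with equal weight. Below is a minimal sketch under that assumption; the taste names and ratings are made-up placeholders, since the abstract does not list the five taste dimensions or give per-item data.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x, y):
    """Pearson correlation coefficient; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def macro_average(metric, true_by_dim, pred_by_dim):
    """Unweighted mean of a per-dimension metric over all rated dimensions."""
    values = [metric(true_by_dim[d], pred_by_dim[d]) for d in true_by_dim]
    return sum(values) / len(values)


# Hypothetical ratings for two placeholder taste dimensions over three tracks.
toy_true = {"taste_a": [0.0, 0.0, 0.0], "taste_b": [0.0, 0.0, 0.0]}
toy_pred = {"taste_a": [1.0, 1.0, 1.0], "taste_b": [3.0, 3.0, 3.0]}
toy_macro_rmse = macro_average(rmse, toy_true, toy_pred)
```

Because RMSE measures absolute deviation while a correlation measures agreement in ordering or trend, two systems can tie on one and differ on the other, which matches the pattern reported for the fusion variant.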

2026-07-03T13:06:40Z Accepted for publication in the proceedings of MusiCHER-2026, Special Session of IEEE CBMI 2026 Matteo Spanio Antonio Rodà http://arxiv.org/abs/2602.19778v5 Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation 2026-07-03T05:19:13Z

Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord annotations, which are costly to acquire. At the same time, open-weight pre-trained models are more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use the pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available. To prevent catastrophic forgetting of the representations learned in the first stage, we apply selective knowledge distillation (KD) from the teacher as a regularizer. In our experiments, two models (BTC, 2E1D) were used as students. In Stage 1, using only pseudo-labels, the BTC student achieves about 99% of the teacher's performance, while the 2E1D model achieves about 97% of the teacher's performance across seven standard mir_eval metrics. After continual training with labeled data in Stage 2, the resulting BTC student model consistently surpasses both the traditional supervised learning baseline and the original pre-trained teacher model across all metrics. The resulting 2E1D student model also outperforms the supervised baseline and approaches teacher-level performance, with both models demonstrating substantial gains on rare chord qualities.
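The Stage 2 objective described above, supervised training regularized by distillation from the teacher, can be sketched as a cross-entropy term plus a weighted KL term. This is a generic KD formulation rather than the paper's exact loss: the weight `kd_weight` is a hypothetical hyperparameter, and because the abstract does not say how the "selective" part chooses where to distill, the sketch takes a caller-supplied per-frame mask.

```python
from math import exp, log


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_probs, label):
    """Negative log-likelihood of the ground-truth chord class."""
    return -log(student_probs[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); zero when the two distributions match."""
    return sum(t * log(t / s) for t, s in zip(teacher_probs, student_probs) if t > 0)


def stage2_loss(student_logits, teacher_logits, labels, kd_mask, kd_weight=0.5):
    """Mean over frames of CE on ground truth plus kd_weight * KL on frames where kd_mask is True."""
    total = 0.0
    for s_logits, t_logits, label, use_kd in zip(student_logits, teacher_logits, labels, kd_mask):
        s_probs = softmax(s_logits)
        loss = cross_entropy(s_probs, label)
        if use_kd:
            loss += kd_weight * kl_divergence(softmax(t_logits), s_probs)
        total += loss
    return total / len(labels)


# Toy single-frame example (hypothetical logits over two chord classes).
agree = stage2_loss([[2.0, 0.0]], [[2.0, 0.0]], [0], [True])
no_kd = stage2_loss([[2.0, 0.0]], [[0.0, 2.0]], [0], [False])
with_kd = stage2_loss([[2.0, 0.0]], [[0.0, 2.0]], [0], [True])
```

When the student already matches the teacher, the KL term vanishes and only the supervised loss remains; the penalty grows as the student drifts from the teacher on the masked frames, which is how the term acts as a regularizer against forgetting.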

2026-02-23T12:32:53Z 8 pages, 6 figures, 4 tables. Accepted to DAFx26 Nghia Phan Rong Jin Gang Liu Xiao Dong http://arxiv.org/abs/2607.02963v1 Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning 2026-07-03T05:13:19Z

Dense video captioning aims to generate temporally grounded descriptions of video events, benefiting both event-level video understanding and generation. In this domain, autoregressive video large language models have emerged as a prevalent paradigm due to their strong generative and cross-modal modeling capacity. However, generating dense captions under the token-by-token paradigm severely limits inference efficiency and hinders scalability as video length and event density increase. In this work, we propose a parallelized autoregressive framework that not only improves generation efficiency but also enhances temporally grounded captioning performance. Our key insight is to exploit the weak local dependencies across temporally distinct events to restructure the causal dependency graph, thereby enabling lossless parallel generation. Specifically, tokens with weak cross-event dependencies can be decoded in parallel, while tightly coupled tokens within each event retain sequential decoding to preserve local semantic coherence. To realize this insight, we introduce two key components for lossless parallel decoding: (1) a latent global planning mechanism that automatically learns the event-level structure and produces compact tokens encoding global inter-event causality while adaptively aggregating event-level audio-visual semantics, guiding subsequent dependency restructuring and parallel decoding; and (2) an event-factorized parallel decoding mechanism that effectively balances local focus with global inter-event awareness. Experiments on various benchmarks demonstrate the clear advantage of our approach in both efficiency and performance in omni-modal event grounding and captioning. Project website: https://github.com/showlab/PadCaptioner.

2026-07-03T05:13:19Z ECCV 2026. Project website: https://github.com/showlab/PadCaptioner Wenzheng Zeng Siyi Jiao Chen Gao Hwee Tou Ng Mike Zheng Shou