http://arxiv.org/abs/2509.17901v4 Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy 2026-06-22T00:40:59Z

Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines, not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers about 76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25-fold token reduction (25 Hz to 1 Hz). Across 10 benchmarks, with and without filtering, audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will open-source our work at https://github.com/naver-ai/unimambamia-av.
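To make the 25-fold token reduction concrete, the sketch below compresses a 25 Hz feature sequence to 1 Hz by averaging each one-second window. Mean pooling is an assumption chosen for illustration; the abstract does not name the five compressor architectures it compares, and the function name `mean_pool_tokens` and the toy 3-dimensional features are hypothetical.

```python
def mean_pool_tokens(frames, factor=25):
    """Average every `factor` consecutive feature vectors into one.

    Illustrative baseline only (an assumption, not the paper's compressor):
    25 Hz input with factor=25 yields roughly one token per second.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]  # last window may be shorter
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Two seconds of toy 25 Hz features with 3 dimensions each.
features = [[float(t), 1.0, 0.0] for t in range(50)]
compressed = mean_pool_tokens(features)  # 50 tokens in, 2 tokens out
```

A learned compressor would replace the plain average with trainable parameters, but the input and output rates stay the same.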

2025-09-22T15:28:54Z Accepted to Interspeech 2026 Geewook Kim Minjoon Seo http://arxiv.org/abs/2606.22699v1 Catching Lies Without Sending the Video: Privacy-Preserving Multimodal Deception Detection 2026-06-21T22:30:55Z

Frontier multimodal models can guess whether a person is lying from a testimony video. To do so, they stream that raw face and voice to a third-party model. We ask whether the heavy media is needed at all. On the Real-life Trial Deception dataset, the Whissle on-device speech and vision stack extracts a compact digest: transcript, emotion, age, gender, intent distributions, a deception intent filter, fluency and rhythm, per-frame facial behaviour, and prosody. Under speaker-independent evaluation, we report three findings. A small classifier on this digest reaches AUC 0.741, matching Gemini 2.5 Pro on full video. Handing the digest to a frontier LLM reaches AUC 0.755 with Claude Opus 4.8 at 7.8x fewer input tokens, with no media leaving the device. The reported 75% accuracy is a speaker-leakage artifact. We release code and experiments.

2026-06-21T22:30:55Z Nikita Sharma Pranav Sara Karan Singla http://arxiv.org/abs/2606.22592v1 Illuminating English Letters Using a Flying Light Speck 2026-06-21T16:58:01Z

This paper presents the design and implementation of a Flying Light Speck (FLS) to illuminate English letters. The FLS uses its onboard camera and computing to localize and follow a trajectory to illuminate a letter. We evaluate the illuminations quantitatively and qualitatively. The latter is based on an IRB-approved human subject study with 20 participants. The obtained results show a 42 to 56 millimeter error that impacts the detection of letters. A key finding is that the order in which the illuminated letters are presented to subjects has a significant effect on detection duration.

2026-06-21T16:58:01Z Appeared in Proceedings of the 3rd International Workshop on UAVs in Multimedia: Capturing the World from a New Perspective (UAVM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 5 pages Hamed Alimohammadzadeh Shahram Ghandeharizadeh 10.1145/3728482.3757388 http://arxiv.org/abs/2606.22550v1 Training-Free Semantic Correction for Autoregressive Visual Models 2026-06-21T15:10:32Z

Autoregressive visual models (AVMs) based on next-scale prediction have emerged as a prominent paradigm for image and video synthesis. However, decomposing the generation process into discrete scales with varying granularities makes semantic errors in AVMs difficult to identify and correct, thereby undermining the quality of the final output. Prior efforts to enhance AVMs can be categorized into training-based and training-free approaches. Training-based efforts to enhance AVM generation quality come at substantial computational cost, while existing training-free methods neglect intermediate generation states, leaving semantic errors undiagnosed and allowing them to accumulate into the final output. In this paper, we focus on training-free paradigms and propose Gazer, a framework that integrates multimodal large language model feedback into the AVM sampling loop for in-generation semantic correction. Concretely, Gazer operates via two cooperating stages: the Reflective Diagnosis stage diagnoses semantic errors from intermediate states, while the Semantic Correction stage rewinds and rectifies the generation trajectory to realign with the target prompt. Experiments on compositional image and video benchmarks demonstrate that Gazer improves semantic alignment and compositional accuracy across multiple AVMs without additional training.

2026-06-21T15:10:32Z Junhao Chen Chanyu Zhu Zheqi Lv Keting Yin Shengyu Zhang http://arxiv.org/abs/2606.22499v1 Line Drawings using LightBenders: Authoring and Illuminating 2026-06-21T13:43:16Z

This study presents the hardware and software architecture of a transformative system for illuminating line drawings and letterforms. These mid-air illuminations are indoors and might be animated. The hardware contribution is a drone equipped with servo-actuated rod joints and a dense, addressable LED strip that enables arbitrary orientation, a LightBender. The software contributions are threefold. First, the system implements algorithms and heuristics to estimate the minimum number of LightBenders required to render a line drawing or letterform, stagger swarm formations to mitigate LightBender downwash, generate Swarm Flight and Lighting (SFL) files, and execute these files using a swarm of LightBenders to illuminate line drawings and letterforms. Second, a Blender add-on enables users to register LightBenders, author graphics and animations represented by swarms of LightBenders, and deploy the swarm for illumination through one-click functions. Third, users may import SVG files into either the Blender add-on or a standalone LB-Author tool to illuminate line drawings directly from vector graphics. We present results from an IRB-approved human subject study (n=21) to evaluate the impact of LightBender misalignment on the perceived illuminations. Obtained results demonstrate that the system's 10.1 mm maximum misalignment is perceptually acceptable across tested illuminations, with a median quality rating of 8 on a 0-10 scale.

2026-06-21T13:43:16Z Hamed Alimohammadzadeh Shahram Ghandeharizadeh http://arxiv.org/abs/2603.13312v3 Design-MLLM: A Reinforcement Alignment Framework for Verifiable and Aesthetic Interior Design 2026-06-21T12:55:55Z

Interior design is a requirements-to-visual-plan generation process that must simultaneously satisfy verifiable spatial feasibility and comparative aesthetic preferences. While recent multimodal large language models (MLLMs) offer a unified foundation for interpreting user intent and producing design rationales, our empirical analysis reveals a persistent contradiction in real-world deployment: MLLMs often produce layouts that are unbuildable and aesthetically inconsistent. These findings indicate that simply adding in-domain text is insufficient; effective interior design requires an alignment mechanism that separates hard constraints from soft preferences and coordinates them during optimization. To address this, we propose Design-MLLM, a reinforcement alignment framework that optimizes a feasibility-first preference objective via a dual-branch, aesthetic-oriented reward. Specifically, Design-MLLM (i) explicitly evaluates spatial feasibility using programmatic constraint checks, (ii) assesses aesthetic preference only among feasible candidates to avoid visually appealing but unexecutable shortcuts, and (iii) performs group-relative optimization to obtain stable preference signals. Through this process, Design-MLLM learns a controllable policy that consistently selects and generates solutions that are both executable and aesthetically coherent, rather than occasionally producing visually appealing but infeasible designs. Extensive experiments on various benchmark datasets demonstrate the advantages of Design-MLLM.
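The three steps listed above (feasibility checks, aesthetic scoring only among feasible candidates, group-relative optimization) can be sketched as a gated reward followed by within-group normalization. This is a minimal illustration under stated assumptions: the reward values, the `1.0 + aesthetic_score` form, and the mean/standard-deviation normalization are hypothetical choices, not the reward actually used by Design-MLLM.

```python
from statistics import mean, pstdev


def gated_reward(feasible, aesthetic_score):
    """Feasibility-first reward (hypothetical form).

    Infeasible layouts get 0 regardless of how good they look, so a
    visually appealing but unbuildable design can never outrank a
    feasible one.
    """
    return 1.0 + aesthetic_score if feasible else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate layouts for one prompt: (passes constraint checks, aesthetic score in [0, 1]).
candidates = [(True, 0.9), (True, 0.2), (False, 0.95), (False, 0.1)]
rewards = [gated_reward(f, a) for f, a in candidates]
advantages = group_relative_advantages(rewards)
```

In this toy group the infeasible candidate with the highest aesthetic score (0.95) still receives the lowest advantage, which is the behaviour the feasibility-first ordering is meant to enforce.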

2026-03-04T06:51:28Z Yuxuan Yang Xiaotong Mao Jingyao Wang http://arxiv.org/abs/2605.30994v3 Dynamic Interaction-Aware and Causality-Disentangled Framework for Multimodal Sentiment Analysis 2026-06-21T11:57:58Z

Although Multimodal Sentiment Analysis (MSA) effectively leverages rich information from language, visual, and acoustic modalities, existing methods still face two core challenges: 1) static conflict suppression mechanisms fail to adapt to dynamic variations across samples, and 2) the inherent sentimental bias within the language modality, which can misguide learning from other modalities, remains entangled. To this end, we propose a Dynamic Multimodal Causal Disentanglement and Adaptive Fusion Framework (MCAF). Its cornerstone is the Multi-Granularity Causal Dynamic Router and a Conditional Diffusion Denoising Module. First, we introduce a causal intervention module based on the information bottleneck principle, which builds a Structural Causal Model to disentangle sentimental bias from language features, yielding a "de-confounded" language representation as a pure guiding signal. Second, we devise a Dynamic Multimodal Router that evaluates the interaction states (complementary, conflicting, or redundant) among visual, acoustic, and de-confounded language signals in real time across three levels: feature, temporal, and modality, then adaptively allocates weights and routes information flow for fine-grained regulation. Finally, a lightweight Conditional Diffusion Denoising Module performs iterative denoising on the fused joint representation to explicitly filter out residual irrelevant information, generating a robust hyper-modality representation. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MCAF sets a new state of the art on key classification metrics, achieving an Acc-2/F1 of 86.52%/86.51% on MOSI and 86.72%/86.65% on MOSEI, while remaining highly competitive on others. Comprehensive analyses and visualizations further validate its efficacy in dynamically perceiving interactions, disentangling bias, and enhancing interpretability.

2026-05-29T08:28:48Z Guangyuan Dong Ziwei Hong Shenghao Liu Chenyu Wu Yuanyuan Fang Zihao Li Xudong Zhang Bingchen Liu Yuchen Zhang Haitao Ding Zhenzhou Zhou Ziyu Song http://arxiv.org/abs/2606.24916v1 SPORT: Spherical-PSNR-Optimized tRuncaTion for Power-Efficient 360-Degree Video Systems 2026-06-19T20:49:35Z

Memory bandwidth accounts for 30-40% of total power consumption in standalone virtual reality (VR) headsets, yet existing systems typically store the entire 360-degree frame at a uniform resolution regardless of viewer gaze. This paper presents SPORT (Spherical-PSNR Optimized tRuncaTion), a bit-truncation framework that reduces display-path memory power by storing only the most significant bits of pixels outside the user's field of view (FoV). Specifically, a new bit-truncation framework is developed to use weighted-to-spherically-uniform PSNR (WS-PSNR) directly in the optimization constraint, eliminating the metric inconsistency that arises when standard PSNR is used for a WS-PSNR quality target. Also, gaze-predictive tile classification compensates for the 9.33 ms end-to-end pipeline latency, reducing boundary misclassifications by 5.2 percentage points at a cost of only 0.01 ms. In addition, the developed SPORT-B variant, which keeps the FoV lossless, achieves 47.9% memory power saving and 47.9% bandwidth reduction across different 4K video sequences while satisfying all three per-region WS-PSNR thresholds and maintaining SSIM = 1.000 in the attended region. The full adaptive variant SPORT-A reaches 51.6% power saving, 3.1 percentage points more than a PSNR-based optimizer at equal measured quality. SPORT is validated on the TrunMEM360 flexible SRAM Application-Specific Integrated Circuit (ASIC) fabricated in SkyWater 130 nm CMOS, confirming byte-exact silicon-software agreement, with WS-PSNR and SSIM matching within 0.1 dB and 0.001. CACTI-based analysis confirms 48.72% DRAM leakage reduction and 36.4%/36.7% read/write energy reduction. The total motion-to-photon latency of 9.33 ms satisfies the 20 ms VR comfort budget with a 53.3% safety margin.
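As a rough illustration of the idea of storing only the most significant bits outside the field of view, the sketch below truncates 8-bit pixel values on a single row and keeps an in-FoV span lossless. The helper names, the 8-bit depth, the one-dimensional row, and the fixed `keep_bits` are assumptions for the example; the abstract's actual framework chooses truncation levels under a WS-PSNR constraint, which is not modelled here. The final lines reproduce the stated latency margin from the 9.33 ms and 20 ms figures.

```python
def truncate_pixel(value, keep_bits, depth=8):
    """Keep only the `keep_bits` most significant bits of a pixel value."""
    if not 0 <= keep_bits <= depth:
        raise ValueError("keep_bits must lie between 0 and depth")
    drop = depth - keep_bits
    return (value >> drop) << drop  # zero out the low-order bits


def truncate_outside_fov(row, fov_start, fov_end, keep_bits):
    """Leave pixels in [fov_start, fov_end) untouched; truncate the rest."""
    return [
        p if fov_start <= i < fov_end else truncate_pixel(p, keep_bits)
        for i, p in enumerate(row)
    ]


row = [200, 255, 37, 129, 90, 7]
stored = truncate_outside_fov(row, fov_start=2, fov_end=4, keep_bits=4)

# Safety margin implied by the abstract's numbers: (20 - 9.33) / 20.
margin = (20.0 - 9.33) / 20.0
```

Here `stored` is `[192, 240, 37, 129, 80, 0]`: the two in-FoV pixels are unchanged, and `margin` evaluates to about 0.5335, consistent with the reported 53.3%.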

2026-06-19T20:49:35Z Md. Sajjad Hossain Hasibur Rahman Hemel Kyle Mooney Yiwen Xu William Oswald Mario Renteria-Pinon Hritom Das Zhenlin Pei Jinhui Wang Na Gong http://arxiv.org/abs/2606.21655v1 PaaF: Raising the perceived quality of INR-Based Image Compression 2026-06-19T18:00:37Z

Implicit Neural Representations (INRs) have recently emerged as a promising paradigm for image compression, offering a fundamentally different approach from traditional and learned codecs. Nevertheless, INR-based methods for image compression suffer from long encoding times and a consistent performance gap in classic quality metrics such as PSNR. In this work, we explore the potential of purely INR-based compression methods and we propose PaaF (Picture as a Function), a novel INR-based image codec that introduces improved architectural design, adaptive quantization, and an efficient entropy coding scheme. These components are designed to enhance rate-distortion performance while preserving the simplicity and parallelizability of INR-based decoding. Experimental results demonstrate consistent improvements over existing INR-based methods in both quantitative metrics and perceptual quality. These findings highlight the potential of INR-based approaches and contribute to narrowing the gap between functional representations and more established compression paradigms.

2026-06-19T18:00:37Z Lorenzo Catania Dario Allegra http://arxiv.org/abs/2606.20847v1 LLM-Driven Heuristic Frame-Level Quantization Parameter Adaptation for VVenC 2026-06-18T18:32:47Z

Optimal frame-level quantization parameter (QP) allocation remains a persistent challenge in modern video encoders. The fixed-QP scheme widely adopted in practical systems is inherently content-agnostic, while classical Lagrangian rate-distortion optimization (RDO) methods often suffer from inaccurate multiplier settings. In this paper, we explore the use of large language models (LLMs) to automatically design RDO heuristics for frame-level QP adaptation. We construct a closed-loop evolutionary framework in which the LLM iteratively proposes RDO heuristics as algorithmic ideas with executable code, and these candidates are evaluated directly through encoding with the Fraunhofer Versatile Video Encoder (VVenC), where each heuristic acts as a scoring function that compares different QP choices based on the encoding statistics of past frames and current candidates. Experimental results across multiple test sets show that the evolved heuristic achieves promising rate-distortion improvements over both the fixed-QP scheme and the Lagrangian baseline. Further analysis reveals that the LLM can autonomously discover an adaptive heuristic that penalizes QP fluctuations via entropy-based terms, providing new insights into the design of RDO algorithms.

2026-06-18T18:32:47Z Liqiang He Yingwen Zhang Riyu Lu Meng Wang Shiqi Wang http://arxiv.org/abs/2606.20094v1 MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer 2026-06-18T11:13:50Z

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by 60% and reduces relative skin tone difference by 50% over Stable-Makeup, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2026-06-18T11:13:50Z Nefeli Andreou Angel Martínez-González Sabine Sternig Matthieu Guillaumin Epameinondas Antonakos Michael Opitz http://arxiv.org/abs/2606.19658v1 Denoising Implicit Feedback for Cold-start Recommendation 2026-06-17T23:50:45Z

Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually contains noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To denoise implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, user preferences for content remain stable, which allows us to infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF's superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on Kuaishou, a billion-user-scale short-video application, and has significantly improved various commercial metrics within cold-start scenarios.
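The pseudo-label step described above (inferring a user's interest in a cold item from content-similar warm items, then aggregating with similarity-based confidence) might look roughly like the following. Cosine similarity, the similarity-weighted average, and all names here are assumptions for illustration; the abstract does not specify DIF's actual confidence model or its uncertainty estimate, which is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two content vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_pseudo_label(cold_vec, warm_items, top_k=3):
    """Similarity-weighted average of the user's labels on warm items.

    warm_items: list of (content_vector, label), where label is 1 if the
    user engaged with that warm item and 0 otherwise (hypothetical format).
    Returns a soft pseudo-label in [0, 1]; 0.5 means "no evidence".
    """
    scored = sorted(
        ((cosine(cold_vec, vec), label) for vec, label in warm_items),
        reverse=True,
    )[:top_k]
    weights = [max(sim, 0.0) for sim, _ in scored]  # ignore dissimilar items
    total = sum(weights)
    if total == 0:
        return 0.5
    return sum(w * label for w, (_, label) in zip(weights, scored)) / total


cold_item = [1.0, 0.0]
warm_history = [([1.0, 0.0], 1), ([0.8, 0.6], 0), ([0.0, 1.0], 0)]
pseudo = aggregate_pseudo_label(cold_item, warm_history)
```

With these toy vectors the weights are 1.0, 0.8, and 0.0, so the soft pseudo-label is 1.0 / 1.8, roughly 0.56; a downstream step would then decide how strongly this value should override the observed noisy label.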

2026-06-17T23:50:45Z Accepted by KDD 2026 ADS Track Gaode Chen Shicheng Wang Shikun Li Rui Huang Xinghua Zhang Yunze Luo Shipeng Li Shiming Ge Ruina Sun Yinjie Jiang Jun Zhang http://arxiv.org/abs/2602.15707v2 Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU 2026-06-17T18:48:53Z

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. Observing that an off-the-shelf language model is talkative but does not always answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogue, with a 50% increase in precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

2026-02-17T16:41:51Z 5 figures. 5 more in appendix Rehana Mahfuz Yinyi Guo Erik Visser Phanidhar Chinchili http://arxiv.org/abs/2606.06926v2 SVHighlights: Towards Extremely Long Sport Video Highlight Detection 2026-06-17T14:48:14Z

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +2.50 in HIT@1, +4.04 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.

2026-06-05T05:42:19Z Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/ Donggyu Lee Youngbin Ki Jeonghun Kang Taehwan Kim 10.1145/3770855.3817564 http://arxiv.org/abs/2605.26672v2 Can We Hear from Events? Generating Speech from Event Camera 2026-06-17T13:28:50Z

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

2026-05-26T08:11:27Z Jingping Fang Lin Chen Chenyang Xu Tong Zhao Weidong Cai Xiaoming Chen