http://arxiv.org/abs/2603.26064v2MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection2026-04-05T15:20:21ZNon-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. 
Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.2026-03-27T04:11:02ZPeiyuan JiangYao LiuYanglei GanJiaye YangLu LiuDaibing YaoQiao Liuhttp://arxiv.org/abs/2602.23013v2SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling2026-04-05T15:06:06ZDetecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results by employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 97.1% and 97.5% on the MVTec-AD dataset, and 93.4% and 98.2% on the VisA dataset, respectively, surpassing prior state-of-the-art results. 
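The two stages of SubspaceAD described above (fit a PCA model to normal patch features, then score by reconstruction residual) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: synthetic vectors stand in for frozen DINOv2 patch features, and the function names and the `var_kept` threshold are hypothetical.

```python
import numpy as np


def fit_normal_subspace(features, var_kept=0.99):
    """Fit a PCA subspace to normal patch features of shape (n, dim)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of components whose cumulative variance reaches var_kept.
    k = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    return mean, vt[:k]


def anomaly_scores(features, mean, components):
    """Per-patch reconstruction residual with respect to the subspace."""
    centered = features - mean
    reconstructed = centered @ components.T @ components
    return np.linalg.norm(centered - reconstructed, axis=1)


# Synthetic stand-in for backbone features (assumption): normal patches
# lie near a 3-D subspace of a 16-D feature space, plus small noise.
rng = np.random.default_rng(0)
basis = np.eye(16)[:3]
normal_train = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 16))
normal_test = rng.normal(size=(50, 3)) @ basis + 0.01 * rng.normal(size=(50, 16))

# Anomalous patches carry an extra component outside the normal subspace.
offset = np.zeros(16)
offset[3:] = 1.0
anomalous = normal_test + offset

mean, components = fit_normal_subspace(normal_train)
print("mean residual, normal:   ", anomaly_scores(normal_test, mean, components).mean())
print("mean residual, anomalous:", anomaly_scores(anomalous, mean, components).mean())
```

In this toy setup the residual is small for patches that lie in the fitted subspace and large for patches with off-subspace energy, which is the property the abstract relies on for anomaly scoring.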
Code and demo are available at https://github.com/CLendering/SubspaceAD.2026-02-26T13:52:57ZAccepted to CVPR 2026Camile LenderingErkut AkdagEgor Bondarevhttp://arxiv.org/abs/2603.20475v2CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution2026-04-05T15:01:14ZVision-language models (VLMs) can answer spatial relation queries, yet a correct answer does not reveal whether the model truly uses directional evidence or merely exploits object layout. We present CREG (Compass Relational Evidence Graph), a training-free diagnostic framework that converts any token-level attribution map into a reference-parameterized compass distribution and evaluates it with Direction Alignment Error (DAE) and Edge Accuracy (EA). Across three VLMs and two primary benchmarks with native boxes (COCO-Pairs and VG-Spatial), plus supplementary VSR, CREG enables direct comparison of heterogeneous attribution methods on a shared directional scale; Chefer et al. is usually the strongest plug-in, indicating that the framework is not tied to our contrastive Grad-Act signal. Using CREG to probe VLM spatial attribution, we find that attribution is largely layout-driven: changing the queried direction leaves compass outputs near random, and re-centering the projection provides no advantage for the true reference origin. At the same time, CREG detects a limited residual directional component once image identity is controlled. This residual structure is practically useful: lower DAE predicts VLM correctness (AUC up to 0.65) and supports selective prediction and test-time re-ranking, improving accuracy by 14.0 percentage points on COCO-Pairs. 
CREG provides a unified way to measure directional organization in VLM attribution, making layout bias and residual relational signal explicit and quantifiable.2026-03-20T20:09:19ZKaizhen Tanhttp://arxiv.org/abs/2604.04142v1OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models2026-04-05T15:00:29ZPost training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.2026-04-05T15:00:29ZLiyu ZhangKehan LiTingrui HanTao ZhaoYuxuan ShengShibo HeChao Lihttp://arxiv.org/abs/2604.04136v1Rethinking Exposure Correction for Spatially Non-uniform Degradation2026-04-05T14:41:59ZReal-world exposure correction is fundamentally challenged by spatially non-uniform degradations, where diverse exposure errors frequently coexist within a single image. However, existing exposure correction methods are still largely developed under a predominantly uniform assumption. 
Architecturally, they typically rely on globally aggregated modulation signals that capture only the overall exposure trend. From the optimization perspective, conventional reconstruction losses are usually derived under a shared global scale, thus overlooking the spatially varying correction demands across regions. To address these limitations, we propose a new exposure correction paradigm explicitly designed for spatial non-uniformity. Specifically, we introduce a Spatial Signal Encoder to predict spatially adaptive modulation weights, which are used to guide multiple look-up tables for image transformation, together with an HSL-based compensation module for improved color fidelity. Beyond the architectural design, we propose an uncertainty-inspired non-uniform loss that dynamically allocates the optimization focus based on local restoration uncertainties, better matching the heterogeneous nature of real-world exposure errors. Extensive experiments demonstrate that our method achieves superior qualitative and quantitative performance compared with state-of-the-art methods. Code is available at https://github.com/FALALAS/rethinkingEC.2026-04-05T14:41:59ZAo LiJiawei SunLe DongZhenyu WangWeisheng Donghttp://arxiv.org/abs/2604.04135v1NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results2026-04-05T14:38:15ZThis paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. 
We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.2026-04-05T14:38:15ZShuhong LiuChenyu BaoZiteng CuiXuangeng ChuBin RenLin GuXiang ChenMingrui LiLong MaMarcos V. CondeRadu TimofteYun LiuRyo UmagamiTomohiro HashimotoZijian HuYuan GanTianhan XuYusuke KuroseTatsuya HaradaJunwei YuanGengjia ChangXining GeMache YouQida CaoZeliang LiXinyuan HuHongde GuChangyue ShiJiajun DingZhou YuJun YuSeungsang OhFei WangDonggun KimZhiliang WuSeho AhnXinye ZhengKun LiYanyan WeiWeisi LinDizhe ZhangYuchao ChenMeixi SongHanqing WangHaoran FengLu QiJiaao ShanYang GuJiacheng LiuShiyu LiuKui JiangJunjun JiangRunyu ZhuSixun DongQingxia YeZhiqiang ZhangZhihua XuZhiwei WangPhan The SonZhimiao ShiZixuan GuoXueming FuLixia HanChanghe LiuZhenyu ZhaoManabu TsukadaZheng ZhangZihan ZhaiTingting LiZiyang ZhengYuhao LiuDingju WangJeongbin YouYounghyuk KimIl-Youp KwakMingzhe LyuJunbo YangWenhan YangHongsen ZhangJinqiang CuiHong ZhangHaojie GuoHantang LiQiang ZhuBowen HeXiandong MengDebin ZhaoXiaopeng FanWei ZhouLinzhe JiangLinfeng LiLouzhe XuQi XuHang SongChenkun GuoWeizhi NieYufei LiXingan ZhanZhanqi ShiDufeng ZhangBoyuan TianJingshuo ZengGang HeYubao FuWeijie WangCunchuan Huanghttp://arxiv.org/abs/2604.04133v1Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks2026-04-05T14:29:35ZThere is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. 
However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes rather than as vision encoders for vision-language models. Model weights and training code are publicly available.2026-04-05T14:29:35ZRubén Moreno-AguadoAlba MagallónVictor MorenoYingying FangGuang Yanghttp://arxiv.org/abs/2604.04127v1SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection2026-04-05T14:15:39ZShip detection in Synthetic Aperture Radar (SAR) imagery is fundamentally challenged by inherent coherent speckle noise, complex coastal clutter, and the prevalence of small-scale targets. 
Conventional detectors, primarily designed for optical imagery, often exhibit limited robustness against SAR-specific degradation and suffer from the loss of fine-grained ship signatures during spatial downsampling. To address these limitations, we propose SARES-DEIM, a domain-aware detection framework grounded in the DEtection TRansformer (DETR) paradigm. Central to our approach is SARESMoE (SAR-aware Expert Selection Mixture-of-Experts), a module leveraging a sparse gating mechanism to selectively route features toward specialized frequency and wavelet experts. This sparsely-activated architecture effectively filters speckle noise and semantic clutter while maintaining high computational efficiency. Furthermore, we introduce the Space-to-Depth Enhancement Pyramid (SDEP) neck to preserve high-resolution spatial cues from shallow stages, significantly improving the localization of small targets. Extensive experiments on two benchmark datasets demonstrate the superiority of SARES-DEIM. Notably, on the challenging HRSID dataset, our model achieves a mAP50:95 of 76.4% and a mAP50 of 93.8%, outperforming state-of-the-art YOLO-series and specialized SAR detectors.2026-04-05T14:15:39Z10 pages, 4 figures, published to JSTARS(IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)Fenghao SongShaojing YangXi Zhouhttp://arxiv.org/abs/2511.17362v3ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP2026-04-05T13:47:04ZDespite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). 
Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP. Code is available at: https://github.com/kylin0421/ATAC 2025-11-21T16:30:06Z16 pagesLinxiang SuAndrás Baloghhttp://arxiv.org/abs/2506.19591v2Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications2026-04-05T13:32:55ZCloud cover in multispectral imagery (MSI) poses significant challenges for early season crop mapping, as it leads to missing or corrupted spectral information. Synthetic aperture radar (SAR) data, which is not affected by cloud interference, offers a complementary solution, but lacks sufficient spectral detail for precise crop mapping. To address this, we propose a novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), to reconstruct MSI data in cloud-covered regions by leveraging the temporal coherence of MSI and the complementary information from SAR through the attention mechanism. 
Comprehensive experiments, using rigorous reconstruction evaluation metrics, demonstrate that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR, effectively enhancing MSI image reconstruction in cloud-covered regions.2025-06-24T13:00:36ZThis paper has been accepted as a conference paper at the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)Lujun LiYiqun WangRadu State10.1109/IGARSS55030.2025.11243992http://arxiv.org/abs/2604.04117v1Efficient Onboard Spacecraft Pose Estimation with Event Cameras and Neuromorphic Hardware2026-04-05T13:31:44ZReliable relative pose estimation is a key enabler for autonomous rendezvous and proximity operations, yet space imagery is notoriously challenging due to extreme illumination, high contrast, and fast target motion. Event cameras provide asynchronous, change-driven measurements that can remain informative when frame-based imagery saturates or blurs, while neuromorphic processors can exploit sparse activations for low-latency, energy-efficient inference. This paper presents a spacecraft 6-DoF pose-estimation pipeline that couples event-based vision with the BrainChip Akida neuromorphic processor. Using the SPADES dataset, we train compact MobileNet-style keypoint regression networks on lightweight event-frame representations, apply quantization-aware training (8/4-bit), and convert the models to Akida-compatible spiking neural networks. We benchmark three event representations and demonstrate real-time, low-power inference on Akida V1 hardware. We additionally design a heatmap-based model targeting Akida V2 and evaluate it on Akida Cloud, yielding improved pose accuracy. 
To our knowledge, this is the first end-to-end demonstration of spacecraft pose estimation running on Akida hardware, highlighting a practical route to low-latency, low-power perception for future autonomous space missions.2026-04-05T13:31:44ZAI4SPACE workshop at CVPR 2026Arunkumar RathinamJules LecomteJost ReelsenGregor LenzAxel von ArnimDjamila Aouadahttp://arxiv.org/abs/2507.02212v2SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers2026-04-05T13:23:04ZGraphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. Although recent research increasingly incorporates visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Designing effective GAs requires advanced visualization skills, hindering their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, specifically designed to support GA selection and recommendation, and to facilitate research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA Recommendation, identifying figures within a given paper well-suited as GAs, and 2) Inter-GA Recommendation, retrieving GAs from other papers to inspire new GA designs. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric for fine-grained analysis of model behavior. CAR addresses limitations of traditional rank-based metrics by considering that not only an explicitly labeled GA but also other in-paper figures may plausibly serve as GAs. Benchmark results demonstrate the viability of our tasks and the effectiveness of CAR. Collectively, these establish a foundation for advancing scientific communication within AI for Science.2025-07-03T00:21:38Z28 pages, 21 figures, 9 tables. 
Accepted to CVPR Findings 2026. Project page: https://iyatomilab.github.io/SciGA/Takuro KawadaShunsuke KitadaSota NemotoHitoshi Iyatomihttp://arxiv.org/abs/2601.21670v2Improving Multimodal Learning with Dispersive and Anchoring Regularization2026-04-05T13:19:29ZMultimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion.
We identify representation geometry as a missing control axis in multimodal learning and propose a lightweight geometry-aware regularization framework. It enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms.
Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.2026-01-29T13:03:50ZZixuan XiaHao WangPengcheng WengYanyu QianYangxin XuWilliam DanFei Wanghttp://arxiv.org/abs/2604.04108v1Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation2026-04-05T13:02:18ZEmbodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) a semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) verification-driven cascade correction, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. 
We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA). HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.2026-04-05T13:02:18ZPeixin ChenGuoxi ZhangJianwei MaQing Lihttp://arxiv.org/abs/2604.04098v1A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming2026-04-05T12:36:23ZPrecision livestock farming requires accurate and timely heat stress prediction to ensure animal welfare and optimize farm management. This study presents a physics-informed digital twin (DT) framework combined with an uncertainty-aware, expert-weighted stacked ensemble for multimodal forecasting of Core Body Temperature (CBT) in dairy cattle. Using the high-frequency, heterogeneous MmCows dataset, the DT integrates an ordinary differential equation (ODE)-based thermoregulation model that simulates metabolic heat production and dissipation, a Gaussian process for capturing cow-specific deviations, a Kalman filter for aligning predictions with real-time sensor data, and a behavioral Markov chain that models activity-state transitions under varying environmental conditions. The DT's outputs, key physiological indicators such as predicted CBT, heat stress probability, and behavioral state distributions, are fused with raw sensor data and enriched through multi-scale temporal analysis and cross-modal feature engineering to form a comprehensive feature set. 
The predictive methodology is designed as a three-stage stacked ensemble, where stage 1 trains modality-specific LightGBM 'expert' models on distinct feature groups, stage 2 collects their predictions as meta-features, and stage 3 uses an Optuna-tuned LightGBM meta-model to yield the final CBT forecast. Predictive uncertainty is quantified via bootstrapping and validated using Prediction Interval Coverage Probability (PICP). Ablation analysis confirms that incorporating DT-derived features and multimodal fusion substantially enhances performance. The proposed framework achieves a cross-validated R² of 0.783, an F1 score of 84.25%, and a PICP of 92.38% for 2-hour-ahead forecasting, providing a robust, uncertainty-aware, and physically principled system for early heat stress detection and precision livestock management.2026-04-05T12:36:23ZRiasad AlviMohaimenul Azam Khan RaiaanSadia Sultana ChowaArefin Ittesafun AbianReem E MohamedMd Rafiqul IslamYakub SebastianSheikh Izzal AzidSami Azam
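
The Prediction Interval Coverage Probability (PICP) mentioned in the last abstract above is the fraction of true values that fall inside their prediction intervals. The sketch below shows one common way to compute it from bootstrap predictions. It is an illustration under assumptions, not the paper's procedure: the helper names are hypothetical, the percentile-interval construction is a generic choice, and the numbers are made up for the example.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_interval(predictions, alpha=0.05):
    """Percentile interval from the predictions of bootstrap-trained models."""
    ordered = sorted(predictions)
    lower = percentile(ordered, 100 * alpha / 2)
    upper = percentile(ordered, 100 * (1 - alpha / 2))
    return lower, upper


def picp(y_true, intervals):
    """Fraction of true values lying inside their (lower, upper) interval."""
    covered = sum(1 for y, (lo, hi) in zip(y_true, intervals) if lo <= y <= hi)
    return covered / len(y_true)


# Made-up example: each inner list holds the forecasts that several
# bootstrap-resampled models produced for one test sample.
bootstrap_preds = [
    [38.4, 38.6, 38.5, 38.7, 38.5],
    [39.0, 39.2, 39.1, 39.3, 39.1],
    [38.8, 38.9, 38.7, 38.8, 38.9],
]
y_true = [38.55, 39.6, 38.8]

intervals = [bootstrap_interval(p, alpha=0.1) for p in bootstrap_preds]
print("PICP:", picp(y_true, intervals))
```

A PICP close to the nominal level (for example 0.95 for `alpha=0.05`) suggests the intervals are well calibrated. Intervals built only from the spread of bootstrap models tend to capture model uncertainty rather than observation noise, so coverage below the nominal level is common with this simple construction.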