https://arxiv.org/api/WWaw6eiMDBRJlasFfTtweXZ2+bA 2026-04-01T08:35:29Z 30436 60 15 http://arxiv.org/abs/2603.24578v1 Vision-Language Models vs Human: Perceptual Image Quality Assessment 2026-03-25T17:54:07Z

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision-Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data, providing a systematic benchmark of VLMs for perceptual IQA. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.
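The abstract reports agreement as a correlation ρ without naming the statistic. As a minimal sketch, assuming ρ is a Spearman rank correlation between human and VLM scale values, the comparison could be computed as below. The function names and the five example scores are hypothetical and are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scale values for five stimuli (illustrative only).
human_scores = [0.10, 0.35, 0.50, 0.80, 0.95]
vlm_scores = [0.20, 0.30, 0.70, 0.60, 0.90]
rho = spearman_rho(human_scores, vlm_scores)
print(f"rho = {rho:.2f}")
```

A rank-based statistic is a natural fit here because it only asks whether the model orders the stimuli as observers do, not whether the raw scores share a scale.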

2026-03-25T17:54:07Z Imran Mehmood Imad Ali Shah Ming Ronnier Luo Brian Deegan http://arxiv.org/abs/2508.01981v6 Deep Feature-specific Imaging 2026-03-25T17:13:56Z

Modern photon-counting sensors are increasingly dominated by Poisson noise, yet conventional Feature-Specific Imaging (FSI), based on Principal Component Analysis (PCA), is optimized for additive Gaussian noise and variance preservation rather than task-specific objectives, leading to suboptimal performance and a loss of its advantages under Poisson noise. To address this, we introduce DeepFSI, a novel end-to-end optical-electronic framework. DeepFSI "unfreezes" PCA-derived masks, enabling a deep neural network to learn globally optimal measurement masks by computing gradients directly under realistic Poisson and additive noise conditions. Simulations and hardware experiments demonstrate that DeepFSI achieves improved classification accuracy and stronger transfer robustness compared to PCA-based FSI across varying photon budgets, particularly in Poisson-noise-dominant environments. DeepFSI also exhibits enhanced robustness to design choices and performs well under additive Gaussian noise, representing a significant advance for noise-robust computational imaging in photon-limited applications.

2025-08-04T01:39:29Z Yizhou Lu Andreas Velten http://arxiv.org/abs/2312.00357v2 A Generalizable Deep Learning System for Cardiac MRI 2026-03-25T15:08:04Z

Cardiac MRI allows for a comprehensive assessment of myocardial structure, function and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep-learning model is trained via self-supervised contrastive learning, in which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank and two additional publicly available external datasets. We explore emergent capabilities of our system and demonstrate remarkable performance across a range of tasks, including the problem of left-ventricular ejection fraction regression and the diagnosis of 39 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep-learning system is capable of not only contextualizing the staggering complexity of human cardiovascular disease but can be directed towards clinical problems of interest, yielding impressive, clinical-grade diagnostic accuracy with a fraction of the training data typically required for such tasks.

2023-12-01T05:27:29Z Published in Nature Biomedical Engineering; Supplementary Appendix available on publisher website. Code: https://github.com/rohanshad/cmr_transformer Nat. Biomed. Eng (2026) Rohan Shad Cyril Zakka Dhamanpreet Kaur Mrudang Mathur Robyn Fong Joseph Cho Ross Warren Filice John Mongan Kimberly Kalianos Nishith Khandwala David Eng Matthew Leipzig Walter R. Witschey Alejandro de Feria Victor A. Ferrari Euan A. Ashley Michael A. Acker Curtis Langlotz William Hiesinger 10.1038/s41551-026-01637-3 http://arxiv.org/abs/2603.19995v2 Goal-Oriented Framework for Optical Flow-based Multi-User Multi-Task Video Transmission 2026-03-25T13:39:58Z

Efficient multi-user multi-task video transmission is an important research topic within the realm of current wireless communication systems. To reduce the transmission burden and save communication resources, we propose a goal-oriented semantic communication framework for optical flow-based multi-user multi-task video transmission (OF-GSC). At the transmitter, we design a semantic encoder that consists of a motion extractor and a patch-level optical flow-based semantic representation extractor to effectively identify and select important semantic representations. At the receiver, we design a transformer-based semantic decoder for high-quality video reconstruction and video classification tasks. To minimize the communication time, we develop a deep deterministic policy gradient (DDPG)-based bandwidth allocation algorithm for multi-user transmission. For video reconstruction tasks, our OF-GSC framework achieves a significant improvement in the received video quality, as evidenced by a 13.47% increase in the structural similarity index measure (SSIM) score in comparison to DeepJSCC. For video classification tasks, OF-GSC achieves a Top-1 accuracy slightly surpassing the performance of VideoMAE with only 25% of the required data under the same mask ratio of 0.3. For bandwidth allocation optimization, our DDPG-based algorithm reduces the maximum transmission time by 25.97% compared with the baseline equal-bandwidth allocation scheme.

2026-03-20T14:44:54Z Yujie Xu Shutong Chen Nan Li Yansha Deng Jinhong Yuan Robert Schober http://arxiv.org/abs/2306.17466v5 MedAugment: Universal Automatic Data Augmentation Plug-in for Medical Image Analysis 2026-03-25T13:03:58Z

Data augmentation (DA) has been widely leveraged in computer vision to alleviate data shortage, while its application in medical imaging faces multiple challenges. The prevalent DA approaches in medical image analysis encompass conventional DA, synthetic DA, and automatic DA. However, these approaches may result in experience-driven design and intensive computation costs. Here, we propose a suitable yet general automatic DA method for medical images termed MedAugment. We propose pixel and spatial augmentation spaces and exclude the operations that can break medical details and features. Besides, we propose a sampling strategy that samples a limited number of operations from the two spaces. Moreover, we present a hyperparameter mapping relationship to produce a rational augmentation level and make MedAugment fully controllable using a single hyperparameter. These configurations settle the differences between natural and medical images. Extensive experimental results on four classification and four segmentation datasets demonstrate the superiority of MedAugment. Compared with existing approaches, the proposed MedAugment avoids producing color distortions or structural alterations while involving negligible computational overhead. Our method can serve as a plugin without an extra training stage, offering significant benefits to the community and medical experts lacking a deep learning foundation. The code is available at https://github.com/NUS-Tim/MedAugment.

2023-06-30T08:22:48Z Knowledge-Based Systems Accepted Zhaoshan Liu Qiujie Lv Yifan Li Ziduo Yang Lei Shen 10.1016/j.knosys.2026.115828 http://arxiv.org/abs/2603.24109v1 Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series 2026-03-25T09:14:42Z

Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual-form mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.
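To make the "dual-form" idea concrete, the following is a minimal sketch of a retention-style mechanism whose decay depends on the gap in acquisition dates rather than on the token index. It assumes an unnormalized, single-head formulation with a scalar decay `gamma`. All names and values are hypothetical, and this is not the authors' implementation. The point it illustrates is that the quadratic form (suited to parallel training) and the recurrent form (suited to incremental inference) produce the same outputs.

```python
def quadratic_form(q, k, v, days, gamma):
    """Attention-like form: each output sums over all earlier acquisitions."""
    outputs = []
    for t in range(len(q)):
        acc = [0.0] * len(v[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q[t], k[s]))
            decay = gamma ** (days[t] - days[s])  # gap in days, not in index
            for j, value in enumerate(v[s]):
                acc[j] += score * decay * value
        outputs.append(acc)
    return outputs


def recurrent_form(q, k, v, days, gamma):
    """Recurrent form: a fixed-size state is updated once per new acquisition."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs, prev_day = [], days[0]
    for t in range(len(q)):
        decay = gamma ** (days[t] - prev_day)  # decay over the elapsed time
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k[t][i] * v[t][j]
        outputs.append(
            [sum(q[t][i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
        prev_day = days[t]
    return outputs


# Toy sequence with irregular acquisition dates (illustrative values only).
q = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9]]
k = [[0.3, 0.8], [-0.5, 0.2], [0.6, 0.6], [0.1, -0.7]]
v = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.3], [-0.2, 0.4, 0.9], [0.8, 0.8, -0.5]]
days = [0, 5, 12, 17]
gamma = 0.95

out_quadratic = quadratic_form(q, k, v, days, gamma)
out_recurrent = recurrent_form(q, k, v, days, gamma)
max_gap = max(
    abs(a - b)
    for row_q, row_r in zip(out_quadratic, out_recurrent)
    for a, b in zip(row_q, row_r)
)
print(f"largest difference between the two forms: {max_gap:.2e}")
```

In the recurrent form, handling a new acquisition costs one state update instead of a pass over the whole sequence, which is the property the abstract highlights for regular large-area monitoring.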

2026-03-25T09:14:42Z Iris Dumeur CB Jérémy Anger CB Gabriele Facciolo CB http://arxiv.org/abs/2603.24633v1 Coronary artery calcification assessment in National Lung Screening Trial CT images (DeepCAC2) 2026-03-25T08:46:18Z

Coronary artery calcification (CAC) is a strong predictor of cardiovascular risk but remains underutilized in routine clinical thoracic imaging due to the need for dedicated imaging protocols and manual annotation. We present DeepCAC2, a publicly available dataset containing automated CAC segmentations, coronary artery calcium scores, and derived risk categories generated from low-dose chest CT scans of the National Lung Screening Trial (NLST). Using a fully automated deep learning pipeline trained on expert-annotated cardiac CT data, we processed 127,776 CT scans from 26,228 individuals and generated standardized CAC segmentations and risk estimates for each acquisition. A public dashboard is already available as a simple tool for visually inspecting a random subset of 200 NLST patients from the dataset. The dataset will be released with DICOM-compatible segmentation objects and structured metadata to support reproducible downstream analysis. The deep learning pipeline will be made publicly available as a DICOM-compatible MHub.ai container. DeepCAC2 provides a transparent, large-scale, public, fully reproducible resource for research in cardiovascular risk assessment, opportunistic screening, and imaging biomarker development.

2026-03-25T08:46:18Z Leonard Nürnberg Simon Bernatz Borek Foldyna Michael T. Lu Andrey Fedorov Hugo JWL Aerts http://arxiv.org/abs/2512.11715v2 EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing 2026-03-25T08:44:56Z

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

2025-12-12T16:51:19Z Wei Chow Linfeng Li Lingdong Kong Zefeng Li Qi Xu Hang Song Tian Ye Xian Wang Jinbin Bai Shilin Xu Xiangtai Li Junting Pan Shaoteng Liu Ran Zhou Tianshu Yang Songhua Liu http://arxiv.org/abs/2603.24026v1 Blind Quality Enhancement for G-PCC Compressed Dynamic Point Clouds 2026-03-25T07:39:25Z

Point cloud compression often introduces noticeable reconstruction artifacts, which makes quality enhancement necessary. Existing approaches typically assume prior knowledge of the distortion level and train multiple models with identical architectures, each designed for a specific distortion setting. This significantly limits their practical applicability in scenarios where the distortion level is unknown and computational resources are limited. To overcome these limitations, we propose the first blind quality enhancement (BQE) model for compressed dynamic point clouds. BQE enhances compressed point clouds under unknown distortion levels by exploiting temporal dependencies and jointly modeling feature similarity and differences across multiple distortion levels. It consists of a joint progressive feature extraction branch and an adaptive feature fusion branch. In the joint progressive feature extraction branch, consecutive reconstructed frames are first fed into a recoloring-based motion compensation module to generate temporally aligned virtual reference frames. These frames are then fused by a temporal correlation-guided cross-attention module and processed by a progressive feature extraction module to obtain hierarchical features at different distortion levels. In the adaptive feature fusion branch, the current reconstructed frame is input to a quality estimation module to predict a weighting distribution that guides the adaptive weighted fusion of these hierarchical features. When applied to the latest geometry-based point cloud compression (G-PCC) reference software, i.e., test model category13 version 28, BQE achieved average PSNR improvements of 0.535 dB, 0.403 dB, and 0.453 dB, with BD-rates of -17.4%, -20.5%, and -20.1% for the Luma, Cb, and Cr components, respectively.

2026-03-25T07:39:25Z Tian Guo Hui Yuan Chang Sun Wei Zhang Raouf Hamzaoui Sam Kwong http://arxiv.org/abs/2603.26785v1 Beyond Benchmarks: A Framework for Post Deployment Validation of CT Lung Nodule Detection AI 2026-03-25T06:35:54Z

Background: Artificial intelligence (AI)-assisted lung nodule detection systems are increasingly deployed in clinical settings without site-specific validation. Performance reported under benchmark conditions may not reflect real-world behavior when acquisition parameters differ from training data. Purpose: To propose and demonstrate a physics-guided framework for evaluating the sensitivity of a deployed lung nodule detection model to systematic variation in CT acquisition parameters. Methods: Twenty-one cases from the publicly available LIDC-IDRI dataset were evaluated using a MONAI RetinaNet model pretrained on LUNA16 (fold 0, no fine-tuning). Five imaging conditions were tested: baseline, 25% dose reduction, 50% dose reduction, 3 mm slice thickness, and 5 mm slice thickness. Dose reduction was simulated via image-domain Gaussian noise; slice thickness via a moving average along the z-axis. Detection sensitivity was computed at a confidence threshold of 0.5 with a 15 mm matching criterion. Results: Baseline sensitivity was 45.2% (57/126 consensus nodules). Dose reduction produced slight degradation: 41.3% at 25% dose reduction and 42.1% at 50% dose reduction. The 5 mm slice thickness condition produced a marked drop to 26.2%, a 19-percentage-point reduction representing a 42% relative decrease from baseline. This finding was consistent across confidence thresholds from 0.1 to 0.9. Per-case analysis revealed heterogeneous performance, including two cases with complete detection failure at baseline. Conclusion: Slice thickness represents a more fundamental constraint on AI detection performance than image noise under the conditions tested. The proposed framework is reproducible, requires no proprietary scanner data, and is designed to serve as the basis for ongoing post-deployment QA in resource-constrained environments.
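The reported sensitivity figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the abstract (57 of 126 nodules at baseline, 26.2% at 5 mm slice thickness). The function and variable names are illustrative.

```python
def sensitivity(detected, total):
    """Detection sensitivity as a percentage of reference nodules found."""
    return 100.0 * detected / total


baseline = sensitivity(57, 126)  # reported as 45.2%
thick_5mm = 26.2                 # reported sensitivity at 5 mm slice thickness

# Absolute change is measured in percentage points, relative change in percent.
absolute_drop = baseline - thick_5mm
relative_drop = 100.0 * absolute_drop / baseline

print(f"baseline sensitivity: {baseline:.1f}%")
print(f"absolute drop: {absolute_drop:.1f} percentage points")
print(f"relative drop: {relative_drop:.1f}%")
```

Keeping the two measures separate matters here: the same change reads as 19 percentage points in absolute terms and as roughly 42% relative to the baseline.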

2026-03-25T06:35:54Z Daniel Soliman http://arxiv.org/abs/2603.23965v1 MonoSIM: An open source SIL framework for Ackermann Vehicular Systems with Monocular Vision 2026-03-25T05:59:30Z

This paper presents an open-source Software-in-the-Loop (SIL) simulation platform designed for autonomous Ackermann vehicle research and education. The proposed framework focuses on simplicity, while making it easy to work with small-scale experimental setups, such as the XTENTH-CAR platform. The system was designed using open-source tools, creating an environment in which a monocular camera vision system captures visual input with minimal computational overhead through a sliding-window-based lane detection method. The platform supports a flexible algorithm testing and validation environment, allowing researchers to implement and compare various control strategies within an easy-to-use virtual environment. To validate the platform, Model Predictive Control (MPC) and Proportional-Integral-Derivative (PID) algorithms were implemented within the SIL framework. The results confirm that the platform provides a reliable environment for algorithm verification, making it an ideal tool for future multi-agent system research, educational purposes, and low-cost AGV development. Our code is available at https://github.com/shantanu404/monosim.git.
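As an illustration of the PID controller mentioned above, here is a minimal textbook sketch that steers a toy vehicle back to the lane centre. The gains, the time step, and the one-line vehicle model (the steering command directly sets the lateral velocity) are assumptions made for this example. None of it is taken from the MonoSIM code base.

```python
class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the control output for the current error sample."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample on the first call
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy closed loop: the lateral offset from the lane centre is driven to zero.
controller = PID(kp=2.0, ki=0.5, kd=0.1)  # hypothetical gains
dt = 0.05                                 # seconds per simulation step
offset = 0.5                              # metres from the lane centre
for _ in range(400):
    error = 0.0 - offset                  # target is the lane centre
    command = controller.step(error, dt)
    offset += command * dt                # simplified lateral dynamics

print(f"final lateral offset: {offset:.4f} m")
```

In a lane-keeping setup, the error would come from the lane detector (for example, the distance between the image centre and the detected lane centre) rather than from a simulated state variable.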

2026-03-25T05:59:30Z 6 pages, 16 figures, Published in "IEEE 12th International Conference on Automation, Robotics and Application 2026" Shantanu Rahman Nayeb Hasin Mainul Islam Md. Zubair Alom Rony Golam Sarowar http://arxiv.org/abs/2603.23869v1 Joint Source-Channel-Check Coding with HARQ for Reliable Semantic Communications 2026-03-25T02:51:46Z

Semantic communication has emerged as a promising paradigm for improving transmission efficiency and task-level reliability, yet most existing reliability-enhancement approaches rely on retransmission strategies driven by semantic fidelity checking that require additional check codewords solely for retransmission triggering, thereby incurring substantial communication overhead. In this paper, we propose S3CHARQ, a Joint Source-Channel-Check Coding framework with hybrid automatic repeat request that fundamentally rethinks the role of check codewords in semantic communications. By integrating the check codeword into the JSCC process, S3CHARQ enables JS3C, allowing the check codeword to simultaneously support semantic fidelity verification and reconstruction enhancement. At the transmitter, a semantic fidelity-aware check encoder embeds auxiliary reconstruction information into the check codeword. At the receiver, the JSCC and check codewords are jointly decoded by a JS3C decoder, while the check codeword is additionally exploited for perceptual quality estimation. Moreover, because retransmission decisions are necessarily based on imperfect semantic quality estimation in the absence of ground-truth reconstruction, estimation errors are unavoidable and fundamentally limit the effectiveness of rule-based decision schemes. To overcome this limitation, we develop a reinforcement learning-based retransmission decision module that enables adaptive, sample-level retransmission decisions, effectively balancing recovery and refinement information under dynamic channel conditions. Experimental results demonstrate that compared with existing HARQ-based semantic communication systems, the proposed S3CHARQ framework achieves a 2.36 dB improvement in the 97th percentile PSNR, as well as a 37.45% reduction in outage probability.

2026-03-25T02:51:46Z 13 pages, 12 figures, Boyuan Li Shuoyao Wang Suzhi Bi Liping Qian Yunlong Cai http://arxiv.org/abs/2603.23779v1 Sentinel-2 for Crop Yield Estimation: A Systematic Review 2026-03-24T23:24:23Z

Accurate and timely crop yield estimation is critical for global food security, agricultural policy, and farm management. The Copernicus Sentinel-2 satellite constellation, with high spatial, temporal, and spectral resolution, has transformed agricultural monitoring by enabling field- and sub-field-scale analysis. This review synthesizes recent advances in Sentinel-2-based crop yield estimation. A key trend is the shift from regional models to high-resolution field-level assessments driven by three main approaches: (i) empirical models using vegetation indices combined with machine and deep learning methods such as Random Forest and Convolutional Neural Networks; (ii) integration of process-based crop growth models (e.g., WOFOST, SAFY) via data assimilation of Sentinel-2-derived variables like Leaf Area Index (LAI); and (iii) data fusion techniques combining Sentinel-2 optical data with Sentinel-1 SAR to mitigate cloud-related limitations. The review shows that machine learning, deep learning, and hybrid modeling frameworks can explain substantial within-field yield variability across crops and regions. However, performance remains constrained by limited ground-truth data, cloud-induced gaps, and challenges in model transferability across years and locations. Future directions include tighter integration of multi-modal data and improved in-season observations to support robust, operational decision-making in precision agriculture and sustainable intensification.
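The abstract refers to vegetation indices without singling one out. As one common example, the Normalized Difference Vegetation Index (NDVI) is computed from the near-infrared and red reflectances, which for Sentinel-2 correspond to bands B8 and B4. The sketch below uses made-up reflectance values, and the function name is illustrative.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero on empty or masked pixels
    return (nir - red) / total


# Hypothetical surface reflectances for a dense canopy and for bare soil.
dense_canopy = ndvi(nir=0.45, red=0.05)  # Sentinel-2 B8 and B4
bare_soil = ndvi(nir=0.25, red=0.20)

print(f"dense canopy NDVI: {dense_canopy:.2f}")
print(f"bare soil NDVI: {bare_soil:.2f}")
```

Index values like these, aggregated per field and per date, are the kind of input that the empirical models in approach (i) would feed to a regressor such as a Random Forest.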

2026-03-24T23:24:23Z 29 pages, 5 figures, review paper Mohammadreza Narimani Alireza Pourreza Ali Moghimi Parastoo Farajpoor http://arxiv.org/abs/2603.23390v1 Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation 2026-03-24T16:24:19Z

Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%. Code is released at https://github.com/CUHK-AIM-Group/Light-UNETR.

2026-03-24T16:24:19Z Accepted to IEEE TPAMI Xinyu Liu Zhen Chen Wuyang Li Chenxin Li Yixuan Yuan http://arxiv.org/abs/2603.23096v1 Rigid Motion Estimation using Accelerated Iterative Coordinate Descent (REACT) for MR Imaging 2026-03-24T11:44:57Z

Purpose: To develop a computationally viable autofocus method for estimating 3D rigid motion in MR imaging. Theory and Methods: The proposed method, REACT, assumes a piecewise-constant motion trajectory and estimates the rigid motion parameters of individual temporal segments by optimizing an image-quality metric. Coordinate descent is adopted to decompose the high-dimensional optimization problem into a series of subproblems, each updating the motion parameters of a single temporal segment. The cost function of each subproblem is assumed to be approximately locally convex under suitable acquisition conditions. Each subproblem is then solved using a derivative-free solver, thereby avoiding an exhaustive grid search. Numerical simulations were conducted to investigate the local convexity assumption. REACT was evaluated for respiratory motion correction on in vivo free-breathing coronary MR angiography datasets acquired using a 3D cones trajectory with image-based navigators (iNAVs). An autofocus nonrigid motion correction method was also evaluated for comparison. Coronary artery sharpness was quantified using unbounded image edge profile acutance (u-IEPA). Results: In numerical simulations, the objective surfaces of the subproblems were approximately locally convex when the current motion estimate was close to the desired solution. In the in vivo study, REACT yielded higher u-IEPA than the conventional iNAV-based translational motion-estimation method for both the left anterior descending artery (LAD) and right coronary artery. REACT also yielded higher u-IEPA for the LAD than the autofocus nonrigid motion correction method. Conclusion: This study demonstrates the feasibility of coordinate descent for autofocus motion correction in MR imaging.
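The optimization strategy described above (coordinate descent with a derivative-free solver for each subproblem) can be sketched on a toy problem. This example simplifies each subproblem to a single scalar, uses golden-section search as the derivative-free solver, and replaces the image-quality metric with a convex quadratic. All three choices are assumptions for illustration and not the REACT implementation.

```python
import math


def golden_section_min(f, lo, hi, tol=1e-6):
    """Derivative-free 1D minimizer for a function unimodal on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def coordinate_descent(cost, x0, radius=1.0, sweeps=20):
    """Minimize cost by updating one coordinate at a time."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            def along_axis(value, i=i):
                # Cost as a function of coordinate i, others held fixed.
                return cost(x[:i] + [value] + x[i + 1:])
            # Search only near the current estimate (local convexity assumed).
            x[i] = golden_section_min(along_axis, x[i] - radius, x[i] + radius)
    return x


def toy_cost(p):
    """Convex stand-in for an image-quality metric, with coupled parameters."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2 + 0.5 * p[0] * p[1]


estimate = coordinate_descent(toy_cost, [0.0, 0.0])
print(f"estimated parameters: {estimate[0]:.3f}, {estimate[1]:.3f}")
```

Restricting each one-dimensional search to a window around the current estimate mirrors the abstract's observation that the objective is approximately convex only when the current motion estimate is close to the desired solution.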

2026-03-24T11:44:57Z 14 pages, 7 figures, submitted to MRM Kwang Eun Jang Dwight G. Nishimura