https://arxiv.org/api/BQA1XQ74eDB8arglqMxi0cMEfkA
2026-03-30T08:43:06Z
304074515

http://arxiv.org/abs/2603.22496v1
Far-field compressive ultrasound beamforming
2026-03-23T19:00:54Z
We present a compressive beamforming method for coherent plane-wave compounding (CPWC) ultrasound imaging based on a far-field decomposition of the received radiofrequency (RF) data into virtual plane waves. This decomposition recasts the imaging operation entirely in the spatial frequency domain ($k$-space), allowing direct and flexible control over $k$-space sampling distributions based on the principle of coarrays. We present vernier-type sampling strategies designed to optimize the tradeoff between image contrast and resolution with minimum redundancy, including strategies that favor dense low-frequency sampling for high contrast, shifted schemes that extend the frequency support for improved resolution, and confocal or hybrid compounding schemes that approximate the spatial-frequency transfer function of conventional DAS beamforming. Our method, called KK beamforming, is validated with a calibration phantom and in-vivo human tissue data, demonstrating compression factors of an order of magnitude while maintaining image qualities comparable to conventional DAS. We further demonstrate that KK beamforming yields improvements in computational speed owing to its reduced memory footprint and more efficient cache utilization of the compressed data and associated look-up tables.
2026-03-23T19:00:54Z
Nikunj Khetan, Jerome Mertz

http://arxiv.org/abs/2603.23297v1
Drop-In Perceptual Optimization for 3D Gaussian Splatting
2026-03-23T17:42:49Z
Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses.
We conduct the first-of-its-kind large-scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD-R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD-R is preferred by raters more than $2.3\times$ over the original 3DGS loss, and $1.5\times$ over the current best method, Perceptual-GS. WD-R also consistently achieves state-of-the-art LPIPS, DISTS, and FID scores across various datasets, and generalizes across recent frameworks, such as Mip-Splatting and Scaffold-GS, where replacing the original loss with WD-R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip-Splatting, model size for Scaffold-GS), and leads to reconstructions being preferred by human raters $1.8\times$ and $3.6\times$, respectively. We also find that this carries over to the task of 3DGS scene compression, with $\approx 50\%$ bitrate savings for comparable perceptual metric performance.
2026-03-23T17:42:49Z
Project page: https://apple.github.io/ml-perceptual-3dgs
Ezgi Ozyilkan, Zhiqi Chen, Oren Rippel, Jona Ballé, Kedar Tatwawadi

http://arxiv.org/abs/2506.09161v3
From Explanations to Architecture: Explainability-Driven CNN Refinement for Brain Tumor Classification in MRI
2026-03-23T16:04:36Z
Recent brain tumor classification methods often report high accuracy but rely on deep, over-parameterized architectures with limited interpretability, making it difficult to determine whether predictions are driven by tumor-relevant evidence or by spurious cues such as background artifacts or normal tissue. We propose an explainable convolutional neural network (CNN) framework that enhances model transparency without sacrificing classification accuracy.
This approach supports more trustworthy AI in healthcare and contributes to SDG 3: Good Health and Well-being by enabling more dependable MRI-based brain tumor diagnosis and earlier detection. Rather than using explainable AI solely for post hoc visualization, we employ Grad-CAM to quantify layer-wise relevance and guide the removal of low-contribution layers, reducing unnecessary depth and parameters while encouraging attention to discriminative tumor regions. We further validate the model's decision rationale using complementary explainability methods, combining Grad-CAM for spatial localization with SHAP and LIME for attribution-based verification. Experiments on multi-class brain MRI datasets show that the proposed model achieves 98.21% accuracy on the primary dataset and 95.74% accuracy on an unseen dataset, indicating strong cross-dataset generalization. Overall, the proposed approach balances simplicity, transparency, and accuracy, supporting more trustworthy and clinically applicable brain tumor classification for improved health outcomes and non-invasive disease detection.
2025-06-10T18:19:56Z
This is the preprint version of the manuscript. It is currently being prepared for submission to an academic conference
Rajan Das Gupta, Md Imrul Hasan Showmick, Lei Wei, Mushfiqur Rahman Abir, Shanjida Akter, Md. Yeasin Rahat, Md. Jakir Hossen

http://arxiv.org/abs/2603.21911v1
A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing
2026-03-23T12:32:09Z
Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data.
The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.
2026-03-23T12:32:09Z
Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt

http://arxiv.org/abs/2603.21891v1
HMS-VesselNet: Hierarchical Multi-Scale Attention Network with Topology-Preserving Loss for Retinal Vessel Segmentation
2026-03-23T12:16:45Z
Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset.
The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.
2026-03-23T12:16:45Z
19 pages, 14 figures, 8 tables
Amarnath R

http://arxiv.org/abs/2511.18493v3
SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation
2026-03-23T10:53:48Z
The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Whole Slide Images (WSIs), due to cellular heterogeneity. Current CNN-Transformer hybrids use static computation graphs with fixed routing. This leads to extra computation and makes it harder to adapt to changes in input. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures via a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Embodied as SAGE with ConvNeXt and Vision Transformer UNet (SAGE-ConvNeXt+ViT-UNet), our model achieves a Dice score of 95.23\% on EBHI, 92.78\%/91.42\% DSC on GlaS Test A/Test B, and 91.26\% DSC at the WSI level on DigestPath, while exhibiting robust generalization under distribution shifts by adaptively balancing local refinement and global context. SAGE establishes a scalable foundation for dynamic expert routing in visual networks, thereby facilitating flexible visual reasoning.
2025-11-23T15:25:36Z
Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Tram Dinh, Thi-Ngoc-Truc Nguyen, Nhat Ho

http://arxiv.org/abs/2603.21786v1
The Universal Normal Embedding
2026-03-23T10:28:14Z
Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles.
Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
2026-03-23T10:28:14Z
Accepted to CVPR 2026
Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

http://arxiv.org/abs/2603.22378v1
Abnormalities and Disease Detection in Gastro-Intestinal Tract Images
2026-03-23T10:13:56Z
Gastrointestinal (GI) tract image analysis plays a crucial role in medical diagnosis. This research addresses the challenge of accurately classifying and segmenting GI images for real-time applications, where traditional methods often struggle due to the diversity and complexity of abnormalities. The high computational demands of this domain require efficient and adaptable solutions.
This PhD thesis presents a multifaceted approach to GI image analysis. Initially, texture-based feature extraction and classification methods were explored, achieving high processing speed (over 4000 FPS) and strong performance (F1-score: 0.76, Accuracy: 0.98) on the Kvasir V2 dataset.
The study then transitions to deep learning, where an optimized model combined with data bagging techniques improved performance, reaching an accuracy of 0.92 and an F1-score of 0.60 on the HyperKvasir dataset, and an F1-score of 0.88 on Kvasir V2.
To support real-time detection, a streamlined neural network integrating texture and local binary patterns was developed. By addressing inter-class similarity and intra-class variation through a learned threshold, the system achieved 41 FPS with high accuracy (0.99) and an F1-score of 0.91 on HyperKvasir.
Additionally, two segmentation tools are proposed to enhance usability, leveraging Depth-Wise Separable Convolution and neural network ensembles for improved detection, particularly in low-FPS scenarios.
Overall, this research introduces novel and adaptable methodologies, progressing from traditional texture-based techniques to deep learning and ensemble approaches, providing a comprehensive framework for advancing GI image analysis.
2026-03-23T10:13:56Z
PhD Thesis
Zeshan Khan, Muhammad Atif Tahir

http://arxiv.org/abs/2603.21760v1
Cycle Inverse-Consistent TransMorph: A Balanced Deep Learning Framework for Brain MRI Registration
2026-03-23T09:53:06Z
Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that the proposed framework (CICTM) achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields. Detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix.
These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.
2026-03-23T09:53:06Z
Jiaqi Shang, Haojin Wu, Yinyi Lai, Zongyu Li, Chenghao Zhang, Jia Guo

http://arxiv.org/abs/2503.11851v3
Interpretable Deep Learning Framework for Improved Disease Classification in Medical Imaging
2026-03-23T09:37:02Z
Deep learning models have gained increasing adoption in medical image analysis. However, these models often produce overconfident predictions, which can compromise clinical accuracy and reliability. Bridging the gap between high performance and uncertainty awareness remains a crucial challenge in biomedical imaging applications. This study focuses on developing a unified deep learning framework for enhancing feature integration, interpretability, and reliability in prediction. We introduce a cross-guided channel spatial attention architecture that fuses feature representations extracted from EfficientNetB4 and ResNet34. A bidirectional attention approach enables the exchange of information across networks with differing receptive fields, enhancing discriminative and contextual feature learning. For quantitative predictive uncertainty assessment, Monte Carlo (MC)-Dropout is integrated with conformal prediction. This provides statistically valid prediction sets with entropy-based uncertainty visualization. The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images. The proposed framework achieved strong classification performance with an AUC of 99.75% for COVID-19, 100% for Tuberculosis, 99.3% for Pneumonia chest X-rays, and 98.69% for retinal OCT images. Uncertainty-aware inference yields calibrated prediction sets with interpretable examples of uncertainty, showing transparency.
The results demonstrate that bidirectional cross-attention with uncertainty quantification can improve performance and transparency in medical image classification.
2025-03-14T20:28:20Z
18 pages, 8 figures, 5 tables
Jutika Borah, Hidam Kumarjit Singh

http://arxiv.org/abs/2603.22371v1
Multimodal Fusion of Skeleton Dynamics and Clinical Gait Features for Video-Based Cerebral Palsy Severity Assessment
2026-03-23T06:47:09Z
Video-based gait analysis has become a promising approach for assessing motor impairment in children with cerebral palsy (CP). However, existing methods usually rely on either pose sequences or handcrafted gait features alone, making it difficult to simultaneously capture spatiotemporal motion patterns and clinically meaningful biomechanical information. To address this gap, we propose a multimodal fusion framework that integrates skeleton dynamics with contribution-guided clinically meaningful gait features. First, Grad-CAM analysis on a pre-trained ST-GCN backbone identified the most discriminative body keypoints, providing an interpretable basis for subsequent gait feature extraction. We then build a dual-stream architecture, with one stream modeling skeleton dynamics using ST-GCN and the other encoding gait features derived from the identified keypoints. Fusing the two streams through feature cross-attention improved four-level CP motor severity classification to 70.86%, outperforming the baseline by 5.6 percentage points.
Overall, this work suggests that integrating skeleton dynamics with clinically meaningful gait descriptors can improve both prediction performance and biomechanical interpretability for video-based CP severity assessment.
2026-03-23T06:47:09Z
Kaiyuan Yang, Xupeng Chen, Jiangpeng He

http://arxiv.org/abs/2603.21510v1
Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability
2026-03-23T02:55:16Z
This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models -- providing, to the best of our knowledge, the first such insights for unregistered HMF.
The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.
2026-03-23T02:55:16Z
Jiahui Song, Sagar Shrestha, Xiao Fu

http://arxiv.org/abs/2410.01591v3
Imaging foundation model for universal enhancement of non-ideal measurement CT
2026-03-22T15:42:22Z
Non-ideal measurement computed tomography (NICT) employs suboptimal imaging protocols to expand CT applications. However, the resulting trade-offs degrade image quality, limiting clinical acceptability. Although deep learning methods have been used to enhance NICT images, their reliance on large training datasets and limited generalizability across diverse settings hinder practical use. We propose the multi-scale integrated Transformer AMPlifier (TAMP), the first imaging foundation model for universal NICT enhancement. Pre-trained on 10.8 million physics-driven simulated NICT images, TAMP generalizes effectively across various NICT settings, defect degrees, and body regions. Moreover, a parameter-efficient fine-tuning strategy enables TAMP to adapt to specific clinical scenarios using only a few slices. Extensive experiments, including radiologists and real-world validations, demonstrate that TAMP consistently improves image quality and clinical acceptability, underscoring its significant potential to advance CT imaging and broaden NICT applications in clinical practice.
2024-10-02T14:25:02Z
This paper has been accepted by Nature Communications
Rongjun Ge, Yuxin Liu, Zhan Wu, Shangwen Yang, Yuan Gao, Chenyu You, Ge Wang, Shuo Li, Yuting He, Yang Chen

http://arxiv.org/abs/2603.20999v1
OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields
2026-03-22T01:16:40Z
Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels.
While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7\% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines ($\sim$98.5\%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
2026-03-22T01:16:40Z
Aizierjiang Aiersilan, Zhangfei Yang

http://arxiv.org/abs/2601.08758v3
M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
2026-03-21T15:08:32Z
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes.
However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such opaque reasoning processes lack reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.
2026-01-13T17:42:27Z
39 pages, 8 figures; accepted by ICLR 2026
Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan