https://arxiv.org/api/3oluN/kV/LVt9+osDJdXFOxvM9U 2026-04-09T11:11:23Z 30498 120 15 http://arxiv.org/abs/2603.27263v1 DeepBayesFlow: A Bayesian Structured Variational Framework for Generalizable Prostate Segmentation via Expressive Posteriors and SDE-Girsanov Uncertainty Modeling 2026-03-28T13:01:43Z

Automatic prostate MRI segmentation faces persistent challenges due to inter-patient anatomical variability, blurred tissue boundaries, and distribution shifts arising from diverse imaging protocols. To address these issues, we propose DeepBayesFlow, a novel Bayesian segmentation framework designed to enhance both robustness and generalization across clinical domains. DeepBayesFlow introduces three key innovations: a learnable NF-Posterior module based on normalizing flows that models complex, data-adaptive latent distributions; a NCVI inference mechanism that removes conjugacy constraints to enable flexible posterior learning in high-dimensional settings; and a SDE-Girsanov module that refines latent representations via time-continuous diffusion and formal measure transformation, injecting temporal coherence and physically grounded uncertainty into the inference process. Together, these components allow DeepBayesFlow to capture domain-invariant structural priors while dynamically adapting to domain-specific variations, achieving accurate and interpretable segmentation across heterogeneous prostate MRI datasets.

2026-03-28T13:01:43Z Zhuoyi Fang http://arxiv.org/abs/2603.27261v1 MD-RWKV-UNet: Scale-Aware Anatomical Encoding with Cross-Stage Fusion for Multi-Organ Segmentation 2026-03-28T12:53:50Z

Multi-organ segmentation in medical imaging remains challenging due to large anatomical variability, complex inter-organ dependencies, and diverse organ scales and shapes. Conventional encoder-decoder architectures often struggle to capture both fine-grained local details and long-range context, which are crucial for accurate delineation - especially for small or deformable organs. To address these limitations, we propose MD-RWKV-UNet, a dynamic encoder network that enables scale-aware representation and spatially adaptive context modeling. At its core is the MD-RWKV block, a dual-path module that integrates deformable spatial shifts with the Receptance Weighted Key Value mechanism, allowing the receptive field to adapt dynamically to local structural cues. We further incorporate Selective Kernel Attention to enable adaptive selection of convolutional kernels with varying receptive fields, enhancing multi-scale interaction and improving robustness to organ size and shape variation. In parallel, a cross-stage dual-attention fusion strategy aggregates multi-level features across the encoder, preserving low-level structure while enhancing semantic consistency. Unlike methods that stack static convolutions or rely heavily on global attention, our approach provides a lightweight yet expressive solution for dynamic organ modeling. Experiments on Synapse and ACDC demonstrate state-of-the-art performance, particularly in boundary precision and small-organ segmentation.

2026-03-28T12:53:50Z Zhuoyi Fang http://arxiv.org/abs/2603.27118v1 Quantitative measurements of biological/chemical concentrations using smartphone cameras 2026-03-28T04:16:04Z

This paper presents a smartphone-based imaging system capable of quantifying the concentration of an assortment of biological/chemical assay samples. The main objective is to construct an image database which characterizes the relationship between color information and concentrations of the biological/chemical assay sample. For this aim, a designated optical setup combined with image processing and data analyzing techniques was implemented. A series of experiments conducted on selected assays, including fluorescein, RNA Mango, homogenized milk and yeast have demonstrated that the proposed system estimates the concentration of fluorescent materials and colloidal mixtures comparable to currently used commercial and laboratory instruments. Furthermore, by utilizing the camera and computational power of smartphones, eventual development can be directed toward extremely compact, inexpensive and portable analysis and diagnostic systems which will allow experiments and tests to be conducted in remote or impoverished areas.

2026-03-28T04:16:04Z Zhendong Cao Hongji Dai Zhida Li Ash Parameswaran http://arxiv.org/abs/2602.07029v2 Guidestar-Free Adaptive Optics with Asymmetric Apertures 2026-03-28T00:47:01Z

This work introduces the first closed-loop adaptive optics (AO) system capable of optically correcting aberrations in real-time without a guidestar or a wavefront sensor. Nearly 40 years ago, Cederquist et al. demonstrated that asymmetric apertures enable phase retrieval (PR) algorithms to perform fully computational wavefront sensing, albeit at a high computational cost. More recently, Chimitt et al. extended this approach with machine learning and demonstrated real-time wavefront sensing using only a single (guidestar-based) point-spread-function (PSF) measurement. Inspired by these works, we introduce a guidestar-free AO framework built around asymmetric apertures and machine learning. Our approach combines three key elements: (1) an asymmetric aperture placed at the system's pupil plane that enables PR-based wavefront sensing, (2) a pair of machine learning algorithms that estimate the PSF from natural scene measurements and reconstruct phase aberrations, and (3) a spatial light modulator that performs optical correction. We experimentally validate this framework on dense natural scenes imaged through unknown obscurants. Our method outperforms state-of-the-art guidestar-free wavefront shaping methods, using an order of magnitude fewer measurements and three orders of magnitude less computation.

2026-02-02T22:52:42Z Accepted to ACM Transactions on Graphics (TOG) Weiyun Jiang Haiyun Guo Christopher A. Metzler Ashok Veeraraghavan http://arxiv.org/abs/2603.27018v1 On-Device Super Resolution Imaging Using Low-Cost SPAD Array and Embedded Lightweight Deep Learning 2026-03-27T22:15:20Z

This work presents a lightweight super-resolution (LiteSR) neural network for depth and intensity images acquired from a consumer-grade single-photon avalanche diode (SPAD) array with a 48x32 spatial resolution. The proposed framework reconstructs high-resolution (HR) images of size 256x256. Both synthetic and real datasets are used for performance evaluation. Extensive quantitative metrics demonstrate high reconstruction fidelity on synthetic datasets, while experiments on real indoor and outdoor measurements further confirm the robustness of the proposed approach. Moreover, the SPAD sensor is interfaced with an Arduino UNO Q microcontroller, which receives low-resolution (LR) depth and intensity images and feeds them into a compressed, pre-trained deep learning (DL) model, enabling real-time SR video streaming. In addition to the 256x256 setting, a range of target HR resolutions is evaluated to determine the maximum achievable upscaling resolution (512x512) with LiteSR, including scenarios with noise-corrupted LR inputs. The proposed LiteSR-embedded system co-design provides a scalable, cost-effective solution to enhance the spatial resolution of current consumer-grade SPAD arrays to meet HR imaging requirements.

2026-03-27T22:15:20Z Zhenya Zang Xingda Li David Day Uei Li http://arxiv.org/abs/2507.03745v4 StreamDiT: Real-Time Streaming Text-to-Video Generation 2026-03-27T15:59:42Z

Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples in our project website: https://cumulo-autumn.github.io/StreamDiT/

2025-07-04T18:00:01Z CVPR 2026 Akio Kodaira Tingbo Hou Ji Hou Markos Georgopoulos Felix Juefei-Xu Masayoshi Tomizuka Yue Zhao http://arxiv.org/abs/2603.26859v1 Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation 2026-03-27T15:25:30Z

Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at https://github.com/yds3/IPM-BTK/.

2026-03-27T15:25:30Z Main paper (37 pages). Accepted for publication by the Information Processing and Management,Volume 63,Issue 6,September 2026,104766 Dongsheng Yang Yinfeng Yu Liejun Wang http://arxiv.org/abs/2603.26393v1 Adapting Frozen Mono-modal Backbones for Multi-modal Registration via Contrast-Agnostic Instance Optimization 2026-03-27T13:21:19Z

Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Meanwhile, the naive fine-tuning struggles more with potential degradation in performance in the existence of drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained \textbf{mono-modal} registration model with a lightweight adaptation pipeline for \textbf{multi-modal} image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model, thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and achieves fourth place overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.

2026-03-27T13:21:19Z MICCAI Learn2Reg Challenge Yi Zhang Yidong Zhao Qian Tao http://arxiv.org/abs/2603.26387v1 Rethinking Feature Conditioning for Robust Forged Media Detection in Edge AI Sensing Systems 2026-03-27T13:14:54Z

Generalization under manipulation and dataset shift remains a core challenge in forged media detection for AI-driven edge sensing systems. Frozen vision foundation models with linear probes are strong baselines, but most pipelines use default backbone outputs without testing conditioning at the frozen feature interface. We present the first controlled probing study on DINOv3 ConvNeXt and show that, without task-specific fine-tuning, linear probing alone yields competitive forged-media detection performance, indicating that ViT-7B self-supervised distillation transfers to security-critical vision workloads at edge-compatible inference cost. Backbone, head, data, and optimization are fixed while conditioning is varied; LN-Affine, the default ConvNeXt head output, is the natural baseline. On FaceForensics++ c23, five conditioning variants are evaluated under in-distribution testing, leave-one-manipulation-out (LOMO), and cross-dataset transfer to Celeb-DF v2 and DeepFakeDetection. In ConvNeXt-Tiny, conditioning alone changes LOMO mean AUC by 6.1 points and reverses ID-vs-OOD ranking: LN-Affine is strongest on external datasets, while LayerNorm is strongest in-distribution. In ConvNeXt-Base replication, the OOD winner becomes protocol-dependent, and ID-optimal selection still fails as a robust deployment rule. Results show that feature conditioning is a first-order design variable and should be selected with robustness-oriented validation, not ID accuracy alone.

2026-03-27T13:14:54Z Izaldein Al-Zyoud Abdulmotaleb El Saddik http://arxiv.org/abs/2601.01200v2 MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity 2026-03-27T10:59:40Z

The unstructured and irregular nature of points poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes radial basis function (RBF) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from traditional flat multi-layer perceptron (MLP) by adopting a grouped encoding strategy integrated with residual blocks and channel-wise attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.

2026-01-03T14:58:52Z Zhang Chen Shuai Wan Yuezhe Zhang Siyu Ren Fuzheng Yang Junhui Hou http://arxiv.org/abs/2603.26844v1 Uncertainty-Aware Mapping from 3D Keypoints to Anatomical Landmarks for Markerless Biomechanics 2026-03-27T09:42:23Z

Markerless biomechanics increasingly relies on 3D skeletal keypoints extracted from video, yet downstream biomechanical mappings typically treat these estimates as deterministic, providing no principled mechanism for frame-wise quality control. In this work, we investigate predictive uncertainty as a quantitative measure of confidence for mapping 3D pose keypoints to 3D anatomical landmarks, a critical step preceding inverse kinematics and musculoskeletal analysis. Within a temporal learning framework, we model both uncertainty arising from observation noise and uncertainty related to model limitations. Using synchronized motion capture ground truth on AMASS, we evaluate uncertainty at frame and joint level through error--uncertainty rank correlation, risk--coverage analysis, and catastrophic outlier detection. Across experiments, uncertainty estimates, particularly those associated with model uncertainty, exhibit a strong monotonic association with landmark error (Spearman $ρ\approx 0.63$), enabling selective retention of reliable frames (error reduced to $\approx 16.8$ mm at 10% coverage) and accurate detection of severe failures (ROC-AUC $\approx 0.92$ for errors $>50$ mm). Reliability ranking remains stable under controlled input degradation, including Gaussian noise and simulated missing joints. In contrast, uncertainty attributable to observation noise provides limited additional benefit in this setting, suggesting that dominant failures in keypoint-to-landmark mapping are driven primarily by model uncertainty. Our results establish predictive uncertainty as a practical, frame-wise tool for automatic quality control in markerless biomechanical pipelines.

2026-03-27T09:42:23Z 7 pages, 1 figure, submitted to Patter Recognition Letters, uncertainty-aware framework for 3D keypoint-to-landmark mapping in markerless biomechanics Cesare Davide Pace Alessandro Marco De Nunzio Claudio De Stefano Francesco Fontanella Mario Molinara http://arxiv.org/abs/2603.26836v1 Reliability-Aware Weighted Multi-Scale Spatio-Temporal Maps for Heart Rate Monitoring 2026-03-27T07:58:34Z

Remote photoplethysmography (rPPG) allows for the contactless estimation of physiological signals from facial videos by analyzing subtle skin color changes. However, rPPG signals are extremely susceptible to illumination changes, motion, shadows, and specular reflections, resulting in low-quality signals in unconstrained environments. To overcome these issues, we present a Reliability-Aware Weighted Multi-Scale Spatio-Temporal (WMST) map that models pixel reliability through the suppression of environmental noises. These noises are modeled using different weighting strategies to focus on more physiologically valid areas. Leveraging the WMST map, we develop an SSL contrastive learning approach based on Swin-Unet, where positive pairs are generated from conventional rPPG signals and temporally expanded WMST maps. Moreover, we introduce a new High-High-High (HHH) wavelet map as a negative example that maintains motion and structural details while filtering out physiological information. Here, our aim is to estimate heart rate (HR), and the experiments on public rPPG benchmarks show that our approach enhances motion and illumination robustness with lower HR estimation error and higher Pearson correlation than existing Self-Supervised Learning (SSL) based rPPG methods.

2026-03-27T07:58:34Z 6 pages, 4 figures. Under review at ICIP 2026 Arpan Bairagi Rakesh Dey Siladittya Manna Umapada Pal http://arxiv.org/abs/2603.26117v1 FINDER: Zero-Shot Field-Integrated Network for Distortion-free EPI Reconstruction in Diffusion MRI 2026-03-27T07:05:46Z

Echo-planar imaging (EPI) remains the cornerstone of diffusion MRI, but it is prone to severe geometric distortions due to its rapid sampling scheme that renders the sequence highly sensitive to $B_{0}$ field inhomogeneities. While deep learning has helped improve MRI reconstruction, integrating robust geometric distortion correction into a self-supervised framework remains an unmet need. To address this, we present FINDER (Field-Integrated Network for Distortion-free EPI Reconstruction), a novel zero-shot, scan-specific framework that reformulates reconstruction as a joint optimization of the underlying image and the $B_{0}$ field map. Specifically, we employ a physics-guided unrolled network that integrates dual-domain denoisers and virtual coil extensions to enforce robust data consistency. This is coupled with an Implicit Neural Representation (INR) conditioned on spatial coordinates and latent image features to model the off-resonance field as a continuous, differentiable function. Employing an alternating minimization strategy, FINDER synergistically updates the reconstruction network and the field map, effectively disentangling susceptibility-induced geometric distortions from anatomical structures. Experimental results demonstrate that FINDER achieves superior geometric fidelity and image quality compared to state-of-the-art baselines, offering a robust solution for high-quality diffusion imaging.

2026-03-27T07:05:46Z 11 pages, 4 figures Namgyu Han Seong Dae Yun Chaeeun Lim Sunghyun Seok Sunju Kim Yoonhwan Kim Yohan Jun Tae Hyung Kim Berkin Bilgic Jaejin Cho http://arxiv.org/abs/2603.03073v2 Context Adaptive Extended Chain Coding for Semantic Map Compression 2026-03-27T05:42:12Z

Semantic maps are increasingly utilized in areas such as robotics, autonomous systems, and extended reality, motivating the investigation of efficient compression methods that preserve structured semantic information. This paper studies lossless compression of semantic maps through a novel chain-coding-based framework that explicitly exploits contour topology and shared boundaries between adjacent semantic regions. We propose an extended chain code (ECC) to represent long-range contour transitions more compactly, while retaining a legacy three-orthogonal chain code (3OT) as a fallback mode for further efficiency. To efficiently encode sequences of ECC symbols, a context-adaptive entropy coding scheme based on Markov modeling is employed. Furthermore, a skip-coding mechanism is introduced to eliminate redundant representations of shared contours between adjacent semantic regions, supporting both complete and partial skips via run-length signaling. Experimental results demonstrate that the proposed method achieves an average bitrate reduction of 18\% compared with a state-of-the-art benchmark on semantic map datasets. In addition, the proposed encoder and decoder achieve up to 98\% and 50\% runtime reduction, respectively, relative to a modern generic lossless codec. Extended evaluations on occupancy maps further confirm consistent compression gains across the majority of tested scenarios. The source code is made publicly available at \url{https://github.com/InterDigitalInc/LosslessSegmentationMapCompression}.

2026-03-03T15:19:07Z 11 pages, 10 figures Runyu Yang Junqi Liao Hyomin Choi Fabien Racapé Ivan V. Bajić http://arxiv.org/abs/2603.26834v1 Hybrid Diffusion Model for Breast Ultrasound Image Augmentation 2026-03-27T05:29:41Z

We propose a hybrid diffusion-based augmentation framework to overcome the critical challenge of ultrasound data augmentation in breast ultrasound (BUS) datasets. Unlike conventional diffusion-based augmentations, our approach improves visual fidelity and preserves ultrasound texture by combining text-to-image generation with image-to-image (img2img) refinement, as well as fine-tuning with low-rank adaptation (LoRA) and textual inversion (TI). Our method generated realistic, class-consistent images on an open-source Kaggle breast ultrasound image dataset (BUSI). Compared to the Stable Diffusion v1.5 baseline, incorporating TI and img2img refinement reduced the Frechet Inception Distance (FID) from 45.97 to 33.29, demonstrating a substantial gain in fidelity while maintaining comparable downstream classification performance. Overall, the proposed framework effectively mitigates the low-fidelity limitations of synthetic ultrasound images and enhances the quality of augmentation for robust diagnostic modeling.

2026-03-27T05:29:41Z Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2026 Farhan Fuad Abir Sanjeda Sara Jennifer Niloofar Yousefi Laura J. Brattain