http://arxiv.org/abs/2607.13812v1 TCAM-Diff: Triplane-Aware Cross-Attention Medical Diffusion Model 2026-07-15T13:23:01Z

We introduce TCAM-Diff, a novel 3D medical image generation model that reduces the memory required to encode and generate high-resolution 3D data. The model uses a decoder-only autoencoder to learn a triplane representation from a dense volume and applies generalization operations to prevent overfitting. A triplane-aware cross-attention diffusion model then learns and integrates these features, and a pre-trained decoder module rapidly transforms the features generated by the diffusion model into 3D volumes. We run experiments on medical datasets at three scales: BrainTumour (128 x 128 x 128), Pancreas (256 x 256 x 256), and Colon (512 x 512 x 512). We use MSE and SSIM to assess reconstruction quality and a Wasserstein Generative Adversarial Network (W-GAN) critic to assess generative quality. Comparisons with existing approaches show that our method gives better reconstruction and generation results than other encoder-decoder methods with similar-sized latent spaces.

2026-07-15T13:23:01Z Accepted at AAAI 2025. Code is available at https://github.com/Fredy-Zhang/TCAM-Diff Proceedings of the AAAI Conference on Artificial Intelligence, 39(21): 22732-22740, 2025 Zhenkai Zhang Krista A. Ehinger Tom Drummond 10.1609/aaai.v39i21.34433 http://arxiv.org/abs/2606.20763v2 From Sparse X-rays to 3D CT: Training-Free Reconstruction with Diffusion Priors 2026-07-15T12:42:38Z

Solving 3D medical inverse problems typically requires training dedicated supervised models for each specific task and measurement setting. To break this dependency, we present TF-PRDiT: a training-free conditional sampling framework that converts a frozen voxel-level 3D Diffusion Transformer prior into a versatile inverse medical problem solver. Building on the posterior-sampling view of diffusion inverse solvers, TF-PRDiT enforces measurement consistency during sampling via a task-specific forward operator rather than updating model weights, enabling a single pretrained prior to be reused across diverse conditional settings. Our method combines a predictor-corrector sampler with likelihood-based guidance on the denoised prediction, providing stable data-fidelity correction while preserving the underlying 3D anatomical prior. We highlight our framework's capability on the challenging task of X-ray-to-CT reconstruction by integrating a differentiable DRR projector to allow gradients to propagate directly from projection space back to voxels without any retraining. Experiments on LIDC-IDRI demonstrate that TF-PRDiT achieves strong reconstruction quality and uniquely scales to an arbitrary number of input X-rays (1-12) under a unified model, with performance improving consistently as additional views are provided. Beyond X-ray-to-CT, we show that simply swapping the forward operator extends the same frozen model to 3D super-resolution, volumetric infilling, and deblurring without any task-specific retraining, demonstrating that a single 3D diffusion prior can serve as a universal solver for volumetric medical inverse problems.

2026-06-18T11:51:06Z Accepted at DGM4MICCAI 2026. Code available at https://github.com/Fredy-Zhang/TF-PRDiT Zhenkai Zhang Markus Hiller Krista A. Ehinger Tom Drummond http://arxiv.org/abs/2607.13723v1 N-O Cool-chic: reconcile fast encoding with lightweight decoding for neural image compression 2026-07-15T11:40:24Z

Overfitted image codecs achieve strong compression performance and low decoder complexity by learning a lightweight decoder for each image. Such codecs include Cool-chic, which offers image coding performance on par with VVC while requiring around 2000 multiplications per decoded pixel. However, the encoding time of overfitted codecs may be prohibitively long for real-time applications. To address this issue, this paper proposes to decrease the encoding complexity of Cool-chic by bypassing the overfitting procedure and complementing the decoder with an encoder network. The proposed non-overfitted (N-O) Cool-chic reduces encoding complexity by a factor of 1000 compared to Cool-chic while maintaining competitive performance.

2026-07-15T11:40:24Z Presented at CORESA 2024 Théophile Blard Théo Ladune Pierrick Philippe Xiaoran Jiang Olivier Déforges http://arxiv.org/abs/2607.12501v2 What Does Goodness Measure? A Likelihood-Ratio Account of Forward-Forward Learning 2026-07-15T09:07:05Z

The Forward-Forward (FF) algorithm trains each layer locally, so that a scalar goodness - the sum of squared activations - is high on real inputs and low on contrastive ones, with activations normalized between layers. Both choices are usually treated as heuristics. Under an explicit generative model they are not: the squared goodness is the sufficient statistic of a likelihood-ratio test between two zero-mean populations differing in scale, and the FF threshold is its boundary. It generalizes: anisotropic populations yield a Mahalanobis goodness, the plain square being its isotropic case; heavy-tailed populations yield a saturating statistic whose slope is a posterior precision - divisive normalization - with bounded evidence and an advantage only under aggregation. The same lens characterizes the inter-layer normalization: it must remove the length while preserving per-coordinate energy, explaining a depth collapse we observe under unit-norm normalization; and the pairwise objective admits a scale-inflation shortcut that a whitened goodness removes.
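
The likelihood-ratio reading can be made concrete in the simplest case. The sketch below assumes isotropic Gaussian populations, $p_1 = \mathcal{N}(0, \sigma_1^2 I_d)$ for real inputs and $p_0 = \mathcal{N}(0, \sigma_0^2 I_d)$ for contrastive ones with $\sigma_1 > \sigma_0$, and equal priors. The Gaussian form and the symbols $\sigma_0$, $\sigma_1$, $d$, and $\theta$ are assumptions made for illustration; the abstract only states that the populations are zero-mean and differ in scale.

```latex
\log\frac{p_1(x)}{p_0(x)}
  = \frac{1}{2}\left(\frac{1}{\sigma_0^2}-\frac{1}{\sigma_1^2}\right)\lVert x\rVert^2
    - \frac{d}{2}\log\frac{\sigma_1^2}{\sigma_0^2},
\qquad
\log\frac{p_1(x)}{p_0(x)} \ge 0
  \iff
\lVert x\rVert^2 \ge \theta
  = \frac{d\,\log\left(\sigma_1^2/\sigma_0^2\right)}{1/\sigma_0^2 - 1/\sigma_1^2}.
```

Under these assumptions the data enter the test only through $\lVert x\rVert^2 = \sum_i x_i^2$, which is the squared goodness, and the decision boundary is a threshold $\theta$ on that quantity.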

2026-07-14T08:33:33Z Paolo Giannitrapani http://arxiv.org/abs/2607.13601v1 Video to All-in-focus Image Reconstruction Algorithm for Automated Microscopic Urinalysis 2026-07-15T08:48:19Z

Microscopic urinalysis is a routine diagnostic test in hospitals. Recent studies have demonstrated the effectiveness of deep learning methods for automating microscopic urinalysis. These methods rely on high-quality images of the urine samples in which each cell is clearly identifiable. In practice, however, a urine sample on a glass slide has a multi-layer structure, so not all cells are clearly visible within the depth of field of a lens focused at a particular focal plane. Correctly identifying each cell in a given sample therefore requires acquiring multiple images at different focal planes, which is time-consuming. In this paper, we propose to simplify the task by recording a video, instead of acquiring multiple images, while manually and gradually changing the focus of the lens. A typical video lasts 2 to 14 seconds. We reconstruct an all-in-focus image from the recorded video frames and apply a deep learning model to detect and classify urine sediments. As a proof of concept, we conduct experiments on 14 videos acquired by a trained lab technician in a typical diagnostic lab environment and show the effectiveness of the proposed automated urinalysis pipeline with our novel reconstruction algorithm.
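
The abstract does not describe the reconstruction algorithm itself. As a point of reference only, the sketch below shows a generic focus-stacking baseline: for each pixel, keep the value from the frame with the largest absolute discrete Laplacian there. The function names and the choice of focus measure are assumptions made for illustration and should not be read as the authors' method.

```python
def laplacian_abs(img, r, c):
    """Absolute 4-neighbour Laplacian at (r, c), with edge pixels replicated."""
    h, w = len(img), len(img[0])

    def px(i, j):
        return img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    return abs(4 * px(r, c) - px(r - 1, c) - px(r + 1, c)
               - px(r, c - 1) - px(r, c + 1))


def all_in_focus(frames):
    """Per pixel, copy the value from the frame with the highest focus measure."""
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            sharpest = max(frames, key=lambda f: laplacian_abs(f, r, c))
            out[r][c] = sharpest[r][c]
    return out


# Two toy one-row "frames": each is sharp on one half and flat on the other.
frame_a = [[0, 10, 0, 5, 5, 5]]
frame_b = [[5, 5, 5, 0, 10, 0]]
fused = all_in_focus([frame_a, frame_b])
```

Practical focus stacking usually smooths the focus measure over a neighbourhood and registers the frames first; both steps are omitted here for brevity.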

2026-07-15T08:48:19Z Chinmay Nema Hari Om Aggrawal Dipam Goswami Rajiv Gupta Vinti Agarwal http://arxiv.org/abs/2603.08521v2 OccTrack360: 4D Panoptic Occupancy Tracking from Surround-View Fisheye Cameras 2026-07-15T08:44:09Z

Understanding dynamic 3D environments in a spatially continuous and temporally consistent manner is fundamental for robotics and autonomous driving. While recent advances in occupancy prediction provide a unified representation of scene geometry and semantics, progress in 4D panoptic occupancy tracking remains limited by the lack of benchmarks that support surround-view fisheye sensing, long temporal sequences, and instance-level voxel tracking. To address this gap, we present OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras. OccTrack360 provides substantially longer and more diverse sequences (174 to 2234 frames) than prior benchmarks, together with principled voxel visibility annotations, including an all-direction occlusion mask and an MEI-based fisheye field-of-view mask. To establish a strong fisheye-oriented baseline, we further propose Focus on Sphere Occ (FoSOcc), a framework that addresses two core challenges in fisheye occupancy tracking: distorted spherical projection and inaccurate voxel-space localization. FoSOcc includes a Center Focusing Module (CFM) to enhance instance-aware spatial localization through supervised focus guidance, and a Fisheye-based Enhanced Lifting (FEL) that extends perspective lifting to fisheye imaging under the Unified Projection Model. Extensive experiments on Occ3D-Waymo and OccTrack360 show that our method improves occupancy tracking quality with notable gains on geometrically regular categories, and establishes a strong baseline for future research on surround-view fisheye 4D occupancy tracking. The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360.

2026-03-09T15:55:26Z Accepted to IEEE/RSJ IROS 2026. The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360 Yongzhi Lin Kai Luo Yuanfan Zheng Hao Shi Mengfei Duan Yang Liu Kailun Yang http://arxiv.org/abs/2607.13471v1 Bring Music The Horizon: Music-Driven 360$^\circ$ Video Generation 2026-07-15T06:00:47Z

Music visualization offers a powerful way to enhance listeners' understanding and experience of music by translating auditory signals into visual forms. However, most existing approaches either rely heavily on lyrics or generate flat, non-immersive videos similar to conventional music videos, which limits their ability to convey the emotional dynamics of music and provide an immersive listening experience. We propose Bring Music The Horizon, an emotion-aware pipeline for music-driven 360$^\circ$ video generation. Given an input song, our work first estimates its emotional trajectory by predicting valence-arousal values at the level of every four bars. These values are then converted into emotion-aware visual guidance using EmotiCrafter, and these guidance vectors can be manipulated by the SEGA framework, which provides fine-grained semantic control for keyframe generation. Finally, image-to-video models are applied to the generated keyframes to synthesize temporally continuous 360$^\circ$ videos for immersive music visualization. Our pipeline generates 360$^\circ$ music visualization videos that reflect the emotional progression and temporal structure of the input song. We demonstrate its capability using songs from different genres and provide qualitative comparisons with From-Sound-To-Sight, a representative audio-to-visual generation baseline, on our project page at https://etoile-et-toi-mp3.github.io/BMTH_Project_Page/.

2026-07-15T06:00:47Z 5 pages, 1 figure Kai Hsu Tsai Yong Wei Fu Hung I Yang Yu-Chih Chen http://arxiv.org/abs/2505.09433v3 Efficient LiDAR Reflectance Compression via Scanning Serialization 2026-07-15T05:13:49Z

Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework that fully exploits the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, so that dependencies can be explored effectively. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2x volume reduction relative to the original reflectance data and reduces compressed bits by up to 22% compared with the state-of-the-art method while using only 2% of its parameters. Moreover, a lightweight version of SerLiC exceeds 10 frames per second (fps) with just 111K parameters, which is attractive for real-world applications.

2025-05-14T14:38:40Z Jiahao Zhu Kang You Dandan Ding Zhan Ma http://arxiv.org/abs/2607.12054v2 Analyzing Image Encoder Choices and Graph Homophily in GCN Frameworks for Breast Ultrasound Classification 2026-07-15T02:46:37Z

Breast ultrasound is widely used for screening, yet automated analysis remains challenging due to speckle noise, acquisition variability, and weak separation of benign and malignant cases in standard ultrasound imaging. Graph convolutional networks (GCNs) have recently emerged as a promising approach by leveraging relationships among similar patient samples. However, it remains unclear how the choice of image encoder influences graph construction and downstream classification performance. In this work, we systematically evaluate five image encoders spanning convolutional and transformer-based architectures for GCN-based breast ultrasound classification. Image embeddings are used to construct cosine similarity k-nearest-neighbor graphs, which are classified using a single-layer GCN with a linear classification head. Across three patient-wise cross-validation folds, higher-capacity encoders consistently improve graph homophily and downstream classification performance, yielding gains in accuracy, AUC, sensitivity, specificity, and F1-score. Moreover, test-set graph homophily exhibits a strong linear correlation with classification accuracy, with higher-capacity encoders consistently occupying the high-homophily, high-accuracy region, suggesting that encoder-driven improvements in graph structure are a key mechanism underlying the observed performance gains. These findings establish encoder selection as a critical factor in graph-based breast ultrasound classification and identify graph homophily as a key indicator linking representation quality to downstream classification performance.
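
The graph construction and the homophily measure can be illustrated with a small pure-Python sketch. It assumes edge homophily, the fraction of edges whose endpoints share a label, and a symmetrized k-nearest-neighbor graph; the abstract does not state which homophily definition or symmetrization the authors use, and the toy embeddings and function names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn_edges(embeddings, k):
    """Undirected edges linking each node to its k most cosine-similar nodes."""
    edges = set()
    for i, u in enumerate(embeddings):
        sims = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        sims.sort(reverse=True)
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(1 for i, j in edges if labels[i] == labels[j])
    return same / len(edges)


# Toy image embeddings: two tight clusters, one per class.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
edges = knn_edges(embeddings, k=1)
homophily = edge_homophily(edges, labels)
```

An encoder that places same-class samples close together yields neighbors with matching labels, so homophily approaches 1; mislabeling one node of the toy graph drops it to 0.5.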

2026-07-13T18:14:16Z Submitted to the MICCAI 2026 ASMUS Workshop (under review) Sabahattin Mert Daloglu Ceren Coskun Harvey Castro Soner Hacihaliloglu Ilker Hacihaliloglu http://arxiv.org/abs/2509.08685v2 Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding 2026-07-14T23:52:31Z

Given encoded 3D point cloud geometry available at the decoder, we study the problem of lossy attribute compression in a multi-resolution B-spline projection framework. A target continuous 3D attribute function is first projected onto a sequence of nested subspaces $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_{L}$, where $\mathcal{F}^{(p)}_{l}$ is a family of functions spanned by a B-spline basis function of order $p$ at a chosen scale and its integer shifts. The projected low-pass coefficients $F_l^*$ are computed by variable-complexity unrolling of a rate-distortion (RD) optimization algorithm into a feed-forward network, where the rate term is the sparsity-promoting $\ell_1$-norm. Thus, the projection operation is end-to-end differentiable. For a chosen coarse-to-fine predictor, the coefficients are then adjusted to account for the prediction from a lower resolution to a higher resolution, and this adjustment is also optimized in a data-driven manner.

2025-09-10T15:23:21Z Tam Thuc Do Philip A. Chou Gene Cheung http://arxiv.org/abs/2607.06597v2 Reconfigurable Radiology Labels Without Relabeling 2026-07-14T21:21:35Z

Public chest-radiograph (CXR) datasets are typically released with small, fixed label schemas such as CheXpert-14. However, the underlying free-text reports describe far more findings -- and which findings matter depends on the task, site, and reader. We release a pipeline that converts free-text reports into multi-label matrices and then reconfigures the label schema through dictionary edits rather than new inference passes, i.e., without relabeling the corpus. After this one-time pass, reconfiguring MIMIC-CXR (223K reports) from cached annotations takes 196 seconds with no API cost, compared to \$6.6K for an equivalent relabeling pass with Claude Opus 4.7. Using a 58-label taxonomy, we show that 43\% of CXR studies contain at least one finding outside CheXpert-14. Image probes trained on these labels match CheXpert-14 probes on shared targets while also reaching 0.78 AUROC on expert-reviewed long-tail labels that CheXpert-14 cannot represent. These results suggest a different unit of work for radiology labeling: once reports are structured, the label schema becomes a configuration to edit, not a corpus to relabel.
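
The idea of treating the label schema as configuration can be sketched in a few lines. The example assumes the one-time pass has already cached a set of fine-grained findings per report; the report IDs, finding names, and the `build_label_matrix` helper are hypothetical and are not taken from the released pipeline.

```python
# Hypothetical cache: fine-grained findings per report, produced once by the
# expensive extraction pass and never recomputed afterwards.
cached_findings = {
    "report_001": {"pleural effusion", "rib fracture"},
    "report_002": {"cardiomegaly"},
    "report_003": set(),
}

# A label schema is a plain dictionary: label -> findings that trigger it.
schema_v1 = {
    "Pleural Effusion": {"pleural effusion"},
    "Cardiomegaly": {"cardiomegaly"},
}


def build_label_matrix(cache, schema):
    """Return (report_ids, labels, matrix) with matrix[i][j] in {0, 1}."""
    report_ids = sorted(cache)
    labels = list(schema)
    matrix = [
        [int(bool(cache[rid] & schema[label])) for label in labels]
        for rid in report_ids
    ]
    return report_ids, labels, matrix


# Reconfigure by editing the dictionary; no report is re-read or re-annotated.
schema_v2 = dict(schema_v1)
schema_v2["Fracture"] = {"rib fracture", "clavicle fracture"}

report_ids, labels_v1, matrix_v1 = build_label_matrix(cached_findings, schema_v1)
_, labels_v2, matrix_v2 = build_label_matrix(cached_findings, schema_v2)
```

Adding the `Fracture` label only costs a set intersection per report, which is why reconfiguration from cached annotations needs no new inference pass.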

2026-07-06T23:19:20Z Jean-Benoit Delbrouck Dave Van Veen Akash Pattnaik Kalina Slavkova Javid Abderezaei Harris Bergman Khan Siddiqui http://arxiv.org/abs/2607.13204v1 Efficient Computing for Medical Image Acquisition and Reconstruction 2026-07-14T18:54:08Z

Medical imaging systems such as CT, MRI, PET, and SPECT do not directly acquire images. Instead, they measure physical signals that encode anatomical or physiological information, and image reconstruction recovers the underlying image by solving an inverse problem. Although these imaging modalities are governed by different imaging physics, they share a common computational framework that naturally connects medical physics, linear algebra, probability, numerical optimization, and efficient computing. As medical imaging systems acquire increasingly large and higher-dimensional datasets, image reconstruction has become one of the primary computational bottlenecks in modern medical imaging. Advanced reconstruction methods, including analytical reconstruction, iterative optimization, and statistical model-based reconstruction, substantially improve image quality while reducing radiation dose or scan time, but at significantly increased computational cost. Efficient computing has therefore become essential for achieving clinically practical reconstruction times. This chapter presents a unified computational perspective on medical image acquisition and reconstruction across CT, MRI, PET, and SPECT. It first reviews the imaging physics and data acquisition process for each modality and derives a generalized mathematical framework for image reconstruction. Building on this framework, the chapter discusses analytical, iterative, and statistical reconstruction methods together with their computational characteristics. Finally, it examines efficient computing considerations, including optimization algorithms, physics-aware forward operators, memory-efficient implementations, and parallel computing strategies. Together, these topics demonstrate how the integration of imaging physics, mathematical modeling, and efficient computing enables accurate and scalable medical image reconstruction.

2026-07-14T18:54:08Z book chapter for textbook "Medical Image Vision Handbook" Xiao Wang Jayasai Rajagopal Md Safaiat Hossain Peng Chen Mohamed Wahib Enzhi Zhang Emma J. Reid http://arxiv.org/abs/2607.12937v1 Exact and Calibrated Diffusion Reconstruction for Digital Breast Tomosynthesis 2026-07-14T16:10:10Z

Limited-angle digital breast tomosynthesis (DBT) reconstructs a volume from a few low-dose projections over a narrow arc. At a representative nine-view, $25^{\circ}$ protocol more than 98% of image space is unmeasured, so a learned prior must supply structure in the missing wedge. Conditional diffusion priors achieve strong perceptual quality here but leave three clinical obstacles: inexact data consistency, unlocalized hallucination, and uncalibrated uncertainty. We enforce measurements exactly by replacing the per-step proximal update of a conditional diffusion sampler with exact Euclidean projection onto the data-consistent set, computed via an $m$-dimensional dual system with a one-time Gram matrix $AA^{\top}$ factorization. This projection costs 4.5 ms per step (a $248\times$ speedup) and drives the data residual to the double-precision floor ($2.4\times10^{-13}$). We prove it is the $\rho\to0$ limit of the proximal step, provide a no-harm theorem, and show that exactly consistent sample ensembles have variance supported on null($A$). Thus, the mean's entire error lies in the unmeasured subspace covered by the uncertainty map. On patient-derived breast phantoms, this improves fidelity at no depth-resolution cost. Conversely, a proximal step applied post-update degrades quality, isolating the consistency step's placement as decisive. Isotonic recalibration brings the ensemble spread to a calibrated error scale (expected calibration error $0.029\to0.008$; standardized error $4.7\to0.96$), ranking errors better than the pure prior. We also repair a 20.3% adjoint mismatch in a deployed projector via a materialized operator of record. This is the first data-consistent, uncertainty-calibrated learned reconstruction for limited-angle DBT. The solver naturally relaxes to discrepancy-ball and maximum-a-posteriori modes for noisy measurements.
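
The exact projection step has a standard closed form: for a full-row-rank $A$, the Euclidean projection of $x$ onto $\{x : Ax = y\}$ is $x - A^{\top}\lambda$ with $(AA^{\top})\lambda = Ax - y$. The sketch below factors the $m \times m$ Gram matrix once by Cholesky and reuses the factor for every projection. It is a dense pure-Python illustration on a toy system, assuming $AA^{\top}$ is positive definite; the helper names are hypothetical and nothing here reflects the authors' implementation or operator.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def cholesky(G):
    """Lower-triangular L with L L^T = G, for symmetric positive-definite G."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def cho_solve(L, b):
    """Solve (L L^T) z = b by forward then backward substitution."""
    n = len(L)
    w = [0.0] * n
    for i in range(n):
        w[i] = (b[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (w[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def make_projector(A, y):
    """Factor A A^T once; return a function projecting x onto {x : A x = y}."""
    At = [list(col) for col in zip(*A)]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]
    L = cholesky(gram)  # one-time m x m factorization

    def project(x):
        residual = [ax - yi for ax, yi in zip(matvec(A, x), y)]
        lam = cho_solve(L, residual)   # dual variable, length m
        correction = matvec(At, lam)   # A^T lam, length n
        return [xi - ci for xi, ci in zip(x, correction)]

    return project


A = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # m = 2 measurements, n = 3 unknowns
y = [2.0, 3.0]
project = make_projector(A, y)
x_proj = project([0.0, 0.0, 0.0])
```

Because the correction $A^{\top}\lambda$ lies in the row space of $A$, the projection leaves the null-space component of $x$ untouched, which is consistent with the abstract's statement that the variance of exactly consistent samples is supported on null($A$).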

2026-07-14T16:10:10Z Imade Bouftini http://arxiv.org/abs/2603.08503v2 Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction 2026-07-14T11:58:17Z

Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at https://github.com/1170632760/Spherical-GOF.

2026-03-09T15:35:56Z Accepted to IEEE/RSJ IROS 2026. The source code and dataset will be released at https://github.com/1170632760/Spherical-GOF Zhe Yang Guoqiang Zhao Sheng Wu Kai Luo Kailun Yang http://arxiv.org/abs/2607.12641v1 GeoFovea-GS: Geometry-Aware Cross-Layer Gaussian Splatting for Wireless Aerial VR 2026-07-14T11:19:16Z

Wireless aerial virtual reality (VR) aims to provide immersive access to large-scale scenes, but high-resolution view generation and delivery are jointly constrained by limited bandwidth, latency, and power. 3D Gaussian Splatting (3DGS) can reduce the payload by rendering views from compact pose information, yet its geometry errors may cause severe VR quality degradation. Existing channel-aware or pixel-level resource allocation schemes fail to capture such geometry-sensitive distortion. To address this issue, this paper proposes GeoFovea-GS as a geometry-aware cross-layer framework for communication-efficient wireless aerial VR. A foveated geometry-aware distortion metric is developed to characterize photometric rendering error, geometric inconsistency, and view-dependent perceptual importance in a unified form. Based on this metric, the joint selection of pose-only 3DGS rendering and image/tile correction transmission is formulated as a cross-layer optimization problem under wireless constraints. A lightweight value-of-information scheduler is further developed to allocate communication resources to regions that are both geometry-critical and perceptually important. Experiments on real-world 3DGS scenes demonstrate that GeoFovea-GS achieves superior immersive rendering quality with substantially reduced transmission cost.

2026-07-14T11:19:16Z 7 pages, 5 figures Zeyi Ren Wencheng Yan Jiawen Zhang Jintao Yan Sheng Zhou Zhisheng Niu