https://arxiv.org/api/RW3SMWbWR3Mc9HwwkbSjJpm2ydc 2026-06-28T05:39:36Z 9390 1770 15 http://arxiv.org/abs/2508.06768v1 DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging 2025-08-09T01:04:11Z

Intraoperative ultrasound imaging provides real-time guidance during numerous surgical procedures, but its interpretation is complicated by noise, artifacts, and poor alignment with high-resolution preoperative MRI/CT scans. To bridge the gap between reoperative planning and intraoperative guidance, we present DiffUS, a physics-based, differentiable ultrasound renderer that synthesizes realistic B-mode images from volumetric imaging. DiffUS first converts MRI 3D scans into acoustic impedance volumes using a machine learning approach. Next, we simulate ultrasound beam propagation using ray tracing with coupled reflection-transmission equations. DiffUS formulates wave propagation as a sparse linear system that captures multiple internal reflections. Finally, we reconstruct B-mode images via depth-resolved echo extraction across fan-shaped acquisition geometry, incorporating realistic artifacts including speckle noise and depth-dependent degradation. DiffUS is entirely implemented as differentiable tensor operations in PyTorch, enabling gradient-based optimization for downstream applications such as slice-to-volume registration and volumetric reconstruction. Evaluation on the ReMIND dataset demonstrates DiffUS's ability to generate anatomically accurate ultrasound images from brain MRI data.

2025-08-09T01:04:11Z 10 pages, accepted to MICCAI ASMUS 25 Noe Bertramo Gabriel Duguey Vivek Gopalakrishnan http://arxiv.org/abs/2508.06757v1 VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions 2025-08-09T00:13:46Z

Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the Project page for code and dataset:https://yashgarg98.github.io/VOccl3D-dataset/

2025-08-09T00:13:46Z ICCV 2025 Yash Garg Saketh Bachu Arindam Dutta Rohit Lal Sarosij Bose Calvin-Khang Ta M. Salman Asif Amit Roy-Chowdhury http://arxiv.org/abs/2503.06364v2 Generative Video Bi-flow 2025-08-08T17:54:32Z

We propose a novel generative video model to robustly learn temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective which combines two aspects: The first is to map from the past into future video frames directly. Previous work has mapped the noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame, instead of noise, is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as the joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a conditional diffusion baseline but with higher speed, i.e., fewer ODE solver steps.

2025-03-09T00:03:59Z ICCV 2025. Project Page at https://ryushinn.github.io/ode-video Chen Liu Tobias Ritschel http://arxiv.org/abs/2508.06345v1 Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering 2025-08-08T14:18:24Z

Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those "one-size-fits-all" approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy

2025-08-08T14:18:24Z Yanbin Wei Jiangyue Yan Chun Kang Yang Chen Hua Liu James T. Kwok Yu Zhang http://arxiv.org/abs/2507.03256v3 MoDA: Multi-modal Diffusion Architecture for Talking Head Generation 2025-08-08T12:58:08Z

Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAE), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications. Project Page: https://lixinyyang.github.io/MoDA.github.io/

2025-07-04T02:25:10Z 12 pages, 7 figures Xinyang Li Gen Li Zhihui Lin Yichen Qian GongXin Yao Weinan Jia Aowen Wang Weihua Chen Fan Wang http://arxiv.org/abs/2409.13291v2 Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence 2025-08-08T07:52:24Z

Current data-driven methodologies for point cloud matching demand extensive training time and computational resources, presenting significant challenges for model deployment and application. In the point cloud matching task, recent advancements with an encoder-only Transformer architecture have revealed the emergence of semantically meaningful patterns in the attention heads, particularly resembling Gaussian functions centered on each point of the input shape. In this work, we further investigate this phenomenon by integrating these patterns as fixed attention weights within the attention heads of the Transformer architecture. We evaluate two variants: one utilizing predetermined variance values for the Gaussians, and another where the variance values are treated as learnable parameters. Additionally we analyze the performances on noisy data and explore a possible way to improve robustness to noise. Our findings demonstrate that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization. Furthermore, we conducted an ablation study to identify the specific layers where the infused information is most impactful and to understand the reliance of the network on this information.

2024-09-20T07:41:47Z Alessandro Riva Alessandro Raganato Simone Melzi 10.2312/stag.20241345 http://arxiv.org/abs/2508.06086v1 Exploring Interactive Simulation of Grass Display Color Characteristic Based on Real-World Conditions 2025-08-08T07:29:14Z

Recent research has focused on incorporating media into living environments via color-controlled materials and image display. In particular, grass-based displays have drawn attention as landscape-friendly interactive interfaces. To develop the grass display, it is important to obtain the grass color change characteristics that depend on the real environment. However, conventional methods require experiments on actual equipment every time the lighting or viewpoint changes, which is time-consuming and costly. Although research has begun on simulating grass colors, this approach still faces significant issues as it takes many hours for a single measurement. In this paper, we explore an interactive simulation of a grass display color change characteristic based on real-world conditions in a virtual environment. We evaluated our method's accuracy by simulating grass color characteristics across multiple viewpoints and environments, and then compared the results against prior work. The results indicated that our method tended to simulate the grass color characteristics similar to the actual characteristics and showed the potential to do so more quickly and with comparable accuracy to the previous study.

2025-08-08T07:29:14Z Accepted to 20th IFIP TC13 International Conference on Human-Computer Interaction (INTERACT '25), 24 pages Kojiro Tanaka Keiichi Sato Masahiko Mikawa Makoto Fujisawa http://arxiv.org/abs/2508.05626v1 Physically Controllable Relighting of Photographs 2025-08-07T17:58:42Z

We present a self-supervised approach to in-the-wild image relighting that enables fully controllable, physically based illumination editing. We achieve this by combining the physical accuracy of traditional rendering with the photorealistic appearance made possible by neural rendering. Our pipeline works by inferring a colored mesh representation of a given scene using monocular estimates of geometry and intrinsic components. This representation allows users to define their desired illumination configuration in 3D. The scene under the new lighting can then be rendered using a path-tracing engine. We send this approximate rendering of the scene through a feed-forward neural renderer to predict the final photorealistic relighting result. We develop a differentiable rendering process to reconstruct in-the-wild scene illumination, enabling self-supervised training of our neural renderer on raw image collections. Our method represents a significant step in bringing the explicit physical control over lights available in typical 3D computer graphics tools, such as Blender, to in-the-wild relighting.

2025-08-07T17:58:42Z Proc. SIGGRAPH 2025, 10 pages, 9 figures SIGGRAPH Conference Papers, Year 2025, Article No. 105, Pages 1 - 10 Chris Careaga Yağız Aksoy 10.1145/3721238.3730666 http://arxiv.org/abs/2508.05531v1 Point cloud segmentation for 3D Clothed Human Layering 2025-08-07T16:02:15Z

3D Cloth modeling and simulation is essential for avatars creation in several fields, such as fashion, entertainment, and animation. Achieving high-quality results is challenging due to the large variability of clothed body especially in the generation of realistic wrinkles. 3D scan acquisitions provide more accuracy in the representation of real-world objects but lack semantic information that can be inferred with a reliable semantic reconstruction pipeline. To this aim, shape segmentation plays a crucial role in identifying the semantic shape parts. However, current 3D shape segmentation methods are designed for scene understanding and interpretation and only few work is devoted to modeling. In the context of clothed body modeling the segmentation is a preliminary step for fully semantic shape parts reconstruction namely the underlying body and the involved garments. These parts represent several layers with strong overlap in contrast with standard segmentation methods that provide disjoint sets. In this work we propose a new 3D point cloud segmentation paradigm where each 3D point can be simultaneously associated to different layers. In this fashion we can estimate the underlying body parts and the unseen clothed regions, i.e., the part of a cloth occluded by the clothed-layer above. We name this segmentation paradigm clothed human layering. We create a new synthetic dataset that simulates very realistic 3D scans with the ground truth of the involved clothing layers. We propose and evaluate different neural network settings to deal with 3D clothing layering. We considered both coarse and fine grained per-layer garment identification. Our experiments demonstrates the benefit in introducing proper strategies for the segmentation on the garment domain on both the synthetic and real-world scan datasets.

2025-08-07T16:02:15Z Davide Garavaso Federico Masi Pietro Musoni Umberto Castellani http://arxiv.org/abs/2408.07836v2 Learned Single-Pass Multitasking Perceptual Graphics for Immersive Displays 2025-08-07T15:59:19Z

Emerging immersive display technologies efficiently utilize resources with perceptual graphics methods such as foveated rendering and denoising. Running multiple perceptual graphics methods challenges devices with limited power and computational resources. We propose a computationally-lightweight learned multitasking perceptual graphics model. Given RGB images and text-prompts, our model performs text-described perceptual tasks in a single inference step. Simply daisy-chaining multiple models or training dedicated models can lead to model management issues and exhaust computational resources. In contrast, our flexible method unlocks consistent high quality perceptual effects with reasonable compute, supporting various permutations at varied intensities using adjectives in text prompts (e.g. mildly, lightly). Text-guidance provides ease of use for dynamic requirements such as creative processes. To train our model, we propose a dataset containing source and perceptually enhanced images with corresponding text prompts. We evaluate our model on desktop and embedded platforms and validate perceptual quality through a user study.

2024-07-31T19:05:00Z Doğa Yılmaz He Wang Towaki Takikawa Duygu Ceylan Kaan Akşit 10.1145/3746027.3754801 http://arxiv.org/abs/2508.04326v2 Radiance Fields in XR: A Survey on How Radiance Fields are Envisioned and Addressed for XR Research 2025-08-07T15:24:10Z

The development of radiance fields (RF), such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF), has revolutionized interactive photorealistic view synthesis and presents enormous opportunities for XR research and applications. However, despite the exponential growth of RF research, RF-related contributions to the XR community remain sparse. To better understand this research gap, we performed a systematic survey of current RF literature to analyze (i) how RF is envisioned for XR applications, (ii) how they have already been implemented, and (iii) the remaining research gaps. We collected 365 RF contributions related to XR from computer vision, computer graphics, robotics, multimedia, human-computer interaction, and XR communities, seeking to answer the above research questions. Among the 365 papers, we performed an analysis of 66 papers that already addressed a detailed aspect of RF research for XR. With this survey, we extended and positioned XR-specific RF research topics in the broader RF research field and provide a helpful resource for the XR community to navigate within the rapid development of RF research.

2025-08-06T11:14:06Z This work is a pre-print version of a paper that has been accepted to the IEEE TVCG journal for future publication Ke Li Mana Masuda Susanne Schmidt Shohei Mori http://arxiv.org/abs/2508.05187v1 Refining Gaussian Splatting: A Volumetric Densification Approach 2025-08-07T09:23:17Z

Achieving high-quality novel view synthesis in 3D Gaussian Splatting (3DGS) often depends on effective point primitive management. The underlying Adaptive Density Control (ADC) process addresses this issue by automating densification and pruning. Yet, the vanilla 3DGS densification strategy shows key shortcomings. To address this issue, in this paper we introduce a novel density control method, which exploits the volumes of inertia associated to each Gaussian function to guide the refinement process. Furthermore, we study the effect of both traditional Structure from Motion (SfM) and Deep Image Matching (DIM) methods for point cloud initialization. Extensive experimental evaluations on the Mip-NeRF 360 dataset demonstrate that our approach surpasses 3DGS in reconstruction quality, delivering encouraging performance across diverse scenes.

2025-08-07T09:23:17Z Mohamed Abdul Gafoor Marius Preda Titus Zaharia 10.24132/CSRN.2025-6 http://arxiv.org/abs/2507.21411v3 InSituTale: Enhancing Augmented Data Storytelling with Physical Objects 2025-08-07T07:22:16Z

Augmented data storytelling enhances narrative delivery by integrating visualizations with physical environments and presenter actions. Existing systems predominantly rely on body gestures or speech to control visualizations, leaving interactions with physical objects largely underexplored. We introduce augmented physical data storytelling, an approach enabling presenters to manipulate visualizations through physical object interactions. To inform this approach, we first conducted a survey of data-driven presentations to identify common visualization commands. We then conducted workshops with nine HCI/VIS researchers to collect mappings between physical manipulations and these commands. Guided by these insights, we developed InSituTale, a prototype that combines object tracking via a depth camera with Vision-LLM for detecting real-world events. Through physical manipulations, presenters can dynamically execute various visualization commands, delivering cohesive data storytelling experiences that blend physical and digital elements. A user study with 12 participants demonstrated that InSituTale enables intuitive interactions, offers high utility, and facilitates an engaging presentation experience.

2025-07-29T00:35:11Z Kentaro Takahira Yue Yu Takanori Fujiwara Ryo Suzuki Huamin Qu 10.1145/3746059.3747678 http://arxiv.org/abs/2502.20045v2 Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting 2025-08-07T05:06:41Z

Professional 3D asset creation often requires diverse sculpting brushes to add surface details and geometric structures. Despite recent progress in 3D generation, producing reusable sculpting brushes compatible with artists' workflows remains an open and challenging problem. These sculpting brushes are typically represented as vector displacement maps (VDMs), which existing models cannot easily generate compared to natural images. This paper presents Text2VDM, a novel framework for text-to-VDM brush generation through the deformation of a dense planar mesh guided by score distillation sampling (SDS). The original SDS loss is designed for generating full objects and struggles with generating desirable sub-object structures from scratch in brush generation. We refer to this issue as semantic coupling, which we address by introducing weighted blending of prompt tokens to SDS, resulting in a more accurate target distribution and semantic guidance. Experiments demonstrate that Text2VDM can generate diverse, high-quality VDM brushes for sculpting surface details and geometric structures. Our generated brushes can be seamlessly integrated into mainstream modeling software, enabling various applications such as mesh stylization and real-time interactive modeling.

2025-02-27T12:36:51Z 16 pages, 21 figures Hengyu Meng Duotun Wang Zhijing Shao Ligang Liu Zeyu Wang http://arxiv.org/abs/2507.12714v2 NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement 2025-08-07T04:24:35Z

We develop a neural parametric model for 3D leaves for plant modeling and reconstruction that are essential for agriculture and computer graphics. While neural parametric models are actively studied for humans and animals, plant leaves present unique challenges due to their diverse shapes and flexible deformation. To this problem, we introduce a neural parametric model for leaves, NeuraLeaf. Capitalizing on the fact that flattened leaf shapes can be approximated as a 2D plane, NeuraLeaf disentangles the leaves' geometry into their 2D base shapes and 3D deformations. This representation allows learning from rich sources of 2D leaf image datasets for the base shapes, and also has the advantage of simultaneously learning textures aligned with the geometry. To model the 3D deformation, we propose a novel skeleton-free skinning model and create a newly captured 3D leaf dataset called DeformLeaf. We show that NeuraLeaf successfully generates a wide range of leaf shapes with deformation, resulting in accurate model fitting to 3D observations like depth maps and point clouds. Our implementation and dataset are available at https://neuraleaf-yang.github.io/.

2025-07-17T01:46:24Z IEEE/CVF International Conference on Computer Vision (ICCV 2025), Highlight, Project: https://neuraleaf-yang.github.io/ Yang Yang Dongni Mao Hiroaki Santo Yasuyuki Matsushita Fumio Okura