https://arxiv.org/api/IPfmRfEz9YZK5+4ZjnFlDLlpUoY
2026-07-01T11:17:34Z
9421213015

http://arxiv.org/abs/2506.04858v1
Beyond the Desktop: XR-Driven Segmentation with Meta Quest 3 and MX Ink
2025-06-05T10:25:46Z
Medical imaging segmentation is essential in clinical settings for diagnosing diseases, planning surgeries, and other procedures. However, manual annotation is a cumbersome and effortful task. To mitigate these aspects, this study implements and evaluates the usability and clinical applicability of an extended reality (XR)-based segmentation tool for anatomical CT scans, using the Meta Quest 3 headset and Logitech MX Ink stylus. We develop an immersive interface enabling real-time interaction with 2D and 3D medical imaging data in a customizable workspace designed to mitigate workflow fragmentation and cognitive demands inherent to conventional manual segmentation tools. The platform combines stylus-driven annotation, mirroring traditional pen-on-paper workflows, with instant 3D volumetric rendering. A user study with a public craniofacial CT dataset demonstrated the tool's foundational viability, achieving a System Usability Scale (SUS) score of 66, within the expected range for medical applications. Participants highlighted the system's intuitive controls (scoring 4.1/5 for self-descriptiveness on ISONORM metrics) and spatial interaction design, with qualitative feedback highlighting strengths in hybrid 2D/3D navigation and realistic stylus ergonomics. While users identified opportunities to enhance task-specific precision and error management, the platform's core workflow enabled dynamic slice adjustment, reducing cognitive load compared to desktop tools.
Results position the XR-stylus paradigm as a promising foundation for immersive segmentation tools, with iterative refinements targeting haptic feedback calibration and workflow personalization to advance adoption in preoperative planning.
2025-06-05T10:25:46Z
10 pages
Lisle Faray de Paiva, Gijs Luijten, Ana Sofia Ferreira Santos, Moon Kim, Behrus Puladi, Jens Kleesiek, Jan Egger

http://arxiv.org/abs/2506.04841v1
Midplane based 3D single pass unbiased segment-to-segment contact interaction using penalty method
2025-06-05T10:05:25Z
This work introduces a contact interaction methodology for an unbiased treatment of contacting surfaces without assigning surfaces as master and slave. The contact tractions between interacting discrete segments are evaluated with respect to a midplane in a single pass, inherently maintaining the equilibrium of tractions. These tractions are based on the penalisation of true interpenetration between opposite surfaces, and the procedure of their integral for discrete contacting segments is described in this paper. A meticulous examination of the different possible geometric configurations of interacting 3D segments is presented to develop visual understanding and better traction evaluation accuracy. The accuracy and robustness of the proposed method are validated against the analytical solutions of the contact patch test, two-beam bending, Hertzian contact, and flat punch test, thus proving the capability to reproduce contact between flat surfaces, curved surfaces, and sharp corners in contact, respectively. The method passes the contact patch test with the uniform transmission of contact pressure matching the accuracy levels of finite elements. It converges towards the analytical solution with mesh refinement and a suitably high penalty factor. The effectiveness of the proposed algorithm also extends to self-contact problems and has been tested for self-contact between flat and curved surfaces with inelastic material.
Dynamic problems of elastic and inelastic collisions between bars, as well as oblique collisions of cylinders, are also presented. The ability of the algorithm to resolve contacts between flat and curved surfaces for nonconformal meshes with high accuracy demonstrates its versatility in general contact problems.
2025-06-05T10:05:25Z
Indrajeet Sahu, Nik Petrinic

http://arxiv.org/abs/2506.04623v1
VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
2025-06-05T04:31:55Z
3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space.
The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction. Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.
2025-06-05T04:31:55Z
Project Page: https://vita-epfl.github.io/VoxDet/
Wuyang Li, Zhu Yu, Alexandre Alahi

http://arxiv.org/abs/2310.17451v4
Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings
2025-06-05T03:24:20Z
Making neural visual generative models controllable by logical reasoning systems is promising for improving faithfulness, transparency, and generalizability. We propose the Abductive visual Generation (AbdGen) approach to build such logic-integrated models. A vector-quantized symbol grounding mechanism and the corresponding disentanglement training method are introduced to enhance the controllability of logical symbols over generation. Furthermore, we propose two logical abduction methods to make our approach require few labeled training data and support the induction of latent logical generative rules from data. We experimentally show that our approach can be utilized to integrate various neural generative models with logical reasoning systems, by either learning from scratch or utilizing pre-trained models directly. The code is released at https://github.com/future-item/AbdGen.
2023-10-26T15:00:21Z
KDD 2025 research track paper
Yifei Peng, Zijie Zha, Yu Jin, Zhexu Luo, Wang-Zhou Dai, Zhong Ren, Yao-Xiang Ding, Kun Zhou

http://arxiv.org/abs/2506.04444v1
Photoreal Scene Reconstruction from an Egocentric Device
2025-06-04T20:53:43Z
In this paper, we investigate the challenges associated with using egocentric devices to photorealistically reconstruct the scene in high dynamic range.
Existing methodologies typically assume using frame-rate 6DoF pose estimated from the device's visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work treating the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling shutter RGB sensing camera in a high frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely-used variants of Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at http://www.projectaria.com/photoreal-reconstruction/
2025-06-04T20:53:43Z
Paper accepted to SIGGRAPH Conference Paper 2025
Zhaoyang Lv, Maurizio Monge, Ka Chen, Yufeng Zhu, Michael Goesele, Jakob Engel, Zhao Dong, Richard Newcombe

http://arxiv.org/abs/2506.00839v2
Neural Path Guiding with Distribution Factorization
2025-06-04T18:10:39Z
In this paper, we present a neural path guiding method to aid with Monte Carlo (MC) integration in rendering.
Existing neural methods utilize distribution representations that are either fast or expressive, but not both. We propose a simple, but effective, representation that is sufficiently expressive and reasonably fast. Specifically, we break down the 2D distribution over the directional domain into two 1D probability distribution functions (PDFs). We propose to model each 1D PDF using a neural network that estimates the distribution at a set of discrete coordinates. The PDF at an arbitrary location can then be evaluated and sampled through interpolation. To train the network, we maximize the similarity of the learned and target distributions. To reduce the variance of the gradient during optimizations and estimate the normalization factor, we propose to cache the incoming radiance using an additional network. Through extensive experiments, we demonstrate that our approach is better than the existing methods, particularly in challenging scenes with complex light transport.
2025-06-01T05:04:56Z
11 pages, 11 figures. Accepted to EGSR 2025
Pedro Figueiredo, Qihao He, Nima Khademi Kalantari

http://arxiv.org/abs/2506.04283v1
SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization
2025-06-04T07:22:48Z
We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules, which often compromise perceptual consistency, our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions.
Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at github.com/Giventicket/SSIMBaD-Sigma-Scaling-with-SSIM-Guided-Balanced-Diffusion-for-AnimeFace-Colorization
2025-06-04T07:22:48Z
10 pages, rest of the pages are appendix
Junpyo Seo (Department of Computer Science, Seoul National University), Hanbin Koo (Department of Computer Science, Seoul National University), Jieun Yook (Department of Computer Science, Seoul National University), Byung-Ro Moon (Department of Computer Science, Seoul National University)

http://arxiv.org/abs/2506.03594v1
SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
2025-06-04T05:53:16Z
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors.
Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
2025-06-04T05:53:16Z
https://github.com/ripl/splart
Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter

http://arxiv.org/abs/2506.03478v1
Facial Appearance Capture at Home with Patch-Level Reflectance Prior
2025-06-04T01:21:07Z
Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.
2025-06-04T01:21:07Z
ACM Transactions on Graphics (Proc. of SIGGRAPH), 2025.
Code: https://github.com/yxuhan/DoRA; Project Page: https://yxuhan.github.io/DoRA
Yuxuan Han, Junfeng Lyu, Kuan Sheng, Minghao Que, Qixuan Zhang, Lan Xu, Feng Xu

http://arxiv.org/abs/2504.03980v3
Virtual Reality Lensing for Surface Approximation in Feature-driven DVR
2025-06-03T20:24:24Z
We present a novel lens technique to support the identification of heterogeneous features in direct volume rendering (DVR) visualizations. In contrast to data-centric transfer function (TF) design, our image-driven approach enables users to specify target features directly within the visualization using deformable quadric surfaces. The lens leverages quadrics for their expressive yet simple parametrization, enabling users to sculpt feature approximations by composing multiple quadric lenses. By doing so, the lens offers greater versatility than traditional rigid-shape lenses for selecting and bringing into focus features with irregular geometry. We discuss the lens visualization and interaction design, advocating for bimanual spatial virtual reality (VR) input for reducing cognitive and physical strain. We also report findings from a pilot qualitative evaluation with a domain specialist using a public asteroid impact dataset. These insights not only shed light on the benefits and pitfalls of using deformable lenses but also suggest directions for future research.
2025-04-04T22:47:05Z
Roberta Mota, Ehud Sharlin, Usman Alim

http://arxiv.org/abs/2506.05397v1
Gen4D: Synthesizing Humans and Scenes in the Wild
2025-06-03T20:04:41Z
Lack of input data for in-the-wild activities often results in low performance across various computer vision tasks. This challenge is particularly pronounced in uncommon human-centric domains like sports, where real-world data collection is complex and impractical.
While synthetic datasets offer a promising alternative, existing approaches typically suffer from limited diversity in human appearance, motion, and scene composition due to their reliance on rigid asset libraries and hand-crafted rendering pipelines. To address this, we introduce Gen4D, a fully automated pipeline for generating diverse and photorealistic 4D human animations. Gen4D integrates expert-driven motion encoding, prompt-guided avatar generation using diffusion-based Gaussian splatting, and human-aware background synthesis to produce highly varied and lifelike human sequences. Based on Gen4D, we present SportPAL, a large-scale synthetic dataset spanning three sports: baseball, ice hockey, and soccer. Together, Gen4D and SportPAL provide a scalable foundation for constructing synthetic datasets tailored to in-the-wild human-centric vision tasks, with no need for manual 3D modeling or scene design.
2025-06-03T20:04:41Z
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
Jerrin Bright, Zhibo Wang, Yuhao Chen, Sirisha Rambhatla, John Zelek, David Clausi

http://arxiv.org/abs/2506.03118v1
HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers
2025-06-03T17:50:05Z
3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM).
Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at https://zju3dv.github.io/humanram/.
2025-06-03T17:50:05Z
Accepted by SIGGRAPH 2025 (Conference Track). Project page: https://zju3dv.github.io/humanram/
SIGGRAPH 2025 Conference Proceedings
Zhiyuan Yu, Zhe Li, Hujun Bao, Can Yang, Xiaowei Zhou
10.1145/3721238.3730605

http://arxiv.org/abs/2506.03099v1
TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models
2025-06-03T17:29:28Z
In this paper, we present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model.
Our primary contributions include: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations such as: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, (c) elimination of redundant recomputations to maximize frame-generation throughput. Please see demo videos here: https://aaxwaz.github.io/TalkingMachines/
2025-06-03T17:29:28Z
Chetwin Low, Weimin Wang

http://arxiv.org/abs/2506.02895v1
VolTex: Food Volume Estimation using Text-Guided Segmentation and Neural Surface Reconstruction
2025-06-03T14:03:28Z
Accurate food volume estimation is crucial for dietary monitoring, medical nutrition management, and food intake analysis. Existing 3D food volume estimation methods accurately compute the food volume but lack support for selecting food portions. We present VolTex, a framework that improves the food object selection in food volume estimation. Allowing users to specify a target food item via text input to be segmented, our method enables the precise selection of specific food objects in real-world scenes. The segmented object is then reconstructed using the Neural Surface Reconstruction method to generate high-fidelity 3D meshes for volume computation. Extensive evaluations on the MetaFood3D dataset demonstrate the effectiveness of our approach in isolating and reconstructing food items for accurate volume estimation.
The source code is accessible at https://github.com/GCVCG/VolTex.
2025-06-03T14:03:28Z
Ahmad AlMughrabi, Umair Haroon, Ricardo Marques, Petia Radeva

http://arxiv.org/abs/2506.00512v2
Pro3D-Editor : A Progressive-Views Perspective for Consistent and Precise 3D Editing
2025-06-03T12:03:44Z
Text-guided 3D editing aims to precisely edit semantically relevant local 3D regions, which has significant potential for various practical applications ranging from 3D games to film production. Existing methods typically follow a view-indiscriminate paradigm: editing 2D views indiscriminately and projecting them back into 3D space. However, they overlook the different cross-view interdependencies, resulting in inconsistent multi-view editing. In this study, we argue that ideal consistent 3D editing can be achieved through a progressive-views paradigm, which propagates editing semantics from the editing-salient view to other editing-sparse views. Specifically, we propose Pro3D-Editor, a novel framework, which mainly includes Primary-view Sampler, Key-view Render, and Full-view Refiner. Primary-view Sampler dynamically samples and edits the most editing-salient view as the primary view. Key-view Render accurately propagates editing semantics from the primary view to other key views through its Mixture-of-View-Experts Low-Rank Adaption (MoVE-LoRA). Full-view Refiner edits and refines the 3D object based on the edited multi-views. Extensive experiments demonstrate that our method outperforms existing methods in editing accuracy and spatial consistency.
2025-05-31T11:11:55Z
Yang Zheng, Mengqi Huang, Nan Chen, Zhendong Mao