https://arxiv.org/api/bFi3wGWJoDpZqNsKzBB8rAf2VVE 2026-06-22T13:39:42Z 9354 990 15 http://arxiv.org/abs/2510.23494v2 Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap 2026-01-22T13:52:11Z

Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.

2025-10-27T16:28:55Z Elisabeth Jüttner Janelle Pfeifer Leona Krath Stefan Korfhage Hannah Dröge Matthias B. Hullin Markus Plack http://arxiv.org/abs/2603.03287v1 Deep Sketch-Based 3D Modeling: A Survey 2026-01-22T03:22:00Z

In the past decade, advances in artificial intelligence have revolutionized sketch-based 3D modeling, leading to a new paradigm known as Deep Sketch-Based 3D Modeling (DS-3DM). DS-3DM offers data-driven methods that address the long-standing challenges of sketch abstraction and ambiguity. DS-3DM keeps humans at the center of the creative process by enhancing the flexibility, usability, faithfulness, and adaptability of sketch-based 3D modeling interfaces. This paper contributes a comprehensive survey of the latest DS-3DM within a novel design space: MORPHEUS. Built upon the Input-Model-Output (IMO) framework, MORPHEUS categorizes Models outputting Options of 3D Representations and Parts, derived from Human inputs (varying in quantity and modality), and Evaluated across diverse User-views and Styles. Throughout MORPHEUS we highlight limitations and identify opportunities for interdisciplinary research in Computer Vision, Computer Graphics, and Human-Computer Interaction, revealing a need for controllability and information-rich outputs. These opportunities align design processes more closely with user' intent, responding to the growing importance of user-centered approaches.

2026-01-22T03:22:00Z Alberto Tono Jiajun Wu Gordon Wetzstein Iro Armeni Hariharan Subramonyam James Landay Martin Fischer http://arxiv.org/abs/2601.15431v1 SplatBus: A Gaussian Splatting Viewer Framework via GPU Interprocess Communication 2026-01-21T19:56:22Z

Radiance field-based rendering methods have attracted significant interest from the computer vision and computer graphics communities. They enable high-fidelity rendering with complex real-world lighting effects, but at the cost of high rendering time. 3D Gaussian Splatting solves this issue with a rasterisation-based approach for real-time rendering, enabling applications such as autonomous driving, robotics, virtual reality, and extended reality. However, current 3DGS implementations are difficult to integrate into traditional mesh-based rendering pipelines, which is a common use case for interactive applications and artistic exploration. To address this limitation, this software solution uses Nvidia's interprocess communication (IPC) APIs to easily integrate into implementations and allow the results to be viewed in external clients such as Unity, Blender, Unreal Engine, and OpenGL viewers. The code is available at https://github.com/RockyXu66/splatbus.

2026-01-21T19:56:22Z Yinghan Xu Théo Morales John Dingliana http://arxiv.org/abs/2601.14844v1 CAG-Avatar: Cross-Attention Guided Gaussian Avatars for High-Fidelity Head Reconstruction 2026-01-21T10:22:53Z

Creating high-fidelity, real-time drivable 3D head avatars is a core challenge in digital animation. While 3D Gaussian Splashing (3D-GS) offers unprecedented rendering speed and quality, current animation techniques often rely on a "one-size-fits-all" global tuning approach, where all Gaussian primitives are uniformly driven by a single expression code. This simplistic approach fails to unravel the distinct dynamics of different facial regions, such as deformable skin versus rigid teeth, leading to significant blurring and distortion artifacts. We introduce Conditionally-Adaptive Gaussian Avatars (CAG-Avatar), a framework that resolves this key limitation. At its core is a Conditionally Adaptive Fusion Module built on cross-attention. This mechanism empowers each 3D Gaussian to act as a query, adaptively extracting relevant driving signals from the global expression code based on its canonical position. This "tailor-made" conditioning strategy drastically enhances the modeling of fine-grained, localized dynamics. Our experiments confirm a significant improvement in reconstruction fidelity, particularly for challenging regions such as teeth, while preserving real-time rendering performance.

2026-01-21T10:22:53Z Zhe Chang Haodong Jin Yan Song Hui Yu http://arxiv.org/abs/2601.14766v1 PAColorHolo: A Perceptually-Aware Color Management Framework for Holographic Displays 2026-01-21T08:43:28Z

Holographic displays offer significant potential for augmented and virtual reality applications by reconstructing wavefronts that enable continuous depth cues and natural parallax without vergence-accommodation conflict. However, despite advances in pixel-level image quality, current systems struggle to achieve perceptually accurate color reproduction--an essential component of visual realism. These challenges arise from complex system-level distortions caused by coherent laser illumination, spatial light modulator imperfections, chromatic aberrations, and camera-induced color biases. In this work, we propose a perceptually-aware color management framework for holographic displays that jointly addresses input-output color inconsistencies through color space transformation, adaptive illumination control, and neural network-based perceptual modeling of the camera's color response. We validate the effectiveness of our approach through numerical simulations, optical experiments, and a controlled user study. The results demonstrate substantial improvements in perceptual color fidelity, laying the groundwork for perceptually driven holographic rendering in future systems.

2026-01-21T08:43:28Z Preprint (accepted to ACM TOG), 34 pages, 32 figures Chun Chen Minseok Chae Seung-Woo Nam Myeong-Ho Choi Minseong Kim Eunbi Lee Yoonchan Jeong Jae-Hyeung Park http://arxiv.org/abs/2412.04827v3 PanoDreamer: Optimization-Based Single Image to 360 3D Scene With Diffusion 2026-01-21T01:27:09Z

In this paper, we present PanoDreamer, a novel method for producing a coherent 360° 3D scene from a single input image. Unlike existing methods that generate the scene sequentially, we frame the problem as single-image panorama and depth estimation. Once the coherent panoramic image and its corresponding depth are obtained, the scene can be reconstructed by inpainting the small occluded regions and projecting them into 3D space. Our key contribution is formulating single-image panorama and depth estimation as two optimization tasks and introducing alternating minimization strategies to effectively solve their objectives. We demonstrate that our approach outperforms existing techniques in single-image 360° 3D scene reconstruction in terms of consistency and overall quality.

2024-12-06T07:42:48Z SIGGRAPH Asia 2025, Project page: https://people.engr.tamu.edu/nimak/Papers/PanoDreamer, Code: https://github.com/avinashpaliwal/PanoDreamer Avinash Paliwal Xilong Zhou Andrii Tsarov Nima Khademi Kalantari 10.1145/3757377.3763883 http://arxiv.org/abs/2503.10860v2 RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors 2026-01-21T00:35:13Z

In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.

2025-03-13T20:16:58Z ICCV 2025, Project page: https://people.engr.tamu.edu/nimak/Papers/RI3D, Code: https://github.com/avinashpaliwal/RI3D Avinash Paliwal Xilong Zhou Wei Ye Jinhui Xiong Rakesh Ranjan Nima Khademi Kalantari http://arxiv.org/abs/2601.14208v1 Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting 2026-01-20T18:13:03Z

Inspecting the undercarriage of used vehicles is a labor-intensive task that requires inspectors to crouch or crawl underneath each vehicle to thoroughly examine it. Additionally, online buyers rarely see undercarriage photos. We present an end-to-end pipeline that utilizes a three-camera rig to capture videos of the undercarriage as the vehicle drives over it, and produces an interactive 3D model of the undercarriage. The 3D model enables inspectors and customers to rotate, zoom, and slice through the undercarriage, allowing them to detect rust, leaks, or impact damage in seconds, thereby improving both workplace safety and buyer confidence. Our primary contribution is a rig-aware Structure-from-Motion (SfM) pipeline specifically designed to overcome the challenges of wide-angle lens distortion and low-parallax scenes. Our method overcomes the challenges of wide-angle lens distortion and low-parallax scenes by integrating precise camera calibration, synchronized video streams, and strong geometric priors from the camera rig. We use a constrained matching strategy with learned components, the DISK feature extractor, and the attention-based LightGlue matcher to generate high-quality sparse point clouds that are often unattainable with standard SfM pipelines. These point clouds seed the Gaussian splatting process to generate photorealistic undercarriage models that render in real-time. Our experiments and ablation studies demonstrate that our design choices are essential to achieve state-of-the-art quality.

2026-01-20T18:13:03Z 8 pages, 9 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/ Nitin Kulkarni Akhil Devarashetti Charlie Cluss Livio Forte Dan Buckmaster Philip Schneider Chunming Qiao Alina Vereshchaka http://arxiv.org/abs/2601.13689v1 Criminator: An Easy-to-Use XR "Crime Animator" for Rapid Reconstruction and Analysis of Dynamic Crime Scenes 2026-01-20T07:43:48Z

Law enforcement authorities are increasingly interested in 3D modelling for virtual crime scene reconstruction, enabling offline analysis without the cost and contamination risk of on-site investigation. Past work has demonstrated spatial relationships through static modelling but validating the sequence of events in dynamic scenarios is crucial for solving a case. Yet, animation tools are not well suited to crime scene reconstruction, and complex for non-experts in 3D modelling/animation. Through a co-design process with criminology experts, we designed "Criminator"-a methodological framework and XR tool that simplifies animation authoring. We evaluated this tool with participants trained in criminology (n=6) and untrained individuals (n=12). Both groups were able to successfully complete the character animation tasks and provided high usability ratings for observation tasks. Criminator has potential for hypothesis testing, demonstration, sense-making, and training. Challenges remain in how such a tool fits into the entire judicial process, with questions about including animations as evidence.

2026-01-20T07:43:48Z Vahid Pooryousef Lonni Besançon Maxime Cordeil Chris Flight Alastair M Ross AM Richard Bassed Tim Dwyer 10.1145/3772318.3791210 http://arxiv.org/abs/2503.12052v3 A Text-to-3D Framework for Joint Generation of CG-Ready Humans and Compatible Garments 2026-01-19T18:32:27Z

Creating detailed 3D human avatars with fitted garments traditionally requires specialized expertise and labor-intensive workflows. While recent advances in generative AI have enabled text-to-3D human and clothing synthesis, existing methods fall short in offering accessible, integrated pipelines for generating CG-ready 3D avatars with physically compatible outfits; here we use the term CG-ready for models following a technical aesthetic common in computer graphics (CG) and adopt standard CG polygonal meshes and strands representations (rather than neural representations like NeRF and 3DGS) that can be directly integrated into conventional CG pipelines and support downstream tasks such as physical simulation. To bridge this gap, we introduce Tailor, an integrated text-to-3D framework that generates high-fidelity, customizable 3D avatars dressed in simulation-ready garments. Tailor consists of three stages. (1) Seman tic Parsing: we employ a large language model to interpret textual descriptions and translate them into parameterized human avatars and semantically matched garment templates. (2) Geometry-Aware Garment Generation: we propose topology-preserving deformation with novel geometric losses to generate body-aligned garments under text control. (3) Consistent Texture Synthesis: we propose a novel multi-view diffusion process optimized for garment texturing, which enforces view consistency, preserves photorealistic details, and optionally supports symmetric texture generation common in garments. Through comprehensive quantitative and qualitative evaluations, we demonstrate that Tailor outperforms state-of-the-art methods in fidelity, usability, and diversity. Our code will be released for academic use. Project page: https://human-tailor.github.io

2025-03-15T08:58:02Z Project page: https://human-tailor.github.io Zhiyao Sun Yu-Hui Wen Ho-Jui Fang Sheng Ye Matthieu Lin Tian Lv Yong-Jin Liu 10.1109/TVCG.2026.3668900 http://arxiv.org/abs/2305.14080v3 Eye-tracked Virtual Reality: A Comprehensive Survey on Methods and Privacy Challenges 2026-01-19T10:52:44Z

The latest developments in computer hardware, sensor technologies, and artificial intelligence can make virtual reality (VR) and virtual spaces an important part of human everyday life. Eye tracking offers not only a hands-free way of interaction but also the possibility of a deeper understanding of human visual attention and cognitive processes in VR. Despite these possibilities, eye-tracking data also reveals users' privacy-sensitive attributes when combined with the information about the presented stimulus. To address all, this survey first covers major works in eye tracking, VR, and privacy areas between 2012 and 2022. While eye tracking in VR part covers the computational eye tracking pipeline from pupil detection and gaze estimation to offline data analysis, for privacy and security, we focus on eye-based authentication as well as computational methods to preserve the privacy of individuals and their eye-tracking data in VR. Later, we outline three main directions by focusing on privacy. In summary, this survey presents an extensive literature review of the utmost possibilities of eye tracking in VR and their privacy implications.

2023-05-23T14:02:38Z Accepted for publication in the Proceedings of the IEEE Efe Bozkir Süleyman Özdel Mengdi Wang Brendan David-John Hong Gao Kevin Butler Eakta Jain Enkelejda Kasneci 10.1109/JPROC.2026.3653661 http://arxiv.org/abs/2509.00052v2 Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation 2026-01-19T08:05:12Z

Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, limiting practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework addressing these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), caching static features to bypass most model layers in inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers to bring extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.

2025-08-25T02:58:39Z Jianzhi Long Wenhao Sun Rongcheng Tu Dacheng Tao http://arxiv.org/abs/2601.09291v2 TIDI-GS: Floater Suppression in 3D Gaussian Splatting for Enhanced Indoor Scene Fidelity 2026-01-19T03:49:13Z

3D Gaussian Splatting (3DGS) is a technique to create high-quality, real-time 3D scenes from images. This method often produces visual artifacts known as floaters--nearly transparent, disconnected elements that drift in space away from the actual surface. This geometric inaccuracy undermines the reliability of these models for practical applications, which is critical. To address this issue, we introduce TIDI-GS, a new training framework designed to eliminate these floaters. A key benefit of our approach is that it functions as a lightweight plugin for the standard 3DGS pipeline, requiring no major architectural changes and adding minimal overhead to the training process. The core of our method is a floater pruning algorithm--TIDI--that identifies and removes floaters based on several criteria: their consistency across multiple viewpoints, their spatial relationship to other elements, and an importance score learned during training. The framework includes a mechanism to preserve fine details, ensuring that important high-frequency elements are not mistakenly removed. This targeted cleanup is supported by a monocular depth-based loss function that helps improve the overall geometric structure of the scene. Our experiments demonstrate that TIDI-GS improves both the perceptual quality and geometric integrity of reconstructions, transforming them into robust digital assets, suitable for high-fidelity applications.

2026-01-14T08:53:11Z Sooyeun Yang Cheyul Im Jee Won Lee Jongseong Brad Choi http://arxiv.org/abs/2507.16869v3 Controllable Video Generation: A Survey 2026-01-19T03:14:11Z

With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.

2025-07-22T06:05:34Z project page: https://github.com/mayuelala/Awesome-Controllable-Video-Generation Yue Ma Kunyu Feng Zhongyuan Hu Xinyu Wang Yucheng Wang Mingzhe Zheng Bingyuan Wang Qinghe Wang Xuanhua He Hongfa Wang Chenyang Zhu Hongyu Liu Yingqing He Zeyu Wang Zhifeng Li Xiu Li Sirui Han Yike Guo Wei Liu Dan Xu Linfeng Zhang Qifeng Chen http://arxiv.org/abs/2508.05685v7 DogFit: Domain-guided Fine-tuning for Efficient Transfer Learning of Diffusion Models 2026-01-18T23:27:49Z

Transfer learning of diffusion models to smaller target domains is challenging, as naively fine-tuning the model often results in poor generalization. Test-time guidance methods help mitigate this by offering controllable improvements in image fidelity through a trade-off with sample diversity. However, this benefit comes at a high computational cost, typically requiring dual forward passes during sampling. We propose the Domain-guided Fine-tuning (DogFit) method, an effective guidance mechanism for diffusion transfer learning that maintains controllability without incurring additional computational overhead. DogFit injects a domain-aware guidance offset into the training loss, effectively internalizing the guided behavior during the fine-tuning process. The domain-aware design is motivated by our observation that during fine-tuning, the unconditional source model offers a stronger marginal estimate than the target model. To support efficient controllable fidelity-diversity trade-offs at inference, we encode the guidance strength value as an additional model input through a lightweight conditioning mechanism. We further investigate the optimal placement and timing of the guidance offset during training and propose two simple scheduling strategies, i.e., late-start and cut-off, which improve generation quality and training stability. Experiments on DiT and SiT backbones across six diverse target domains show that DogFit can outperform prior guidance methods in transfer learning in terms of FID and FDDINOV2 while requiring up to 2x fewer sampling TFLOPS.

2025-08-05T21:33:05Z Accepted for poster presentation at AAAI 2026 Yara Bahram Mohammadhadi Shateri Eric Granger