http://arxiv.org/abs/2510.23494v2
Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap
2026-01-22T13:52:11Z
Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.
2025-10-27T16:28:55Z
Elisabeth Jüttner, Janelle Pfeifer, Leona Krath, Stefan Korfhage, Hannah Dröge, Matthias B. Hullin, Markus Plack

http://arxiv.org/abs/2603.03287v1
Deep Sketch-Based 3D Modeling: A Survey
2026-01-22T03:22:00Z
In the past decade, advances in artificial intelligence have revolutionized sketch-based 3D modeling, leading to a new paradigm known as Deep Sketch-Based 3D Modeling (DS-3DM).
DS-3DM offers data-driven methods that address the long-standing challenges of sketch abstraction and ambiguity. DS-3DM keeps humans at the center of the creative process by enhancing the flexibility, usability, faithfulness, and adaptability of sketch-based 3D modeling interfaces. This paper contributes a comprehensive survey of the latest DS-3DM within a novel design space: MORPHEUS. Built upon the Input-Model-Output (IMO) framework, MORPHEUS categorizes Models outputting Options of 3D Representations and Parts, derived from Human inputs (varying in quantity and modality), and Evaluated across diverse User-views and Styles. Throughout MORPHEUS we highlight limitations and identify opportunities for interdisciplinary research in Computer Vision, Computer Graphics, and Human-Computer Interaction, revealing a need for controllability and information-rich outputs. These opportunities align design processes more closely with users' intent, responding to the growing importance of user-centered approaches.
2026-01-22T03:22:00Z
Alberto Tono, Jiajun Wu, Gordon Wetzstein, Iro Armeni, Hariharan Subramonyam, James Landay, Martin Fischer

http://arxiv.org/abs/2601.15431v1
SplatBus: A Gaussian Splatting Viewer Framework via GPU Interprocess Communication
2026-01-21T19:56:22Z
Radiance field-based rendering methods have attracted significant interest from the computer vision and computer graphics communities. They enable high-fidelity rendering with complex real-world lighting effects, but at the cost of high rendering time. 3D Gaussian Splatting solves this issue with a rasterisation-based approach for real-time rendering, enabling applications such as autonomous driving, robotics, virtual reality, and extended reality. However, current 3DGS implementations are difficult to integrate into traditional mesh-based rendering pipelines, which is a common use case for interactive applications and artistic exploration.
To address this limitation, this software solution uses Nvidia's interprocess communication (IPC) APIs to easily integrate into implementations and allow the results to be viewed in external clients such as Unity, Blender, Unreal Engine, and OpenGL viewers. The code is available at https://github.com/RockyXu66/splatbus.
2026-01-21T19:56:22Z
Yinghan Xu, Théo Morales, John Dingliana

http://arxiv.org/abs/2601.14844v1
CAG-Avatar: Cross-Attention Guided Gaussian Avatars for High-Fidelity Head Reconstruction
2026-01-21T10:22:53Z
Creating high-fidelity, real-time drivable 3D head avatars is a core challenge in digital animation. While 3D Gaussian Splatting (3D-GS) offers unprecedented rendering speed and quality, current animation techniques often rely on a "one-size-fits-all" global tuning approach, where all Gaussian primitives are uniformly driven by a single expression code. This simplistic approach fails to unravel the distinct dynamics of different facial regions, such as deformable skin versus rigid teeth, leading to significant blurring and distortion artifacts. We introduce Conditionally-Adaptive Gaussian Avatars (CAG-Avatar), a framework that resolves this key limitation. At its core is a Conditionally Adaptive Fusion Module built on cross-attention. This mechanism empowers each 3D Gaussian to act as a query, adaptively extracting relevant driving signals from the global expression code based on its canonical position. This "tailor-made" conditioning strategy drastically enhances the modeling of fine-grained, localized dynamics.
Our experiments confirm a significant improvement in reconstruction fidelity, particularly for challenging regions such as teeth, while preserving real-time rendering performance.
2026-01-21T10:22:53Z
Zhe Chang, Haodong Jin, Yan Song, Hui Yu

http://arxiv.org/abs/2601.14766v1
PAColorHolo: A Perceptually-Aware Color Management Framework for Holographic Displays
2026-01-21T08:43:28Z
Holographic displays offer significant potential for augmented and virtual reality applications by reconstructing wavefronts that enable continuous depth cues and natural parallax without vergence-accommodation conflict. However, despite advances in pixel-level image quality, current systems struggle to achieve perceptually accurate color reproduction--an essential component of visual realism. These challenges arise from complex system-level distortions caused by coherent laser illumination, spatial light modulator imperfections, chromatic aberrations, and camera-induced color biases. In this work, we propose a perceptually-aware color management framework for holographic displays that jointly addresses input-output color inconsistencies through color space transformation, adaptive illumination control, and neural network-based perceptual modeling of the camera's color response. We validate the effectiveness of our approach through numerical simulations, optical experiments, and a controlled user study. The results demonstrate substantial improvements in perceptual color fidelity, laying the groundwork for perceptually driven holographic rendering in future systems.
2026-01-21T08:43:28Z
Preprint (accepted to ACM TOG), 34 pages, 32 figures
Chun Chen, Minseok Chae, Seung-Woo Nam, Myeong-Ho Choi, Minseong Kim, Eunbi Lee, Yoonchan Jeong, Jae-Hyeung Park

http://arxiv.org/abs/2412.04827v3
PanoDreamer: Optimization-Based Single Image to 360 3D Scene With Diffusion
2026-01-21T01:27:09Z
In this paper, we present PanoDreamer, a novel method for producing a coherent 360° 3D scene from a single input image.
Unlike existing methods that generate the scene sequentially, we frame the problem as single-image panorama and depth estimation. Once the coherent panoramic image and its corresponding depth are obtained, the scene can be reconstructed by inpainting the small occluded regions and projecting them into 3D space. Our key contribution is formulating single-image panorama and depth estimation as two optimization tasks and introducing alternating minimization strategies to effectively solve their objectives. We demonstrate that our approach outperforms existing techniques in single-image 360° 3D scene reconstruction in terms of consistency and overall quality.
2024-12-06T07:42:48Z
SIGGRAPH Asia 2025, Project page: https://people.engr.tamu.edu/nimak/Papers/PanoDreamer, Code: https://github.com/avinashpaliwal/PanoDreamer
Avinash Paliwal, Xilong Zhou, Andrii Tsarov, Nima Khademi Kalantari
10.1145/3757377.3763883

http://arxiv.org/abs/2503.10860v2
RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors
2026-01-21T00:35:13Z
In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas.
To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.
2025-03-13T20:16:58Z
ICCV 2025, Project page: https://people.engr.tamu.edu/nimak/Papers/RI3D, Code: https://github.com/avinashpaliwal/RI3D
Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari

http://arxiv.org/abs/2601.14208v1
Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting
2026-01-20T18:13:03Z
Inspecting the undercarriage of used vehicles is a labor-intensive task that requires inspectors to crouch or crawl underneath each vehicle to thoroughly examine it. Additionally, online buyers rarely see undercarriage photos. We present an end-to-end pipeline that utilizes a three-camera rig to capture videos of the undercarriage as the vehicle drives over it, and produces an interactive 3D model of the undercarriage. The 3D model enables inspectors and customers to rotate, zoom, and slice through the undercarriage, allowing them to detect rust, leaks, or impact damage in seconds, thereby improving both workplace safety and buyer confidence. Our primary contribution is a rig-aware Structure-from-Motion (SfM) pipeline specifically designed to overcome the challenges of wide-angle lens distortion and low-parallax scenes.
Our method overcomes the challenges of wide-angle lens distortion and low-parallax scenes by integrating precise camera calibration, synchronized video streams, and strong geometric priors from the camera rig. We use a constrained matching strategy with learned components, the DISK feature extractor, and the attention-based LightGlue matcher to generate high-quality sparse point clouds that are often unattainable with standard SfM pipelines. These point clouds seed the Gaussian splatting process to generate photorealistic undercarriage models that render in real-time. Our experiments and ablation studies demonstrate that our design choices are essential to achieve state-of-the-art quality.
2026-01-20T18:13:03Z
8 pages, 9 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/
Nitin Kulkarni, Akhil Devarashetti, Charlie Cluss, Livio Forte, Dan Buckmaster, Philip Schneider, Chunming Qiao, Alina Vereshchaka

http://arxiv.org/abs/2601.13689v1
Criminator: An Easy-to-Use XR "Crime Animator" for Rapid Reconstruction and Analysis of Dynamic Crime Scenes
2026-01-20T07:43:48Z
Law enforcement authorities are increasingly interested in 3D modelling for virtual crime scene reconstruction, enabling offline analysis without the cost and contamination risk of on-site investigation. Past work has demonstrated spatial relationships through static modelling but validating the sequence of events in dynamic scenarios is crucial for solving a case. Yet, animation tools are not well suited to crime scene reconstruction, and complex for non-experts in 3D modelling/animation. Through a co-design process with criminology experts, we designed "Criminator"-a methodological framework and XR tool that simplifies animation authoring. We evaluated this tool with participants trained in criminology (n=6) and untrained individuals (n=12).
Both groups were able to successfully complete the character animation tasks and provided high usability ratings for observation tasks. Criminator has potential for hypothesis testing, demonstration, sense-making, and training. Challenges remain in how such a tool fits into the entire judicial process, with questions about including animations as evidence.
2026-01-20T07:43:48Z
Vahid Pooryousef, Lonni Besançon, Maxime Cordeil, Chris Flight, Alastair M Ross AM, Richard Bassed, Tim Dwyer
10.1145/3772318.3791210

http://arxiv.org/abs/2503.12052v3
A Text-to-3D Framework for Joint Generation of CG-Ready Humans and Compatible Garments
2026-01-19T18:32:27Z
Creating detailed 3D human avatars with fitted garments traditionally requires specialized expertise and labor-intensive workflows. While recent advances in generative AI have enabled text-to-3D human and clothing synthesis, existing methods fall short in offering accessible, integrated pipelines for generating CG-ready 3D avatars with physically compatible outfits; here we use the term CG-ready for models following a technical aesthetic common in computer graphics (CG) and adopt standard CG polygonal meshes and strands representations (rather than neural representations like NeRF and 3DGS) that can be directly integrated into conventional CG pipelines and support downstream tasks such as physical simulation. To bridge this gap, we introduce Tailor, an integrated text-to-3D framework that generates high-fidelity, customizable 3D avatars dressed in simulation-ready garments. Tailor consists of three stages. (1) Semantic Parsing: we employ a large language model to interpret textual descriptions and translate them into parameterized human avatars and semantically matched garment templates. (2) Geometry-Aware Garment Generation: we propose topology-preserving deformation with novel geometric losses to generate body-aligned garments under text control.
(3) Consistent Texture Synthesis: we propose a novel multi-view diffusion process optimized for garment texturing, which enforces view consistency, preserves photorealistic details, and optionally supports symmetric texture generation common in garments. Through comprehensive quantitative and qualitative evaluations, we demonstrate that Tailor outperforms state-of-the-art methods in fidelity, usability, and diversity. Our code will be released for academic use. Project page: https://human-tailor.github.io
2025-03-15T08:58:02Z
Project page: https://human-tailor.github.io
Zhiyao Sun, Yu-Hui Wen, Ho-Jui Fang, Sheng Ye, Matthieu Lin, Tian Lv, Yong-Jin Liu
10.1109/TVCG.2026.3668900

http://arxiv.org/abs/2305.14080v3
Eye-tracked Virtual Reality: A Comprehensive Survey on Methods and Privacy Challenges
2026-01-19T10:52:44Z
The latest developments in computer hardware, sensor technologies, and artificial intelligence can make virtual reality (VR) and virtual spaces an important part of human everyday life. Eye tracking offers not only a hands-free way of interaction but also the possibility of a deeper understanding of human visual attention and cognitive processes in VR. Despite these possibilities, eye-tracking data also reveals users' privacy-sensitive attributes when combined with the information about the presented stimulus. To address all, this survey first covers major works in eye tracking, VR, and privacy areas between 2012 and 2022. While eye tracking in VR part covers the computational eye tracking pipeline from pupil detection and gaze estimation to offline data analysis, for privacy and security, we focus on eye-based authentication as well as computational methods to preserve the privacy of individuals and their eye-tracking data in VR. Later, we outline three main directions by focusing on privacy.
In summary, this survey presents an extensive literature review of the utmost possibilities of eye tracking in VR and their privacy implications.
2023-05-23T14:02:38Z
Accepted for publication in the Proceedings of the IEEE
Efe Bozkir, Süleyman Özdel, Mengdi Wang, Brendan David-John, Hong Gao, Kevin Butler, Eakta Jain, Enkelejda Kasneci
10.1109/JPROC.2026.3653661

http://arxiv.org/abs/2509.00052v2
Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation
2026-01-19T08:05:12Z
Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, limiting practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework addressing these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), caching static features to bypass most model layers in inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers to bring extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.
2025-08-25T02:58:39Z
Jianzhi Long, Wenhao Sun, Rongcheng Tu, Dacheng Tao

http://arxiv.org/abs/2601.09291v2
TIDI-GS: Floater Suppression in 3D Gaussian Splatting for Enhanced Indoor Scene Fidelity
2026-01-19T03:49:13Z
3D Gaussian Splatting (3DGS) is a technique to create high-quality, real-time 3D scenes from images.
This method often produces visual artifacts known as floaters--nearly transparent, disconnected elements that drift in space away from the actual surface. This geometric inaccuracy undermines the reliability of these models for practical applications, which is critical. To address this issue, we introduce TIDI-GS, a new training framework designed to eliminate these floaters. A key benefit of our approach is that it functions as a lightweight plugin for the standard 3DGS pipeline, requiring no major architectural changes and adding minimal overhead to the training process. The core of our method is a floater pruning algorithm--TIDI--that identifies and removes floaters based on several criteria: their consistency across multiple viewpoints, their spatial relationship to other elements, and an importance score learned during training. The framework includes a mechanism to preserve fine details, ensuring that important high-frequency elements are not mistakenly removed. This targeted cleanup is supported by a monocular depth-based loss function that helps improve the overall geometric structure of the scene. Our experiments demonstrate that TIDI-GS improves both the perceptual quality and geometric integrity of reconstructions, transforming them into robust digital assets, suitable for high-fidelity applications.
2026-01-14T08:53:11Z
Sooyeun Yang, Cheyul Im, Jee Won Lee, Jongseong Brad Choi

http://arxiv.org/abs/2507.16869v3
Controllable Video Generation: A Survey
2026-01-19T03:14:11Z
With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent.
Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. 
For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.
2025-07-22T06:05:34Z
project page: https://github.com/mayuelala/Awesome-Controllable-Video-Generation
Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Bingyuan Wang, Qinghe Wang, Xuanhua He, Hongfa Wang, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Sirui Han, Yike Guo, Wei Liu, Dan Xu, Linfeng Zhang, Qifeng Chen

http://arxiv.org/abs/2508.05685v7
DogFit: Domain-guided Fine-tuning for Efficient Transfer Learning of Diffusion Models
2026-01-18T23:27:49Z
Transfer learning of diffusion models to smaller target domains is challenging, as naively fine-tuning the model often results in poor generalization. Test-time guidance methods help mitigate this by offering controllable improvements in image fidelity through a trade-off with sample diversity. However, this benefit comes at a high computational cost, typically requiring dual forward passes during sampling. We propose the Domain-guided Fine-tuning (DogFit) method, an effective guidance mechanism for diffusion transfer learning that maintains controllability without incurring additional computational overhead. DogFit injects a domain-aware guidance offset into the training loss, effectively internalizing the guided behavior during the fine-tuning process. The domain-aware design is motivated by our observation that during fine-tuning, the unconditional source model offers a stronger marginal estimate than the target model. To support efficient controllable fidelity-diversity trade-offs at inference, we encode the guidance strength value as an additional model input through a lightweight conditioning mechanism. We further investigate the optimal placement and timing of the guidance offset during training and propose two simple scheduling strategies, i.e., late-start and cut-off, which improve generation quality and training stability.
Experiments on DiT and SiT backbones across six diverse target domains show that DogFit can outperform prior guidance methods in transfer learning in terms of FID and FD_DINOV2 while requiring up to 2x fewer sampling TFLOPS.
2025-08-05T21:33:05Z
Accepted for poster presentation at AAAI 2026
Yara Bahram, Mohammadhadi Shateri, Eric Granger