https://arxiv.org/api/QM5tg09XeuTqPt1Tzjy89Zh8mxU2026-06-25T21:27:25Z9383135015http://arxiv.org/abs/2511.00362v1Oitijjo-3D: Generative AI Framework for Rapid 3D Heritage Reconstruction from Street View Imagery2025-11-01T02:09:26ZCultural heritage restoration in Bangladesh faces a dual challenge of limited resources and scarce technical expertise. Traditional 3D digitization methods, such as photogrammetry or LiDAR scanning, require expensive hardware, expert operators, and extensive on-site access, which are often infeasible in developing contexts. As a result, many of Bangladesh's architectural treasures, from the Paharpur Buddhist Monastery to Ahsan Manzil, remain vulnerable to decay and inaccessible in digital form. This paper introduces Oitijjo-3D, a cost-free generative AI framework that democratizes 3D cultural preservation. By using publicly available Google Street View imagery, Oitijjo-3D reconstructs faithful 3D models of heritage structures through a two-stage pipeline - multimodal visual reasoning with Gemini 2.5 Flash Image for structure-texture synthesis, and neural image-to-3D generation through Hexagen for geometry recovery. The system produces photorealistic, metrically coherent reconstructions in seconds, achieving significant speedups compared to conventional Structure-from-Motion pipelines, without requiring any specialized hardware or expert supervision. Experiments on landmarks such as Ahsan Manzil, Choto Sona Mosque, and Paharpur demonstrate that Oitijjo-3D preserves both visual and structural fidelity while drastically lowering economic and technical barriers. By turning open imagery into digital heritage, this work reframes preservation as a community-driven, AI-assisted act of cultural continuity for resource-limited nations.2025-11-01T02:09:26Z6 Pages, 4 figures, 2 Tables, Submitted to ICECTE 2026Momen Khandoker OpeAkif IslamMohd Ruhul AmeenAbu Saleh Musa MiahMd Rashedul IslamJungpil Shinhttp://arxiv.org/abs/2511.00248v1Object-Aware 4D Human Motion Generation2025-10-31T20:40:17ZRecent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.2025-10-31T20:40:17ZShurui GuiDeep Anil PatelXiner LiMartin Renqiang Minhttp://arxiv.org/abs/2510.04539v2C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing2025-10-31T16:06:19ZExisting 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.2025-10-06T07:07:14ZICCV 2025 Workshop Wild3DZeng TaoZheng DingZeyuan ChenXiang ZhangLeizhi LiZhuowen Tuhttp://arxiv.org/abs/2510.27533v1Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds2025-10-31T15:05:43ZThe protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.2025-10-31T15:05:43ZPrint ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 17-30, Vol. 9, No. 4, 1 October 2025Khandoker Ashik Uz ZamanMohammad Zahangir AlamMohammed N. M. AliMahdi H. Miraz10.33166/AETiC.2025.04.002http://arxiv.org/abs/2511.21697v1Detail Enhanced Gaussian Splatting for Large-Scale Volumetric Capture2025-10-31T14:36:08ZWe present a unique system for large-scale, multi-performer, high resolution 4D volumetric capture providing realistic free-viewpoint video up to and including 4K resolution facial closeups. To achieve this, we employ a novel volumetric capture, reconstruction and rendering pipeline based on Dynamic Gaussian Splatting and Diffusion-based Detail Enhancement. We design our pipeline specifically to meet the demands of high-end media production. We employ two capture rigs: the Scene Rig, which captures multi-actor performances at a resolution which falls short of 4K production quality, and the Face Rig, which records high-fidelity single-actor facial detail to serve as a reference for detail enhancement. We first reconstruct dynamic performances from the Scene Rig using 4D Gaussian Splatting, incorporating new model designs and training strategies to improve reconstruction, dynamic range, and rendering quality. Then to render high-quality images for facial closeups, we introduce a diffusion-based detail enhancement model. This model is fine-tuned with high-fidelity data from the same actors recorded in the Face Rig. We train on paired data generated from low- and high-quality Gaussian Splatting (GS) models, using the low-quality input to match the quality of the Scene Rig, with the high-quality GS as ground truth. Our results demonstrate the effectiveness of this pipeline in bridging the gap between the scalable performance capture of a large-scale rig and the high-resolution standards required for film and media production.2025-10-31T14:36:08Z10 pages, Accepted as a Journal paper at Siggraph Asia 2025. Webpage: https://eyeline-labs.github.io/DEGS/Julien PhilipLi MaPascal ClausenWenqi XianAhmet Levent TaşelMingming HeXueming YuDavid M. GeorgeNing YuOliver PilarskiPaul Debevec10.1145/3763336http://arxiv.org/abs/2511.02852v1Real-Time Interactive Hybrid Ocean: Spectrum-Consistent Wave Particle-FFT Coupling2025-10-31T05:03:21ZFast Fourier Transform-based (FFT) spectral oceans are widely adopted for their efficiency and large-scale realism, but they assume global stationarity and spatial homogeneity, making it difficult to represent non-uniform seas and near-field interactions (e.g., ships and floaters). In contrast, wave particles capture local wakes and ripples, yet are costly to maintain at scale and hard to match global spectral statistics.We present a real-time interactive hybrid ocean: a global FFT background coupled with local wave-particle (WP) patch regions around interactive objects, jointly driven under a unified set of spectral parameters and dispersion. At patch boundaries, particles are injected according to the same directional spectrum as the FFT, aligning the local frequency-direction distribution with the background and matching energy density, without disturbing the far field.Our approach introduces two main innovations: (1) Hybrid ocean representation. We couple a global FFT background with local WP patches under a unified spectrum, achieving large-scale spectral consistency while supporting localized wakes and ripples.(2) Frequency-bucketed implementation. We design a particle sampling and GPU-parallel synthesis scheme based on frequency buckets, which preserves spectral energy consistency and sustains real-time interactive performance.Together, these innovations enable a unified framework that delivers both large-scale spectral realism and fine-grained interactivity in real time.2025-10-31T05:03:21ZShengze XueYu RenJiacheng HongRun NiShuangjiu XiaoDeli Donghttp://arxiv.org/abs/2510.26800v1OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes2025-10-30T17:59:51ZThere are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.2025-10-30T17:59:51ZProject page: https://yukun-huang.github.io/OmniX/Yukun HuangJiwen YuYanning ZhouJianan WangXintao WangPengfei WanXihui Liuhttp://arxiv.org/abs/2510.26786v1HEIR: Learning Graph-Based Motion Hierarchies2025-10-30T17:57:40ZHierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/2025-10-30T17:57:40ZCode link: https://github.com/princeton-computational-imaging/HEIRAdvances in Neural Information Processing Systems 38 (NeurIPS 2025)Cheng ZhengWilliam KochBaiang LiFelix Heidehttp://arxiv.org/abs/2510.26694v1The Impact and Outlook of 3D Gaussian Splatting2025-10-30T17:01:18ZSince its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.2025-10-30T17:01:18ZArticle written for Frontiers of Science Award, International Congress on Basic Science, 2025Bernhard Kerblhttp://arxiv.org/abs/2503.22159v3Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS2025-10-30T10:00:04ZWhile dynamic novel view synthesis from 2D videos has seen progress, achieving efficient reconstruction and rendering of dynamic scenes remains a challenging task. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering pipeline that achieves real-time performance without compromising visual fidelity. Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians, avoiding the need for slicing first and four-dimensional matrix calculations in prior methods. By projecting temporal and spatial deformations into dynamic 2D Gaussians and deferring temporal processing, we minimize redundant computations of 4DGS. Our approach also features a gradient-guided flow loss and temporal splitting strategy to reduce artifacts. Experiments demonstrate a significant improvement in rendering speed and quality, achieving 343 FPS when render 1352*1014 resolution images on a single RTX3090 while reducing storage requirements by at least 4.5%. Our approach sets a new benchmark for dynamic novel view synthesis, outperforming existing methods on both multi-view and monocular dynamic scene datasets.2025-03-28T05:46:02ZHao FengHao SunWei XieZhi ZuoZhengzhe Liuhttp://arxiv.org/abs/2510.26265v1Look at That Distractor: Dynamic Translation Gain under Low Perceptual Load in Virtual Reality2025-10-30T08:45:02ZRedirected walking utilizes gain adjustments within perceptual thresholds to allow natural navigation in large scale virtual environments within confined physical environments. Previous research has found that when users are distracted by some scene elements, they are less sensitive to gain values. However, the effects on detection thresholds have not been quantitatively measured. In this paper, we present a novel method that dynamically adjusts translation gain by leveraging visual distractors. We place distractors within the user's field of view and apply a larger translation gain when their attention is drawn to them. Because the magnitude of gain adjustment depends on the user's level of engagement with the distractors, the redirection process remains smooth and unobtrusive. To evaluate our method, we developed a task oriented virtual environment for a user study. Results show that introducing distractors in the virtual environment significantly raises users' translation gain thresholds. Furthermore, assessments using the Simulator Sickness Questionnaire and Igroup Presence Questionnaire indicate that the method maintains user comfort and acceptance, supporting its effectiveness for RDW systems.2025-10-30T08:45:02ZLing-Long ZouQiang TongEr-Xia LuoSen-Zhe XuSong-Hai ZhangFang-Lue Zhanghttp://arxiv.org/abs/2510.26141v1StructLayoutFormer:Conditional Structured Layout Generation via Structure Serialization and Disentanglement2025-10-30T04:52:12ZStructured layouts are preferable in many 2D visual contents (\eg, GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer} {https://github.com/Teagrus/StructLayoutFormer}.2025-10-30T04:52:12ZXin HuPengfei XuJin ZhouHongbo FuHui Huanghttp://arxiv.org/abs/2503.00868v23D Dynamic Fluid Assets from Single-View Videos with Generative Gaussian Splatting2025-10-30T01:56:10ZWhile the generation of 3D content from single-view images has been extensively studied, the creation of physically consistent 3D dynamic scenes from videos remains in its early stages. We propose a novel framework leveraging generative 3D Gaussian Splatting (3DGS) models to extract and re-simulate 3D dynamic fluid objects from single-view videos using simulation methods. The fluid geometry represented by 3DGS is initially generated and optimized from single-view images, then denoised, densified, and aligned across frames. We estimate the fluid surface velocity using optical flow, propose a mainstream extraction algorithm to refine it. The 3D volumetric velocity field is then derived from the velocity of the fluid's enclosed surface. The velocity field is therewith converted into a divergence-free, grid-based representation, enabling the optimization of simulation parameters through its differentiability across frames. This process outputs simulation-ready fluid assets with physical dynamics closely matching those observed in the source video. Our approach is applicable to various liquid fluids, including inviscid and viscous types, and allows users to edit the output geometry or extend movement durations seamlessly. This automatic method for creating 3D dynamic fluid assets from single-view videos, easily obtainable from the internet, shows great potential for generating large-scale 3D fluid assets at a low cost.2025-03-02T12:08:19ZZhiwei ZhaoAlan ZhaoMinchen LiYixin Huhttp://arxiv.org/abs/2510.13587v2HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans2025-10-29T14:24:23ZWe present HRM$^2$Avatar, a framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with smartphones provides a low-cost alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses challenges due to limited visual and geometric data. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. We extract garment meshes from monocular data to model clothing deformations effectively, and attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting. This representation efficiently learns high-resolution and dynamic information from monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for animating high-fidelity avatars in AR/VR, social gaming, and on-device creation. Our GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show that HRM$^2$Avatar delivers superior visual realism and real-time interactivity, outperforming state-of-the-art monocular methods.2025-10-15T14:18:20ZSIGGRAPH Asia 2025, Project Page: https://acennr-engine.github.io/HRM2AvatarChao ShiShenghao JiaJinhui LiuYong ZhangLiangchao ZhuZhonglei YangJinze MaChaoyue NiuChengfei Lv10.1145/3757377.3763894http://arxiv.org/abs/2510.25319v14-Doodle: Text to 3D Sketches that Move!2025-10-29T09:33:29ZWe present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target sparse, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable Bézier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, critical for sparse sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.2025-10-29T09:33:29ZHao ChenJiaqi WangYonggang QiKe LiKaiyue PangYi-Zhe Song