https://arxiv.org/api/WEdwPkSPm/Kro/3XwXkmOUsXph4 2026-06-25T05:13:14Z 9383 1140 15 http://arxiv.org/abs/2512.18314v1 MatSpray: Fusing 2D Material World Knowledge on 3D Geometry 2025-12-20T10:58:45Z

Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.

2025-12-20T10:58:45Z Project page: https://matspray.jdihlmann.com/ Philipp Langsteiner Jan-Niklas Dihlmann Hendrik P. A. Lensch http://arxiv.org/abs/2508.05899v2 HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing 2025-12-19T05:39:09Z

3D scene generation plays a crucial role in gaming, artistic creation, virtual reality, and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. To address those challenges, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. Then, HOLODECK 2.0 iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Both human and model evaluations demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, HOLODECK 2.0 provides editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling to generate visually rich and immersive environments that can boost efficiency in game design.

2025-08-07T23:23:07Z Zixuan Bian Ruohan Ren Yue Yang Chris Callison-Burch http://arxiv.org/abs/2512.16896v1 Sceniris: A Fast Procedural Scene Generation Framework 2025-12-18T18:55:03Z

Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at https://github.com/rai-inst/sceniris

2025-12-18T18:55:03Z Code is available at https://github.com/rai-inst/sceniris Jinghuan Shang Harsh Patel Ran Gong Karl Schmeckpeper http://arxiv.org/abs/2512.16706v1 SDFoam: Signed-Distance Foam for explicit surface reconstruction 2025-12-18T16:11:18Z

Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.

2025-12-18T16:11:18Z Antonella Rech Nicola Conci Nicola Garau http://arxiv.org/abs/2512.16678v1 The stationary focus of the Kiepert parabola over a special Poncelet triangle family 2025-12-18T15:47:22Z

We show that the focus of the Kiepert in-parabola remains stationary over a family of circle-inscribed Poncelet triangles which contain an equilateral triangle.

2025-12-18T15:47:22Z 6 pages, 5 figures Mark Helman Ronaldo A. Garcia Dan Reznik http://arxiv.org/abs/2512.16670v1 FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering 2025-12-18T15:41:08Z

Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming sets ups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the models own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporal consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.

2025-12-18T15:41:08Z Project Page: https://framediffuser.jdihlmann.com/ Ole Beisswenger Jan-Niklas Dihlmann Hendrik P. A. Lensch http://arxiv.org/abs/2512.16511v1 Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images 2025-12-18T13:23:49Z

Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.

2025-12-18T13:23:49Z Hossein Javidnia http://arxiv.org/abs/2512.16397v1 Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture 2025-12-18T10:53:51Z

We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.

2025-12-18T10:53:51Z Submitted to CVPR 2026. 21 pages, 22 figures Haodi He Jihun Yu Ronald Fedkiw http://arxiv.org/abs/2511.05020v2 DAFM: Dynamic Adaptive Fusion for Multi-Model Collaboration in Composed Image Retrieval 2025-12-18T02:28:05Z

Composed Image Retrieval (CIR) is a cross-modal task that aims to retrieve target images from large-scale databases using a reference image and a modification text. Most existing methods rely on a single model to perform feature fusion and similarity matching. However, this paradigm faces two major challenges. First, one model alone can't see the whole picture and the tiny details at the same time; it has to handle different tasks with the same weights, so it often misses the small but important links between image and text. Second, the absence of dynamic weight allocation prevents adaptive leveraging of complementary model strengths, so the resulting embedding drifts away from the target and misleads the nearest-neighbor search in CIR. To address these limitations, we propose Dynamic Adaptive Fusion (DAFM) for multi-model collaboration in CIR. Rather than optimizing a single method in isolation, DAFM exploits the complementary strengths of heterogeneous models and adaptively rebalances their contributions. This not only maximizes retrieval accuracy but also ensures that the performance gains are independent of the fusion order, highlighting the robustness of our approach. Experiments on the CIRR and FashionIQ benchmarks demonstrate consistent improvements. Our method achieves a Recall@10 of 93.21 and an Rmean of 84.43 on CIRR, and an average Rmean of 67.48 on FashionIQ, surpassing recent strong baselines by up to 4.5%. These results confirm that dynamic multi-model collaboration provides an effective and general solution for CIR.

2025-11-07T06:51:10Z We discovered an error that affects the main conclusions, so we decided to withdraw the paper Yawei Cai Jiapeng Mi Nan Ji Haotian Rong Yawei Zhang Zhangti Li Wenbin Guo Rensong Xie http://arxiv.org/abs/2512.15985v1 Hierarchical Neural Surfaces for 3D Mesh Compression 2025-12-17T21:32:04Z

Implicit Neural Representations (INRs) have been demonstrated to achieve state-of-the-art compression of a broad range of modalities such as images, videos, 3D surfaces, and audio. Most studies have focused on building neural counterparts of traditional implicit representations of 3D geometries, such as signed distance functions. However, the triangle mesh-based representation of geometry remains the most widely used representation in the industry, while building INRs capable of generating them has been sparsely studied. In this paper, we present a method for building compact INRs of zero-genus 3D manifolds. Our method relies on creating a spherical parameterization of a given 3D mesh - mapping the surface of a mesh to that of a unit sphere - then constructing an INR that encodes the displacement vector field defined continuously on its surface that regenerates the original shape. The compactness of our representation can be attributed to its hierarchical structure, wherein it first recovers the coarse structure of the encoded surface before adding high-frequency details to it. Once the INR is computed, 3D meshes of arbitrary resolution/connectivity can be decoded from it. The decoding can be performed in real time while achieving a state-of-the-art trade-off between reconstruction quality and the size of the compressed representations.

2025-12-17T21:32:04Z Sai Karthikey Pentapati Gregoire Phillips Alan Bovik http://arxiv.org/abs/2512.15711v1 Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering 2025-12-17T18:58:50Z

We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.

2025-12-17T18:58:50Z Tech report Divam Gupta Anuj Pahuja Nemanja Bartolovic Tomas Simon Forrest Iandola Giljoo Nam http://arxiv.org/abs/2505.04050v3 TerraFusion: Joint Generation of Terrain Geometry and Texture Using Latent Diffusion Models 2025-12-17T06:05:23Z

3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures.

2025-05-07T01:41:12Z Kazuki Higo Toshiki Kanai Yuki Endo Yoshihiro Kanamori http://arxiv.org/abs/2602.06044v1 COSMOS: Coherent Supergaussian Modeling with Spatial Priors for Sparse-View 3D Splatting 2025-12-17T04:55:15Z

3D Gaussian Splatting (3DGS) has recently emerged as a promising approach for 3D reconstruction, providing explicit, point-based representations and enabling high-quality real time rendering. However, when trained with sparse input views, 3DGS suffers from overfitting and structural degradation, leading to poor generalization on novel views. This limitation arises from its optimization relying solely on photometric loss without incorporating any 3D structure priors. To address this issue, we propose Coherent supergaussian Modeling with Spatial Priors (COSMOS). Inspired by the concept of superpoints from 3D segmentation, COSMOS introduces 3D structure priors by newly defining supergaussian groupings of Gaussians based on local geometric cues and appearance features. To this end, COSMOS applies inter group global self-attention across supergaussian groups and sparse local attention among individual Gaussians, enabling the integration of global and local spatial information. These structure-aware features are then used for predicting Gaussian attributes, facilitating more consistent 3D reconstructions. Furthermore, by leveraging supergaussian-based grouping, COSMOS enforces an intra-group positional regularization to maintain structural coherence and suppress floaters, thereby enhancing training stability under sparse-view conditions. Our experiments on Blender and DTU show that COSMOS surpasses state-of-the-art methods in sparse-view settings without any external depth supervision.

2025-12-17T04:55:15Z Chaeyoung Jeong Kwangsu Kim http://arxiv.org/abs/2605.08086v1 Representations of 3D Rotations: Mathematical Foundations and Comparative Analysis 2025-12-17T02:56:03Z

Rotation representations are foundational in fields such as computer graphics, robotics, and machine learning, where precise and efficient modeling of 3D orientations is critical. This paper comprehensively investigates diverse representations of the special orthogonal group $SO(3)$, such as Euler angles, axis-angle vectors, quaternions, rotation matrices, exponential maps, and emerging continuous and probabilistic methods, evaluating their mathematical formulations, continuity, susceptibility to gimbal lock, computational efficiency, storage requirements, interpolation properties, and composition operations, while integrating detailed algebraic insights with practical applications in fields like animation, pose estimation, inertial navigation, 3D shape registration, and neural networks. Empirical evidence highlights quaternions' dominance due to their compactness and computational efficiency, while alternatives like 6D continuous representations and matrix Fisher distributions provide enhanced continuity and uncertainty modeling. Future research could explore hybrid methods and thorough large-scale evaluations to help build a solid foundation for improving rotation representation techniques.

2025-12-17T02:56:03Z Keywords: Rotation representations, $SO(3)$, quaternions, continuity, gimbal lock, 3D shape registration Aizierjiang Aiersilan Haochen Liu James Hahn http://arxiv.org/abs/2512.14590v1 Inverse obstacle scattering regularized by the tangent-point energy 2025-12-16T16:57:04Z

We employ the so-called tangent-point energy as Tikhonov regularizer for ill-conditioned inverse scattering problems in 3D. The tangent-point energy is a self-avoiding functional on the space of embedded surfaces that also penalizes surface roughness. Moreover, it features nice compactness and continuity properties. These allow us to show the well-posedness of the regularized problems and the convergence of the regularized solutions to the true solution in the limit of vanishing noise level. We also provide a reconstruction algorithm of iteratively regularized Gauss-Newton type. Our numerical experiments demonstrate that our method is numerically feasible and effective in producing reconstructions of unprecedented quality.

2025-12-16T16:57:04Z 46 pages, 13 figures, 4 tables Henrik Schumacher Jannik Rönsch Thorsten Hohage Max Wardetzky