https://arxiv.org/api/AwR4VAwF7Gxgkx0t6DpbHGaB/ng 2026-03-24T16:28:07Z 8847 15 15 http://arxiv.org/abs/2603.19753v1 ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination 2026-03-20T08:37:39Z

Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object's structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. In addition, with our mixed domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: https://reli3d.jdihlmann.com/

2026-03-20T08:37:39Z Project Page: https://reli3d.jdihlmann.com/ ICLR 2026 Jan-Niklas Dihlmann Mark Boss Simon Donne Andreas Engelhardt Hendrik P. A. Lensch Varun Jampani http://arxiv.org/abs/2506.20703v2 Generative Blocks World: Moving Things Around in Pictures 2026-03-20T03:36:38Z

We describe Generative Blocks World to interact with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method, which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding the texture-consistency provided by existing techniques. These texture hints (a) allow accurate object and camera moves and (b) preserve the identity of objects. Our experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization.

2025-06-25T17:59:55Z ICLR 2026 34 pages, 25 figures, 4 tables Vaibhav Vavilala Seemandhar Jain Rahul Vasanth D. A. Forsyth Anand Bhattad http://arxiv.org/abs/2603.19500v1 Teaching an Agent to Sketch One Part at a Time 2026-03-19T22:08:53Z

We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing agent with the visual feedback through the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.

2026-03-19T22:08:53Z Xiaodan Du Ruize Xu David Yunis Yael Vinker Greg Shakhnarovich http://arxiv.org/abs/2603.20310v1 GraphiContact: Pose-aware Human-Scene Robust Contact Perception for Interactive Systems 2026-03-19T17:17:04Z

Monocular vertex-level human-scene contact prediction is a fundamental capability for interactive systems such as assistive monitoring, embodied AI, and rehabilitation analysis. In this work, we study this task jointly with single-image 3D human mesh reconstruction, using reconstructed body geometry as a scaffold for contact reasoning. Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise. To address this gap, we propose GraphiContact, a pose-aware framework that transfers complementary human priors from two pretrained Transformer encoders and predicts per-vertex human-scene contact on the reconstructed mesh. To improve robustness in real-world scenarios, we further introduce a Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing, which simulates occlusion and noisy observations during training while preserving efficient single-branch inference at test time. Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction. Our code, based on the GraphiContact method, provides comprehensive 3D human reconstruction and interaction analysis, and will be publicly available at https://github.com/Aveiro-Lin/GraphiContact.

2026-03-19T17:17:04Z 15 pages, 9 figures, Accepted at ICME 2026 Xiaojian Lin Yaomin Shen Junyuan Ma Yujie Sun Chengqing Bu Wenxin Zhang Zongzheng Zhang Hao Fei Lei Jin Hao Zhao http://arxiv.org/abs/2603.19063v1 Fire as a Service: Augmenting Robot Simulators with Thermally and Visually Accurate Fire Dynamics 2026-03-19T15:55:02Z

Most existing robot simulators prioritize rigid-body dynamics and photorealistic rendering, but largely neglect the thermally and optically complex phenomena that characterize real-world fire environments. For robots envisioned as future firefighters, this limitation hinders both reliable capability evaluation and the generation of representative training data prior to deployment in hazardous scenarios. To address these challenges, we introduce Fire as a Service (FaaS), a novel, asynchronous co-simulation framework that augments existing robot simulators with high-fidelity and computationally efficient fire simulations. Our pipeline enables robots to experience accurate, multi-species thermodynamic heat transfer and visually consistent volumetric smoke without disrupting high-frequency rigid-body control loops. We demonstrate that our framework can be integrated with diverse robot simulators to generate physically accurate fire behavior, benchmark thermal hazards encountered by robotic platforms, and collect realistic multimodal perceptual data. Crucially, its real-time performance supports human-in-the-loop teleoperation, enabling the successful training of reactive, multimodal policies via Behavioral Cloning. By adding fire dynamics to robot simulations, FaaS provides a scalable pathway toward safer, more reliable deployment of robots in fire scenarios.

2026-03-19T15:55:02Z Anton R. Wagner Madhan Balaji Rao Helge Wrede Sören Pirk Xuesu Xiao http://arxiv.org/abs/2603.19053v1 SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation 2026-03-19T15:47:43Z

Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision- language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

2026-03-19T15:47:43Z CVPR 2026 Phuc Pham Uy Dieu Tran Binh-Son Hua Phong Nguyen http://arxiv.org/abs/2510.20558v2 From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail 2026-03-19T15:04:19Z

In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation, including geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.

2025-10-23T13:39:18Z Xiaohan Sun Carol O'Sullivan http://arxiv.org/abs/2512.09162v3 GTAvatar: Bridging Gaussian Splatting and Texture Mapping for Relightable and Editable Gaussian Avatars 2026-03-19T12:23:13Z

Recent advancements in Gaussian Splatting have enabled increasingly accurate reconstruction of photorealistic head avatars, opening the door to numerous applications in visual effects, videoconferencing, and virtual reality. This, however, comes with the lack of intuitive editability offered by traditional triangle mesh-based methods. In contrast, we propose a method that combines the accuracy and fidelity of 2D Gaussian Splatting with the intuitiveness of UV texture mapping. By embedding each canonical Gaussian primitive's local frame into a patch in the UV space of a template mesh in a computationally efficient manner, we reconstruct continuous editable material head textures from a single monocular video on a conventional UV domain. Furthermore, we leverage an efficient physically based reflectance model to enable relighting and editing of these intrinsic material maps. Through extensive comparisons with state-of-the-art methods, we demonstrate the accuracy of our reconstructions, the quality of our relighting results, and the ability to provide intuitive controls for modifying an avatar's appearance and geometry via texture mapping without additional optimization.

2025-12-09T22:19:28Z Accepted to Eurographics 2026. Project page: https://kelianb.github.io/GTAvatar/ Kelian Baert Mae Younes Francois Bourel Marc Christie Adnane Boukhayma 10.1111/cgf.70351 http://arxiv.org/abs/2603.18707v1 From ex(p) to poly: Gaussian Splatting with Polynomial Kernels 2026-03-19T10:05:38Z

Recent advancements in Gaussian Splatting (3DGS) have introduced various modifications to the original kernel, resulting in significant performance improvements. However, many of these kernel changes are incompatible with existing datasets optimized for the original Gaussian kernel, presenting a challenge for widespread adoption. In this work, we address this challenge by proposing an alternative kernel that maintains compatibility with existing datasets while improving computational efficiency. Specifically, we replace the original exponential kernel with a polynomial approximation combined with a ReLU function. This modification allows for more aggressive culling of Gaussians, leading to enhanced performance across different 3DGS implementations. Our results show a notable performance improvement of 4 to 15% with negligible impact on image quality. We also provide a detailed mathematical analysis of the new kernel and discuss its potential benefits for 3DGS implementations on NPU hardware.

2026-03-19T10:05:38Z Joerg H. Mueller Martin Winter Markus Steinberger http://arxiv.org/abs/2510.02691v3 FSFSplatter: Build Surface and Novel Views with Sparse-Views within 2min 2026-03-19T08:27:28Z

Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surface due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on widely used DTU, Replica, and BlendedMVS datasets.

2025-10-03T03:17:00Z Yibin Zhao Yihan Pan Jun Nan Liwei Chen Jianjun Yi http://arxiv.org/abs/2412.10488v4 SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers 2026-03-19T07:29:34Z

Scalable Vector Graphics (SVG) are essential XML-based formats for versatile graphics, offering resolution independence and scalability. Unlike raster images, SVGs use geometric shapes and support interactivity, animation, and manipulation via CSS and JavaScript. Current SVG generation methods face challenges related to high computational costs and complexity. In contrast, human designers use component-based tools for efficient SVG creation. Inspired by this, SVGBuilder introduces a component-based, autoregressive model for generating high-quality colored SVGs from textual input. It significantly reduces computational overhead and improves efficiency compared to traditional methods. Our model generates SVGs up to 604 times faster than optimization-based approaches. To address the limitations of existing SVG datasets and support our research, we introduce ColorSVG-100K, the first large-scale dataset of colored SVGs, comprising 100,000 graphics. This dataset fills the gap in color information for SVG generation models and enhances diversity in model training. Evaluation against state-of-the-art models demonstrates SVGBuilder's superior performance in practical applications, highlighting its efficiency and quality in generating complex SVG graphics.

2024-12-13T15:24:11Z Accepted by AAAI 2025. Project: https://svgbuilder.github.io Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39(3), 2358-2366 Zehao Chen Rong Pan 10.1609/aaai.v39i3.32236 http://arxiv.org/abs/2507.02861v3 LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans 2026-03-19T02:04:27Z

We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c

2025-07-03T17:59:55Z Project Page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c&feature=youtu.be Camera-Ready Version Zhening Huang Xiaoyang Wu Fangcheng Zhong Hengshuang Zhao Matthias Nießner Joan Lasenby http://arxiv.org/abs/2603.17995v1 LoST: Level of Semantics Tokenization for 3D Shapes 2026-03-18T17:56:06Z

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

2026-03-18T17:56:06Z CVPR 2026; Project website-- https://lost3d.github.io Niladri Shekhar Dutt Zifan Shi Paul Guerrero Chun-Hao Paul Huang Duygu Ceylan Niloy J. Mitra Xuelin Chen http://arxiv.org/abs/2506.13212v3 Volumetric Functional Maps 2026-03-18T14:20:18Z

Computing volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the surface-based functional maps framework. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer and solid texturing. Finally, we show that the volumetric spectrum greatly improves the accuracy for classical shape matching tasks among surfaces, consistently outperforming surface-only spectral methods.

2025-06-16T08:13:57Z Filippo Maggioli Simone Melzi Marco Livesu http://arxiv.org/abs/2603.17704v1 DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies 2026-03-18T13:23:20Z

Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy-animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.

2026-03-18T13:23:20Z Accepted to CHI2026 Haocheng Yuan Adrien Bousseau Hao Pan Lei Zhong Changjian Li