http://arxiv.org/abs/2603.16853v1 BrickSim: A Physics-Based Simulator for Manipulating Interlocking Brick Assemblies 2026-03-17T17:56:53Z

Interlocking brick assemblies provide a standardized yet challenging testbed for contact-rich and long-horizon robotic manipulation, but existing rigid-body simulators do not faithfully capture snap-fit mechanics. We present BrickSim, the first real-time physics-based simulator for interlocking brick assemblies. BrickSim introduces a compact force-based mechanics model for snap-fit connections and solves the resulting internal force distribution using a structured convex quadratic program. Combined with a hybrid architecture that delegates rigid-body dynamics to the underlying physics engine while handling snap-fit mechanics separately, BrickSim enables real-time, high-fidelity simulation of assembly, disassembly, and structural collapse. On 150 real-world assemblies, BrickSim achieves 100% accuracy in static stability prediction with an average solve time of 5 ms. In dynamic drop tests, it also faithfully reproduces real-world structural collapse, precisely mirroring both the occurrence of breakage and the specific breakage locations. Built on Isaac Sim, BrickSim further supports seamless integration with a wide variety of robots and existing pipelines. We demonstrate robotic construction of brick assemblies using BrickSim, highlighting its potential as a foundation for research in dexterous, long-horizon robotic manipulation. BrickSim is open-source, and the code is available at https://github.com/intelligent-control-lab/BrickSim.

2026-03-17T17:56:53Z 9 pages, 9 figures Haowei Wen Ruixuan Liu Weiyi Piao Siyu Li Changliu Liu http://arxiv.org/abs/2603.16612v1 Retrieval-Augmented Sketch-Guided 3D Building Generation 2026-03-17T14:51:52Z

In the early design stage of Japanese detached houses, the lack of a unified design representation among clients, sales representatives, and designers leads to design drift and inefficient feedback. Sketches handed off by sales representatives are usually drawn quickly and often omit details, which reduces the fidelity of subsequent 3D generation with generative AI models. Moreover, the generated 3D model typically takes the form of a single unified mesh, preventing component-level editing. To solve these issues, we propose a multi-stage 3D generative design framework capable of producing architectural models from rough design sketches. The framework combines generative and retrieval-based methods to enable component-level editing and personalized customization. It adopts a multimodal representation for 3D model generation, applies component segmentation to localize architectural components such as windows and doors, and uses retrieval to support targeted replacement of components. Experiments show that the framework enables modular customization that is considered suitable for personalized architectural design. This work introduces a multi-stage sketch-to-3D framework for Japanese detached houses, provides facade and component datasets, and shows effectiveness through quantitative and expert evaluations.

2026-03-17T14:51:52Z 10 pages, 4 figures, Proceedings of CAADRIA 2026 Zhengyang Wang Nuttapong Rochanavibhata Yuxiao Ren Xusheng Du Ye Zhang Haoran Xie http://arxiv.org/abs/2603.16566v1 VideoMatGen: PBR Materials through Joint Generative Modeling 2026-03-17T14:24:20Z

We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder that encodes multiple material modalities into a compact latent space, enabling joint generation of multiple modalities without increasing the number of tokens. Given a text prompt, our pipeline generates high-quality materials for 3D shapes that are compatible with common content creation tools.

2026-03-17T14:24:20Z Jon Hasselgren Zheng Zeng Milos Hasan Jacob Munkberg http://arxiv.org/abs/2512.11237v2 WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering 2026-03-17T13:12:15Z

Existing methods achieve high-quality facial albedo capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial albedo capture from a smartphone video recorded in the wild. To disentangle high-quality albedo from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. We first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for the albedo map and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Other reflectance maps are then predicted from the albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin.

2025-12-12T02:37:03Z CVPR 2026. project page: https://yxuhan.github.io/WildCap/index.html; code: https://github.com/yxuhan/WildCap Yuxuan Han Xin Ming Tianxiao Li Zhuofan Shen Qixuan Zhang Lan Xu Feng Xu http://arxiv.org/abs/2603.16447v1 ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars 2026-03-17T12:30:27Z

In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.

2026-03-17T12:30:27Z Accepted to CVPR 2026, Project page: https://ustc3dv.github.io/ProgressiveAvatars/ Kaiwen Song Jinkai Cui Juyong Zhang http://arxiv.org/abs/2511.02580v2 TAUE: Training-free Noise Transplant and Cultivation Diffusion Model 2026-03-17T10:21:32Z

Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for layer-wise image generation that requires neither fine-tuning nor additional data. TAUE embeds global structural information from intermediate denoising latents into the initial noise to preserve spatial coherence, and integrates semantic cues through cross-layer attention sharing to maintain contextual and visual consistency across layers. Extensive experiments demonstrate that TAUE achieves state-of-the-art performance among training-free methods, delivering image quality comparable to fine-tuned models while improving inter-layer consistency. Moreover, it enables new applications, such as layout-aware editing, multi-object composition, and background replacement, indicating potential for interactive, layer-separated generation systems in real-world creative workflows.

2025-11-04T13:56:39Z Accepted to CVPR 2026 Findings. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE Daichi Nagai Ryugo Morita Shunsuke Kitada Hitoshi Iyatomi http://arxiv.org/abs/2603.16103v1 NanoGS: Training-Free Gaussian Splat Simplification 2026-03-17T03:58:02Z

3D Gaussian Splatting (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass-preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at https://saliteta.github.io/NanoGS/.
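
To make the pairwise merge step concrete, here is a minimal sketch of standard moment matching for a two-component Gaussian mixture. It is not NanoGS's implementation: the abstract does not say how the "mass" of a splat is defined or how appearance attributes are merged, so the scalar weights `w1` and `w2` and the function name `merge_gaussians` are assumptions made for illustration only.

```python
from typing import List, Tuple

Vec = List[float]
Mat = List[List[float]]


def merge_gaussians(w1: float, mu1: Vec, cov1: Mat,
                    w2: float, mu2: Vec, cov2: Mat) -> Tuple[float, Vec, Mat]:
    """Replace two weighted Gaussians by one with the same mass, mean, and covariance.

    The weights are generic positive masses (hypothetical; the paper's
    definition of mass is not given in the abstract).
    """
    w = w1 + w2                      # zeroth moment: total mass is preserved
    a1, a2 = w1 / w, w2 / w          # normalized mixing weights
    n = len(mu1)
    # First moment: mass-weighted mean of the two centers.
    mu = [a1 * mu1[i] + a2 * mu2[i] for i in range(n)]
    # Second moment: each component's covariance plus the spread of its
    # center around the merged mean.
    d1 = [mu1[i] - mu[i] for i in range(n)]
    d2 = [mu2[i] - mu[i] for i in range(n)]
    cov = [[a1 * (cov1[i][j] + d1[i] * d1[j]) +
            a2 * (cov2[i][j] + d2[i] * d2[j])
            for j in range(n)] for i in range(n)]
    return w, mu, cov


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Two equal-mass unit Gaussians centered at x = -1 and x = +1.
    print(merge_gaussians(1.0, [-1.0, 0.0, 0.0], identity,
                          1.0, [1.0, 0.0, 0.0], identity))
```

In this usage example the merged Gaussian is centered at the origin and is stretched along x, since the variance along that axis absorbs the separation of the two original centers. A merge cost such as the one the abstract mentions would then compare this single Gaussian against the original pair.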

2026-03-17T03:58:02Z Butian Xiong Rong Liu Tiantian Zhou Meida Chen Zhiwen Fan Andrew Feng http://arxiv.org/abs/2603.14927v2 Masked BRep Autoencoder via Hierarchical Graph Transformer 2026-03-17T03:30:12Z

We introduce a novel self-supervised learning framework that automatically learns representations from input computer-aided design (CAD) models for downstream tasks, including part classification, modeling segmentation, and machining feature recognition. To train our network, we construct a large-scale, unlabeled dataset of boundary representation (BRep) models. The success of our algorithm relies on two key components. The first is a masked graph autoencoder that reconstructs randomly masked geometries and attributes of BReps for representation learning, which enhances generalization. The second is a hierarchical graph Transformer architecture that fuses global and local learning, using a cross-scale mutual attention block to model long-range geometric dependencies and a graph neural network block to aggregate local topological information. After training the autoencoder, we replace its decoder with a task-specific network trained on a small amount of labeled data for downstream tasks. We conduct experiments on various tasks and achieve high performance, even with a small amount of labeled data, demonstrating the practicality and generalizability of our model. Compared to other methods, our model performs significantly better on downstream tasks with the same amount of training data, particularly when the training data is very limited.

2026-03-16T07:30:11Z 27 pages, 11 figures. Under review Yifei Li Kang Wu Wenming Wu Xiao-Ming Fu http://arxiv.org/abs/2603.16078v1 Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI 2026-03-17T02:55:02Z

Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.
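
As a rough illustration of the two regularizers named above, and not the paper's exact losses (the abstract does not state them), one common way to write a folding penalty and a biharmonic penalty for a deformation $\phi(x) = x + u(x)$ over a volume $\Omega$ is shown below. The symbols $\phi$, $u$, $\Omega$, $\mathcal{L}_{\mathrm{jac}}$, and $\mathcal{L}_{\mathrm{bih}}$ are assumptions introduced here.

```latex
\mathcal{L}_{\mathrm{jac}}
  = \frac{1}{|\Omega|} \int_{\Omega} \max\bigl(0,\; -\det J_{\phi}(x)\bigr)\, dx,
\qquad
\mathcal{L}_{\mathrm{bih}}
  = \frac{1}{|\Omega|} \int_{\Omega} \bigl\lVert \Delta u(x) \bigr\rVert_2^2 \, dx
```

Under these definitions the first term is zero wherever $\det J_{\phi}(x) > 0$, that is, wherever the map is locally orientation-preserving and does not fold. The second term is minimized by displacement fields satisfying $\Delta^2 u = 0$, which is where the name "biharmonic" comes from.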

2026-03-17T02:55:02Z Athena Taymourtash S. Mazdak Abulnaga Esra Abaci Turk P. Ellen Grant Polina Golland http://arxiv.org/abs/2603.15991v1 The Midas Touch in Gaze vs. Hand Pointing: Modality-Specific Failure Modes and Implications for XR Interfaces 2026-03-16T23:03:26Z

Extended Reality (XR) interfaces impose both ergonomic and cognitive demands, yet current systems often force a binary choice between hand-based input, which can produce fatigue, and gaze-based input, which is vulnerable to the Midas Touch problem and precision limitations. We introduce the xr-adaptive-modality-2025 platform, a web-based open-source framework for studying whether modality-specific adaptive interventions can improve XR-relevant pointing performance and reduce workload relative to static unimodal interaction. The platform combines physiologically informed gaze simulation, an ISO 9241-9 multidirectional tapping task, and two modality-specific adaptive interventions: gaze declutter and hand target-width inflation. We evaluated the system in a 2 x 2 x 2 within-subjects design manipulating Modality (Hand vs. Gaze), UI Mode (Static vs. Adaptive), and Pressure (Yes vs. No). Results from N=69 participants show that hand yielded higher throughput than gaze (5.17 vs. 4.73 bits/s), lower error (1.8% vs. 19.1%), and lower NASA-TLX workload. Crucially, error profiles differed sharply by modality: gaze errors were predominantly slips (99.2%), whereas hand errors were predominantly misses (95.7%), consistent with the Midas Touch account. Of the two adaptive interventions, only gaze declutter executed in this dataset; it modestly reduced timeouts but not slips. Hand width inflation was not evaluable due to a UI integration bug. These findings reveal modality-specific failure modes with direct implications for adaptive policy design, and establish the platform as a reproducible infrastructure for future studies.
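
For readers unfamiliar with throughput figures in bits/s, the following is a minimal sketch of the effective-throughput calculation commonly used with ISO 9241-9 tapping tasks. The abstract does not state which variant the study used, so the formulation here (effective width of 4.133 times the standard deviation of selection endpoints) and the function and parameter names are assumptions, not the platform's code.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def effective_throughput(amplitudes: Sequence[float],
                         endpoint_deviations: Sequence[float],
                         movement_times_s: Sequence[float]) -> float:
    """Return throughput in bits/s for one block of pointing trials.

    amplitudes: movement distance actually covered in each trial.
    endpoint_deviations: signed selection error along the task axis,
        in the same units as the amplitudes.
    movement_times_s: time per trial in seconds.
    """
    # Effective target width: spans roughly 96% of the endpoint distribution.
    w_e = 4.133 * stdev(endpoint_deviations)
    # Effective amplitude: mean distance moved.
    a_e = mean(amplitudes)
    # Effective index of difficulty (Shannon formulation), in bits.
    id_e = math.log2(a_e / w_e + 1.0)
    return id_e / mean(movement_times_s)


if __name__ == "__main__":
    tp = effective_throughput(
        amplitudes=[300.0, 310.0, 295.0, 305.0],
        endpoint_deviations=[-6.0, 4.0, 9.0, -3.0],
        movement_times_s=[0.62, 0.58, 0.65, 0.60],
    )
    print(f"{tp:.2f} bits/s")
```

Because the effective width is derived from where selections actually landed, this measure folds accuracy and speed into one number, which is why it is reported alongside, rather than instead of, the error rates above.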

2026-03-16T23:03:26Z 25 pages, 10 figures Mohammad Dastgheib Fatemeh Pourmahdian http://arxiv.org/abs/2502.05175v2 Fillerbuster: Unified Generative Scene Completion Model for Casual Captures 2026-03-16T22:10:13Z

We present Fillerbuster, a unified model that completes unknown regions of a 3D scene with a multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for this challenge as they focus on making known pixels look good with sparse-view priors, or on creating missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when camera parameters are unknown. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. We open-source our framework for integration into popular reconstruction platforms like Nerfstudio or Gsplat. We present a flexible, unified inpainting framework to predict many images and poses together, where all inputs are jointly inpainted, and it could be extended to predict more modalities such as depth.

2025-02-07T18:59:51Z Project page at https://ethanweber.me/fillerbuster/ Ethan Weber Norman Müller Yash Kant Vasu Agrawal Michael Zollhöfer Angjoo Kanazawa Christian Richardt http://arxiv.org/abs/2603.15796v1 Perceptual Requirements for Low-Latency Head-Mounted Displays 2026-03-16T18:26:55Z

End-to-end (e2e) latency in head-mounted displays (HMDs) is the time delay between a physical change in the world (e.g., a user's head movement) and the moment the display updates to reflect that change. Tracking, rendering, and other computation in real systems invariably introduce some amount of e2e latency in all HMDs. In modern devices this latency is usually in the range of 12-60 milliseconds and is only partially addressed through pose prediction and late-stage reprojection, which means that perceptual studies and user experience evaluations cannot explore latencies below these values. Here, we introduce a video passthrough HMD, called Camsicle, which is capable of 2-millisecond e2e latency and, additionally, uses a catadioptric design to achieve perspective-correct passthrough without reprojection. This platform enables naturalistic user studies to interrogate the impacts of latency on user experience, preference, and performance. Across two user studies and 57 participants, we find that 2 and 14.3 millisecond latencies are preferred over 23 and 29 milliseconds when attempting to catch a ball. Additionally, we compare individual latency preferences in this naturalistic ball-catching task to psychophysical thresholds for latency detection in a reference-grade system with zero latency to investigate how psychophysical thresholds may relate to subjective evaluations in naturalistic scenarios.

2026-03-16T18:26:55Z Eric Penner Josephine D'Angelo Clinton Smith Nathan Matsuda Neethan Siva Phillip Guan 10.1145/3811335 http://arxiv.org/abs/2603.15780v1 Parallelised Differentiable Straightest Geodesics for 3D Meshes 2026-03-16T18:10:28Z

Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows geodesics to be traced and vectors to be parallel-transported as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After validating the performance and accuracy of our parallel implementation, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, models, and pip-installable library (digeo) are available at: circle-group.github.io/research/DSG.

2026-03-16T18:10:28Z Accepted to CVPR 2026 Hippolyte Verninas Caner Korkmaz Stefanos Zafeiriou Tolga Birdal Simone Foti http://arxiv.org/abs/2502.17531v2 Laplace-Beltrami Operator for Gaussian Splatting 2026-03-16T18:02:34Z

With the rising popularity of 3D Gaussian splatting and the expansion of its applications from rendering to 3D reconstruction, there is also a need for geometry processing applications that operate directly on this new representation. While treating the centers of Gaussians as a point cloud or meshing them is an option that allows existing algorithms to be applied, this might ignore information present in the data or be unnecessarily expensive. Additionally, Gaussian splatting tends to contain a large number of outliers, which do not affect the rendering quality but need to be handled correctly so as not to produce noisy results in geometry processing applications. In this work, we propose a formulation to compute the Laplace-Beltrami operator, a widely used tool in geometry processing, directly on Gaussian splatting using the Mahalanobis distance. While conceptually similar to a point cloud Laplacian, our experiments show superior accuracy on the point clouds encoded in the Gaussian splatting centers and, additionally, the operator can be used to evaluate the quality of the output during optimization.

2025-02-24T14:29:33Z 10 pages Hongyu Zhou Zorah Lähner http://arxiv.org/abs/2603.15546v1 Kimodo: Scaling Controllable Human Motion Generation 2026-03-16T17:09:30Z

High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and a two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how scaling dataset size and model size affects performance.

2026-03-16T17:09:30Z Project page: https://research.nvidia.com/labs/sil/projects/kimodo/ Davis Rempe Mathis Petrovich Ye Yuan Haotian Zhang Xue Bin Peng Yifeng Jiang Tingwu Wang Umar Iqbal David Minor Michael de Ruyter Jiefeng Li Chen Tessler Edy Lim Eugene Jeong Sam Wu Ehsan Hassani Michael Huang Jin-Bey Yu Chaeyeon Chung Lina Song Olivier Dionne Jan Kautz Simon Yuen Sanja Fidler