https://arxiv.org/api/G9GwyFNgHmKvwzhyNIHWZngFmv4
2026-03-24T11:29:25Z

http://arxiv.org/abs/2603.14927v2
Masked BRep Autoencoder via Hierarchical Graph Transformer
2026-03-17T03:30:12Z
We introduce a novel self-supervised learning framework that automatically learns representations from input computer-aided design (CAD) models for downstream tasks, including part classification, modeling segmentation, and machining feature recognition. To train our network, we construct a large-scale, unlabeled dataset of boundary representation (BRep) models. The success of our algorithm relies on two key components. The first is a masked graph autoencoder that reconstructs randomly masked geometries and attributes of BReps for representation learning to enhance the generalization. The second is a hierarchical graph Transformer architecture that elegantly fuses global and local learning by a cross-scale mutual attention block to model long-range geometric dependencies and a graph neural network block to aggregate local topological information. After training the autoencoder, we replace its decoder with a task-specific network trained on a small amount of labeled data for downstream tasks. We conduct experiments on various tasks and achieve high performance, even with a small amount of labeled data, demonstrating the practicality and generalizability of our model. Compared to other methods, our model performs significantly better on downstream tasks with the same amount of training data, particularly when the training data is very limited.
2026-03-16T07:30:11Z
27 pages, 11 figures. Under review
Yifei Li, Kang Wu, Wenming Wu, Xiao-Ming Fu

http://arxiv.org/abs/2603.16078v1
Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI
2026-03-17T02:55:02Z
Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations.
Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.
2026-03-17T02:55:02Z
Athena Taymourtash, S. Mazdak Abulnaga, Esra Abaci Turk, P. Ellen Grant, Polina Golland

http://arxiv.org/abs/2603.16057v1
Toward Reliable Scientific Visualization Pipeline Construction with Structure-Aware Retrieval-Augmented LLMs
2026-03-17T01:52:11Z
Scientific visualization pipelines encode domain-specific procedural knowledge with strict execution dependencies, making their construction sensitive to missing stages, incorrect operator usage, or improper ordering. Thus, generating executable scientific visualization pipelines from natural-language descriptions remains challenging for large language models, particularly in web-based environments where visualization authoring relies on explicit code-level pipeline assembly. In this work, we investigate the reliability of LLM-based scientific visualization pipeline generation, focusing on vtk.js as a representative web-based visualization library.
We propose a structure-aware retrieval-augmented generation workflow that provides pipeline-aligned vtk.js code examples as contextual guidance, supporting correct module selection, parameter configuration, and execution order. We evaluate the proposed workflow across multiple multi-stage scientific visualization tasks and LLMs, measuring reliability in terms of pipeline executability and human correction effort. To this end, we introduce correction cost as a metric for the amount of manual intervention required to obtain a valid pipeline. Our results show that structured, domain-specific context substantially improves pipeline executability and reduces correction cost. We additionally provide an interactive analysis interface to support human-in-the-loop inspection and systematic evaluation of generated visualization pipelines.
2026-03-17T01:52:11Z
Guanghui Zhao, Zhe Wang, Yu Dong, Guan Li, GuiHua Shan

http://arxiv.org/abs/2603.15991v1
The Midas Touch in Gaze vs. Hand Pointing: Modality-Specific Failure Modes and Implications for XR Interfaces
2026-03-16T23:03:26Z
Extended Reality (XR) interfaces impose both ergonomic and cognitive demands, yet current systems often force a binary choice between hand-based input, which can produce fatigue, and gaze-based input, which is vulnerable to the Midas Touch problem and precision limitations. We introduce the xr-adaptive-modality-2025 platform, a web-based open-source framework for studying whether modality-specific adaptive interventions can improve XR-relevant pointing performance and reduce workload relative to static unimodal interaction. The platform combines physiologically informed gaze simulation, an ISO 9241-9 multidirectional tapping task, and two modality-specific adaptive interventions: gaze declutter and hand target-width inflation. We evaluated the system in a 2 x 2 x 2 within-subjects design manipulating Modality (Hand vs. Gaze), UI Mode (Static vs. Adaptive), and Pressure (Yes vs. No).
Results from N=69 participants show that hand yielded higher throughput than gaze (5.17 vs. 4.73 bits/s), lower error (1.8% vs. 19.1%), and lower NASA-TLX workload. Crucially, error profiles differed sharply by modality: gaze errors were predominantly slips (99.2%), whereas hand errors were predominantly misses (95.7%), consistent with the Midas Touch account. Of the two adaptive interventions, only gaze declutter executed in this dataset; it modestly reduced timeouts but not slips. Hand width inflation was not evaluable due to a UI integration bug. These findings reveal modality-specific failure modes with direct implications for adaptive policy design, and establish the platform as a reproducible infrastructure for future studies.
2026-03-16T23:03:26Z
25 pages, 10 figures
Mohammad Dastgheib, Fatemeh Pourmahdian

http://arxiv.org/abs/2502.05175v2
Fillerbuster: Unified Generative Scene Completion Model for Casual Captures
2026-03-16T22:10:13Z
We present Fillerbuster, a unified model that completes unknown regions of a 3D scene with a multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for this challenge as they focus on making known pixels look good with sparse-view priors, or on creating missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when camera parameters are unknown. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content.
We open-source our framework for integration into popular reconstruction platforms like Nerfstudio or Gsplat. We present a flexible, unified inpainting framework to predict many images and poses together, where all inputs are jointly inpainted, and it could be extended to predict more modalities such as depth.
2025-02-07T18:59:51Z
Project page at https://ethanweber.me/fillerbuster/
Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, Christian Richardt

http://arxiv.org/abs/2603.15796v1
Perceptual Requirements for Low-Latency Head-Mounted Displays
2026-03-16T18:26:55Z
End-to-end (e2e) latency in head-mounted displays (HMD) is the time delay between a physical change in the world (e.g., a user's head movement) and the moment the display updates to reflect that change. Tracking, rendering, and other computation in real systems invariably introduce some amount of e2e latency to all HMDs. In modern devices this latency is usually in the range of 12-60 milliseconds and is only partially addressed through pose prediction and late-stage reprojection, which means that perceptual studies and user experience evaluations cannot explore latencies below these values. Here, we introduce a video passthrough HMD, called Camsicle, which is capable of 2-millisecond e2e latency and, additionally, uses a catadioptric design to achieve perspective-correct passthrough without reprojection. This platform enables naturalistic user studies to interrogate the impacts of latency on user experience, preference, and performance. Across two user studies and 57 participants we find that 2 and 14.3 millisecond latencies are preferred over 23 and 29 milliseconds when attempting to catch a ball.
Additionally, we compare individual latency preferences in this naturalistic ball-catching task to psychophysical thresholds for latency detection in a reference-grade system with zero latency to investigate how psychophysical thresholds may relate to subjective evaluations in naturalistic scenarios.
2026-03-16T18:26:55Z
Eric Penner, Josephine D'Angelo, Clinton Smith, Nathan Matsuda, Neethan Siva, Phillip Guan

http://arxiv.org/abs/2603.15780v1
Parallelised Differentiable Straightest Geodesics for 3D Meshes
2026-03-16T18:10:28Z
Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows geodesics to be traced and vectors to be parallel-transported as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation.
Our code, models, and pip-installable library (digeo) are available at: circle-group.github.io/research/DSG.
2026-03-16T18:10:28Z
Accepted to CVPR 2026
Hippolyte Verninas, Caner Korkmaz, Stefanos Zafeiriou, Tolga Birdal, Simone Foti

http://arxiv.org/abs/2502.17531v2
Laplace-Beltrami Operator for Gaussian Splatting
2026-03-16T18:02:34Z
With the rising popularity of 3D Gaussian splatting and the expanse of applications from rendering to 3D reconstruction, there also comes a need for geometry processing applications directly on this new representation. While considering the centers of Gaussians as a point cloud or meshing them is an option that allows existing algorithms to be applied, this might ignore information present in the data or be unnecessarily expensive. Additionally, Gaussian splatting tends to contain a large number of outliers which do not affect the rendering quality but need to be handled correctly in order not to produce noisy results in geometry processing applications. In this work, we propose a formulation to compute the Laplace-Beltrami operator, a widely used tool in geometry processing, directly on Gaussian splatting using the Mahalanobis distance. While conceptually similar to a point cloud Laplacian, our experiments show superior accuracy on the point clouds encoded in the Gaussian splatting centers and, additionally, the operator can be used to evaluate the quality of the output during optimization.
2025-02-24T14:29:33Z
10 pages
Hongyu Zhou, Zorah Lähner

http://arxiv.org/abs/2603.15546v1
Kimodo: Scaling Controllable Human Motion Generation
2026-03-16T17:09:30Z
High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses.
However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how the scaling of dataset size and model size affects performance.
2026-03-16T17:09:30Z
Project page: https://research.nvidia.com/labs/sil/projects/kimodo/
Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, Sanja Fidler

http://arxiv.org/abs/2505.23685v3
Perceptual Sensitivity to Stereo Geometry Errors in Head-Mounted Displays
2026-03-16T16:45:48Z
Stereoscopic head-mounted displays (HMDs) render and present binocular images to create an egocentric, 3D percept to the HMD user. Within this render and presentation pipeline there are potential rendering camera and viewing position errors that can induce deviations in the depth and distance that a user perceives compared to the underlying intended geometry.
For example, rendering errors can arise when HMD render cameras are incorrectly positioned relative to the assumed centers of projection of the HMD displays, and viewing errors can arise when users view stereo geometry from the incorrect location in the HMD eyebox. In this work we present a geometric framework that predicts errors in distance perception arising from inaccurate HMD perspective geometry and build an HMD platform to reliably simulate render and viewing error in a Quest 3 HMD with eye tracking to experimentally test these predictions. We present a series of five experiments to explore the efficacy of this geometric framework and show that errors in perspective geometry can induce both under- and over-estimations in perceived distance. We further demonstrate how real-time visual feedback can be used to dynamically recalibrate visuomotor mapping so that an accurate reach distance is achieved even if the perceived visual distance is negatively impacted by geometric error.
2025-05-29T17:24:38Z
Raffles Xingqi Zhu, Charlie S. Burlingham, Olivier Mercier, Phillip Guan

http://arxiv.org/abs/2603.15447v1
A Texture Lookup Approach to Bézier Curve Evaluation on the GPU
2026-03-16T15:47:33Z
We present a texture-based technique for evaluating Bézier curves on the GPU that leverages fixed-function linear texture interpolation hardware. By offloading curve evaluation to the texture interpolator, this approach can improve performance in compute-bound GPU workloads. The method can also be used naturally for Bézier surfaces and volumes and extends to advanced curve types such as B-splines, NURBS, and both integral and rational polynomials. We show how Seiler interpolation fits into this framework to improve efficiency.
We also compare performance and accuracy against curves evaluated as polynomials in shader code.
2026-03-16T15:47:33Z
Muhammad Anas, Alan Wolfe

http://arxiv.org/abs/2510.03813v3
Diverse Text-to-Image Generation via Contrastive Noise Optimization
2026-03-16T13:07:55Z
Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images, largely enabled by text-guided inference. However, this advantage often comes with a critical drawback: limited diversity, as outputs tend to collapse into similar modes under strong text guidance. Existing approaches typically optimize intermediate latents or text conditions during inference, but these methods deliver only modest gains or remain sensitive to hyperparameter tuning. In this work, we introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective. Unlike prior techniques that adapt intermediate latents, our approach shapes the initial noise to promote diverse outputs. Specifically, we develop a contrastive loss defined in the Tweedie data space and optimize a batch of noise latents. Our contrastive optimization repels instances within the batch to maximize diversity while keeping them anchored to a reference sample to preserve fidelity. We further provide theoretical insights into the mechanism of this preprocessing to substantiate its effectiveness. Extensive experiments across multiple T2I backbones demonstrate that our approach achieves a superior quality-diversity Pareto frontier while remaining robust to hyperparameter choices.
2025-10-04T13:51:32Z
Accepted to ICLR 2026
Byungjun Kim, Soobin Um, Jong Chul Ye

http://arxiv.org/abs/2603.14982v1
Adaptive GPU Kinetic Solver for Fluid-Granular Flows
2026-03-16T08:45:46Z
Simulating fluid-granular flows is crucial for understanding natural disasters, industrial processes, and visually realistic phenomena in computer graphics.
These systems are challenging to simulate because of the strong nonlinear coupling between continuum fluids and discrete granular media, making it difficult to achieve both physical fidelity and computational efficiency at large scales. In this work, we present a unified framework for large-scale fluid-granular simulation that couples the Lattice Boltzmann Method (LBM) for fluids with the Material Point Method (MPM) for granular materials such as sand and snow. We introduce an adaptive block-based multi-level HOME-LBM solver based on solid geometric structures, enabling efficient memory usage and computational performance across multiple lattice resolutions. Consistent rescaling laws for moments allow accurate transfer of macroscopic quantities across refinement interfaces, while a GPU-based algorithm dynamically maintains the multi-level blocks in response to particle motion. By enforcing that all MPM particles reside within the finest fluid nodes, we achieve accurate two-way coupling between fluid and granular phases. Our framework supports a wide range of large-scale phenomena, including snow avalanches, sandstorms, and sand migration, demonstrating high physical fidelity and computational efficiency.
2026-03-16T08:45:46Z
Xingqiao Li, Kui Wu, Haozhe Su, Tianhong Gao, Mengyu Chu, Chenfanfu Jiang, Wei Li, Baoquan Chen

http://arxiv.org/abs/2512.12459v2
From Particles to Fields: Reframing Photon Mapping with Continuous Gaussian Photon Fields
2026-03-16T02:56:53Z
Accurately modeling light transport is essential for realistic image synthesis. Photon mapping provides physically grounded estimates of complex global illumination effects such as caustics and specular-diffuse interactions, yet its per-view radiance estimation remains computationally inefficient when rendering multiple views of the same scene. The inefficiency arises from independent photon tracing and stochastic kernel estimation at each viewpoint, leading to inevitable redundant computation.
To accelerate multi-view rendering, we reformulate photon mapping as a continuous and reusable radiance function. Specifically, we introduce the Gaussian Photon Field (GPF), a learnable representation that encodes photon distributions as anisotropic 3D Gaussian primitives parameterized by position, rotation, scale, and spectrum. GPF is initialized from physically traced photons in the first SPPM iteration and optimized using multi-view supervision of final radiance, distilling photon-based light transport into a continuous field. Once trained, the field enables differentiable radiance evaluation along camera rays without repeated photon tracing or iterative refinement. Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, demonstrate that GPF attains photon-level accuracy while reducing computation by orders of magnitude, unifying the physical rigor of photon-based rendering with the efficiency of neural scene representations.
2025-12-13T21:09:09Z
Jiachen Tao, Benjamin Planche, Van Nguyen Nguyen, Junyi Wu, Yuchun Liu, Haoxuan Wang, Zhongpai Gao, Gengyu Zhang, Meng Zheng, Feiran Wang, Anwesa Choudhuri, Zhenghao Zhao, Weitai Kang, Terrence Chen, Yan Yan, Ziyan Wu

http://arxiv.org/abs/2603.14301v1
4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding
2026-03-15T09:32:58Z
Current 4D representations decouple geometry, motion, and semantics: reconstruction methods discard interpretable motion structure; language-grounded methods attach semantics after motion is learned, blind to how objects move; and motion-aware methods encode dynamics as opaque per-point residuals without object-level organization. We propose 4D Synchronized Fields, a 4D Gaussian representation that learns object-factored motion in-loop during reconstruction and synchronizes language to the resulting kinematics through a per-object conditioned field.
Each Gaussian trajectory is decomposed into shared object motion plus an implicit residual, and a kinematic-conditioned ridge map predicts temporal semantic variation, yielding a single representation in which reconstruction, motion, and semantics are structurally coupled and enabling open-vocabulary temporal queries that retrieve both objects and moments. On HyperNeRF, 4D Synchronized Fields achieves 28.52 dB mean PSNR, the highest among all language-grounded and motion-aware baselines, within 1.5 dB of reconstruction-only methods. On targeted temporal-state retrieval, the kinematic-conditioned field attains 0.884 mean accuracy, 0.815 mean vIoU, and 0.733 mean tIoU, surpassing 4D LangSplat (0.620, 0.433, and 0.439 respectively) and LangSplat (0.415, 0.304, and 0.262). Ablation confirms that kinematic conditioning is the primary driver, accounting for +0.45 tIoU over a static-embedding-only baseline. 4D Synchronized Fields is the only method that jointly exposes interpretable motion primitives and temporally grounded language fields from a single trained representation. Code will be released.
2026-03-15T09:32:58Z
34 pages, 3 figures, 7 tables. Includes supplementary material. Preprint
Mohamed Rayan Barhdadi, Samir Abdaljalil, Rasul Khanbayov, Erchin Serpedin, Hasan Kurban