https://arxiv.org/api/BdyV4dqT8hdcgqqFKwfqZqN5VcA 2026-06-14T03:24:44Z 9323 285 15

http://arxiv.org/abs/2605.12119v2 MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics 2026-05-13T01:49:16Z

Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.
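The geometry-to-appearance progression can be pictured as a per-step choice of conditioning signal during reverse diffusion. What follows is a minimal sketch, assuming a hard switch at a fixed fraction of the denoising steps. The class `PriorSchedule`, the `switch_fraction` hyperparameter, and the hard switch are illustrative assumptions, not MoCam's actual schedule, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class PriorSchedule:
    """Hypothetical two-phase conditioning schedule for a reverse diffusion run.

    Step 0 is the noisiest step. The first `switch_fraction` of the steps are
    conditioned on the geometric prior (e.g. a rendered point cloud), the rest
    on the appearance prior (e.g. the source-view image).
    """
    num_steps: int
    switch_fraction: float = 0.4  # assumed hyperparameter, not from the paper

    def prior_for_step(self, step: int) -> str:
        if not 0 <= step < self.num_steps:
            raise ValueError("step out of range")
        switch_step = round(self.switch_fraction * self.num_steps)
        return "geometry" if step < switch_step else "appearance"


def conditioning_plan(schedule: PriorSchedule) -> list:
    """List which prior conditions each denoising step, noisiest first."""
    return [schedule.prior_for_step(s) for s in range(schedule.num_steps)]


plan = conditioning_plan(PriorSchedule(num_steps=10))
print(plan)  # four "geometry" steps followed by six "appearance" steps
```

Early, noisy steps only need coarse structure, so an incomplete point cloud is tolerable there. Later steps refine detail, which is where the appearance prior can override geometric errors.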

2026-05-12T13:35:54Z Project page: https://orange-3dv-team.github.io/MoCam Haofeng Liu Yang Zhou Ziheng Wang Zhengbo Xu Zhan Peng Jie Ma Jun Liang Shengfeng He Jing Li

http://arxiv.org/abs/2605.12778v1 Generative Motion In-betweening by Diffusion over Continuous Implicit Representations 2026-05-12T21:48:14Z

Recent advances in generative models have yielded impressive progress on motion in-betweening, allowing for more complex, varied, and realistic motion transitions. However, recent methods still exhibit noticeable limitations in preserving keyframe information and ensuring motion continuity. In this paper, we propose a novel pipeline and sampling optimization strategy for latent diffusion models (LDM) based on motion implicit neural representations (INR). By establishing a mapping between INR and sparse spatial or temporal information within latent diffusion, our model can sample the INR parameters from extremely sparse and ambiguous keyframe data and reconstruct plausible and smooth motions from the manifold. Our experiments demonstrate the superior performance of our model, which significantly improves motion generation quality in scenarios with few keyframes while ensuring both keyframe accuracy and diversity of in-between motions.

2026-05-12T21:48:14Z Shiyu Fan Paul Henderson Edmond S. L. Ho

http://arxiv.org/abs/2605.12730v1 BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics 2026-05-12T20:32:02Z

Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.
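To illustrate how kinematic micro-signals can be structured into a directed interaction graph, the following sketch adds an edge i -> j when agent j is close to agent i and lies inside i's facing cone. The `Agent` type, the `radius` and `half_fov` thresholds, and the inverse-distance edge weights are hypothetical choices. The abstract does not describe BEHAVE's actual graph construction.

```python
import math
from dataclasses import dataclass


@dataclass
class Agent:
    x: float
    y: float
    heading: float  # body orientation in radians; 0 points along +x


def interaction_graph(agents, radius=3.0, half_fov=math.radians(60)):
    """Directed edges i -> j (weighted by 1 / distance) when agent j is within
    `radius` of agent i and inside i's facing cone of +/- half_fov."""
    edges = {}
    for i, a in enumerate(agents):
        for j, b in enumerate(agents):
            if i == j:
                continue
            dx, dy = b.x - a.x, b.y - a.y
            dist = math.hypot(dx, dy)
            if dist == 0.0 or dist > radius:
                continue
            # Signed angle between i's heading and the direction to j, in [-pi, pi).
            diff = (math.atan2(dy, dx) - a.heading + math.pi) % (2 * math.pi) - math.pi
            if abs(diff) <= half_fov:
                edges[(i, j)] = 1.0 / dist
    return edges


agents = [Agent(0, 0, 0.0), Agent(2, 0, math.pi), Agent(0, 2, -math.pi / 2)]
g = interaction_graph(agents)
# Agent 2 faces agent 0, but agent 0 faces away from agent 2:
# (2, 0) is an edge and (0, 2) is not, so the graph is genuinely directed.
```

Direction matters here because attention and influence need not be mutual. A symmetric proximity graph would lose exactly the asymmetries that feedback loops are built from.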

2026-05-12T20:32:02Z 19 pages Helene Malyutina

http://arxiv.org/abs/2502.15761v3 AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results 2026-05-12T18:27:21Z

The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. In the case of direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we present AIvaluateXR, a comprehensive evaluation framework for benchmarking LLMs running on XR devices. To demonstrate the framework, we deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. Our experimental setup measures four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. We propose a unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives. Additionally, we compare the efficiency of on-device LLMs with client-server and cloud-based setups, and evaluate their accuracy on two interactive tasks. We believe our findings offer valuable insight to guide future optimization efforts for LLM deployment on XR devices. Our evaluation method can be used as standard groundwork for further research and development in this emerging field. The source code and supplementary materials are available at: www.nanovis.org/AIvaluateXR.html
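Selecting device-model pairs by Pareto optimality means keeping every pair that no other pair beats on all objectives at once. Below is a minimal sketch of a three-objective Pareto front, assuming every objective is oriented so that larger is better (a cost such as latency would be negated first). The scores and pair names are made up for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # three objectives, larger is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Names of the non-dominated candidates (simple O(n^2) scan)."""
    return [
        name
        for name, s in candidates.items()
        if not any(dominates(other, s) for other in candidates.values())
    ]


# Made-up scores for illustration: (quality, speed metric 1, speed metric 2).
pairs = {
    "modelA@deviceX": (0.71, 120.0, 14.0),
    "modelA@deviceY": (0.71, 90.0, 11.0),   # dominated by modelA@deviceX
    "modelB@deviceX": (0.78, 60.0, 8.0),
    "modelC@deviceY": (0.65, 150.0, 9.0),
}
print(pareto_front(pairs))  # ['modelA@deviceX', 'modelB@deviceX', 'modelC@deviceY']
```

The front does not name a single winner. It removes pairs that are strictly worse, such as `modelA@deviceY`, and leaves the remaining quality/speed trade-off to the application.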

2025-02-13T20:55:48Z AIvaluateXR is an updated version of LoXR Dawar Khan Xinyu Liu Omar Mena Donggang Jia Alexandre Kouyoumdjian Ivan Viola

http://arxiv.org/abs/2605.12498v1 EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera 2026-05-12T17:59:56Z

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
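A ray-space solver can be camera-model agnostic because, once pixels are unprojected to unit rays, fisheye and perspective cameras look the same. This is a minimal sketch of one standard closed-form formulation, point-to-ray least squares. Given unit rays d_i through the camera centre and root-relative joints p_i, it finds the translation t minimising the summed squared distance from each p_i + t to its ray, which reduces to the 3x3 linear system (sum_i A_i) t = -sum_i A_i p_i with A_i = I - d_i d_i^T. This is an assumed stand-in, not EgoForce's published solver, and the example joints and translation are synthetic.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, b):
    """Solve the 3x3 system m x = b with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; translation is ill-posed")
    return tuple(
        det3([[b[r] if c == k else m[r][c] for c in range(3)] for r in range(3)]) / d
        for k in range(3)
    )


def solve_translation(rays, rel_points):
    """Closed-form t minimising sum_i ||(I - d_i d_i^T)(p_i + t)||^2.

    rays: unit camera-space ray directions d_i, one per joint.
    rel_points: root-relative 3D joints p_i, e.g. predicted by a network.
    """
    M = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for d, p in zip(rays, rel_points):
        # A projects onto the plane orthogonal to the ray direction d.
        A = [[(1.0 if r == c else 0.0) - d[r] * d[c] for c in range(3)] for r in range(3)]
        for r in range(3):
            for c in range(3):
                M[r][c] += A[r][c]
            rhs[r] -= sum(A[r][c] * p[c] for c in range(3))
    return solve3(M, rhs)


# Synthetic check: place three root-relative joints at a known translation.
true_t = (0.1, -0.05, 0.5)
rel = [(0.0, 0.0, 0.0), (0.08, 0.01, 0.02), (0.03, 0.09, -0.01)]
rays = [normalize(tuple(p[i] + true_t[i] for i in range(3))) for p in rel]
t = solve_translation(rays, rel)
```

Depth is recoverable here only because the relative joints have a known metric spread. This is the kind of depth-scale ambiguity the abstract says the forearm and hand predictions help mitigate.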

2026-05-12T17:59:56Z 23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference Christen Millerdurai Shaoxiang Wang Yaxu Xie Vladislav Golyanik Didier Stricker Alain Pagani

http://arxiv.org/abs/2605.12159v1 ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization 2026-05-12T14:09:48Z

Algorithm Visualization (AV) helps students build mental models by animating algorithm execution states. Recent LLM-based systems such as CODE2VIDEO generate AV videos in an end-to-end manner. However, this paradigm requires the system to simultaneously simulate algorithm flow and satisfy video rendering constraints, such as element layout and color schemes. This complex task induces LLM hallucinations, resulting in reduced execution success rates, element overlap, and inter-frame inconsistencies. To address these challenges, we propose ALGOGEN, a novel paradigm that decouples algorithm execution from rendering. We first introduce Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations. The LLM then generates a Python tracker that simulates algorithm flow and outputs VTA-JSON traces, a JSON encoding of VTA. For rendering, we define a Rendering Style Language (RSL) to templatize algorithm layouts. A deterministic renderer then compiles algorithm traces with RSL into Manim, LaTeX/TikZ, or Three.js outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN improves the average success rate by 17.3 percentage points over end-to-end methods (99.8% versus 82.5%). These results demonstrate that our decoupling paradigm effectively mitigates LLM hallucinations in complex AV tasks, providing a more reliable solution for automated generation of high-quality algorithm visualizations. Demo videos and code are available in the project repository.
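The decoupling can be illustrated with a tracker that emits operations rather than pixels, plus a deterministic replay step. This is a minimal sketch, assuming traces are sequences of primitive operations that form a monoid under concatenation (identity = empty trace) and act on a visual state. The operation names (`highlight`, `swap`), the state layout, and `bubble_sort_tracker` are hypothetical. They are not the VTA or RSL definitions from the paper.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """A sequence of primitive visual operations.

    Traces form a monoid: the identity is the empty trace and the binary
    operation is concatenation, which is associative.
    """
    ops: List[dict] = field(default_factory=list)

    @staticmethod
    def identity() -> "Trace":
        return Trace([])

    def __add__(self, other: "Trace") -> "Trace":
        return Trace(self.ops + other.ops)


def op(name: str, **args) -> Trace:
    """Build a single-operation trace, e.g. op("swap", i=0, j=1)."""
    return Trace([{"op": name, **args}])


def replay(state: dict, trace: Trace) -> dict:
    """Deterministically apply a trace to a visual state (stand-in for a renderer)."""
    s = {"array": list(state["array"]), "highlight": list(state["highlight"])}
    for o in trace.ops:
        if o["op"] == "highlight":
            s["highlight"] = list(o["indices"])
        elif o["op"] == "swap":
            i, j = o["i"], o["j"]
            s["array"][i], s["array"][j] = s["array"][j], s["array"][i]
        else:
            raise ValueError(f"unknown op {o['op']!r}")
    return s


def bubble_sort_tracker(values: List[int]) -> Trace:
    """Hypothetical tracker: runs bubble sort and emits operations, not pixels."""
    a = list(values)
    trace = Trace.identity()
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            trace = trace + op("highlight", indices=[i, i + 1])
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                trace = trace + op("swap", i=i, j=i + 1)
    return trace


initial = {"array": [3, 1, 2], "highlight": []}
trace = bubble_sort_tracker(initial["array"])
trace_json = json.dumps(trace.ops)  # what a deterministic renderer would consume
final = replay(initial, trace)
```

Associativity is what makes the split safe. Replaying a trace in one pass or in consecutive chunks gives the same state, so a renderer can lay out frames independently of how the tracker produced them.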

2026-05-12T14:09:48Z Kunpeng Liao Yuexiao Ma Yisheng Lin Hualin Zeng Xiawu Zheng Rongrong Ji

http://arxiv.org/abs/2604.12625v2 Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments 2026-05-12T07:55:51Z

High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmaps as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps under different lighting conditions must be precomputed, which incurs substantial storage and memory overhead. To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method uses multi-dimensional feature maps and lightweight neural networks to encode the temporal information instead of storing multiple lightmap sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during training, which allows the final feature maps to be BC-compressed and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation. Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.
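Simulating block compression during training means the network sees, in the forward pass, the values a GPU would actually decode. Below is a minimal sketch of the forward simulation for one single-channel 4x4 block in the style of BC4: two 8-bit endpoints and a 3-bit palette index per texel. It assumes the 8-entry palette mode and ignores hardware rounding of the interpolated entries. In a training loop, gradients would typically bypass the rounding with a straight-through estimator, which is not shown. NDGI's exact simulation scheme is not described in the abstract.

```python
def quantize8(x: float) -> float:
    """Clamp to [0, 1] and round to the nearest of 256 evenly spaced levels."""
    return round(min(max(x, 0.0), 1.0) * 255) / 255


def simulate_bc4_block(block):
    """Forward simulation of a BC4-like encode/decode of one 4x4 block.

    Endpoints are the block min and max stored at 8-bit precision; each texel
    is replaced by the nearest of 8 evenly spaced palette entries (a 3-bit
    index). Returns the 16 decoded values.
    """
    if len(block) != 16:
        raise ValueError("a block holds exactly 16 texels")
    lo, hi = quantize8(min(block)), quantize8(max(block))
    palette = [lo + (hi - lo) * k / 7 for k in range(8)]
    return [min(palette, key=lambda p: abs(p - v)) for v in block]


block = [i / 15 for i in range(16)]  # a smooth ramp of feature values
decoded = simulate_bc4_block(block)
max_err = max(abs(a - b) for a, b in zip(block, decoded))
```

The error per texel is bounded by roughly half a palette step, (max - min) / 14. Feature maps whose blocks have a small value range therefore survive compression almost losslessly, which is the property such a simulation encourages the network to learn.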

2026-04-14T11:52:57Z Accepted to CVPR 2026 Jianhui Wu Jian Zhou Zhi Zhou Zhangjin Huang Chao Li

http://arxiv.org/abs/2605.11696v1 WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting 2026-05-12T07:53:27Z

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

2026-05-12T07:53:27Z Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: https://lez-s.github.io/wildrelight_proj/ Lezhong Wang Mehmet Onurcan Kaya Siavash Bigdeli Jeppe Revall Frisvad

http://arxiv.org/abs/2605.11673v1 STA-FEM: Exact Streaming Assembly for Preplanned Dynamic Tetrahedral Topology Edits 2026-05-12T07:29:02Z

Dynamic tetrahedral simulation pipelines rebuild topology-dependent solver state after every fracture, refinement, or merge event, discarding structural continuity that survives each edit and spending global work on what are often local changes. We present STA-FEM, a streaming assembly method for simulations with topologically dynamic tetrahedral meshes operating on a fixed superset mesh: when the candidate element pool is preallocated and the per-frame edit stream is exposed, the surrounding solver, preconditioner, and time-stepping layers stay unchanged while the per-frame assembly step is replaced with persistent incremental updates that exactly match a full rebuild at every frame. Across various three-dimensional examples with up to 460k elements, the method delivers end-to-end speedups of 1.37x to 1.61x over full rebuild with orders-of-magnitude reductions in matrix update cost, preserving exact matrix parity in all tested frames against a stronger exact local recomputation baseline. We test our algorithm in realistic fracture simulation pipelines and observe up to 76% speedups in fracture frame time with exact equivalence to a ground-truth full-rebuild algorithm. These results suggest that exact streaming assembly is a practical approach for simulating tetrahedral meshes with dynamic topology.
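The core idea of streaming assembly on a preallocated superset can be shown with delta updates to a persistent global matrix. Below is a minimal sketch in which 2-node springs stand in for tetrahedra. `POOL`, `StreamingAssembler`, and the edit interface are hypothetical. The stiffness values are deliberately integer-valued so that floating-point add-then-subtract is exact. With general values, subtracting a contribution can leave rounding residue, so exact parity of the kind the abstract reports needs more care than this sketch shows.

```python
# Hypothetical superset "mesh": every element that may ever become active is
# preallocated. 2-node springs (node_a, node_b, stiffness) stand in for
# tetrahedra; integer-valued stiffnesses keep float add/subtract exact.
POOL = [(0, 1, 2.0), (1, 2, 3.0), (2, 3, 1.0), (0, 2, 4.0)]


def local_entries(elem):
    """Local stiffness contributions of one spring as ((row, col), value) pairs."""
    a, b, k = elem
    return [((a, a), k), ((b, b), k), ((a, b), -k), ((b, a), -k)]


def sparsity_pattern():
    """Fixed pattern of the superset mesh, so edits never change the structure."""
    return {key for elem in POOL for key, _ in local_entries(elem)}


def full_rebuild(active):
    """Reference path: reassemble the global matrix from all active elements."""
    K = {key: 0.0 for key in sparsity_pattern()}
    for e, on in enumerate(active):
        if on:
            for key, v in local_entries(POOL[e]):
                K[key] += v
    return K


class StreamingAssembler:
    """Keeps the global matrix persistent and applies per-frame edits as deltas."""

    def __init__(self, active):
        self.active = list(active)
        self.K = full_rebuild(self.active)

    def apply_edits(self, activate=(), deactivate=()):
        # Only entries touched by edited elements are updated.
        for e, sign in [(e, -1.0) for e in deactivate] + [(e, 1.0) for e in activate]:
            want = sign > 0
            if self.active[e] == want:
                continue  # no-op edit
            self.active[e] = want
            for key, v in local_entries(POOL[e]):
                self.K[key] += sign * v


asm = StreamingAssembler([True, True, True, False])
asm.apply_edits(deactivate=[1], activate=[3])  # e.g. a fracture plus a new element
```

Because the sparsity pattern comes from the whole pool, an edit only changes values, never structure. That is why, as the abstract notes, the solver and preconditioner layers can stay unchanged.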

2026-05-12T07:29:02Z 8 pages, 4 figures Manish Acharya David Hyde

http://arxiv.org/abs/2304.09479v5 DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows 2026-05-12T07:05:20Z

We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io

2023-04-19T08:03:20Z Published in IEEE TPAMI (vol. 48, no. 5, May 2026). This is an extended version of the ICCV 2023 paper (DiFaReli). IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 5, pp. 5068-5082, May 2026 Puntawat Ponglertnapakorn Nontawat Tritrong Supasorn Suwajanakorn 10.1109/TPAMI.2025.3648667

http://arxiv.org/abs/2605.11536v1 ToF ReSTIR: Time-of-Flight Rendering with Spatio-temporal Reservoir Resampling 2026-05-12T05:02:17Z

We present a novel spatio-temporal reuse framework for time-resolved light transport, enabling efficient Monte Carlo rendering of time-of-flight (ToF) phenomena such as time-gated imaging and transient light capture. Existing ToF rendering methods are computationally expensive, scale poorly to complex dynamic scenes, and are therefore unsuitable for applications with strict latency constraints. To address this limitation, we draw inspiration from ReSTIR, a reuse-based technique for steady-state real-time rendering, and adapt its core principles to interactive-rate ToF simulation. However, naively applying existing ReSTIR methods to ToF rendering leads to severe inefficiency, as reused paths frequently violate optical path-length constraints and thus contribute little or no signal. We overcome this challenge by introducing a path reuse formulation that explicitly enforces physically valid optical path lengths. The key idea is path-length-aware shift mapping, a geometric transformation based on Newton's method that adjusts reused light paths to satisfy temporal gating constraints, inspired by specular manifold exploration in steady-state caustics rendering. The resulting framework substantially improves the efficiency of ToF rendering across a wide range of scenarios, including complex scenes with glossy or specular materials and dynamic motion. Our method supports both time-gated and transient rendering at interactive frame rates, enabling simulation under practical latency constraints. We demonstrate the effectiveness of our approach through two downstream applications: shape reconstruction and navigation.
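A path-length-aware shift can be pictured as sliding a path vertex until the total optical length hits the gate. The following is a minimal one-parameter sketch of that idea using Newton's method on f(s) = L(s) - L_target. It assumes a two-segment path src -> v(s) -> dst in a medium with unit refractive index, where the middle vertex moves along a fixed direction. The helper names and the example geometry are hypothetical. The paper's shift mapping operates on full reused paths and surface constraints, which this toy does not model.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def path_length(src, vertex, dst):
    """Optical length of the two-segment path src -> vertex -> dst (unit IOR)."""
    return norm(sub(vertex, src)) + norm(sub(dst, vertex))


def shift_to_gate(src, v0, direction, dst, target, tol=1e-10, max_iter=50):
    """Slide the middle vertex v(s) = v0 + s * direction until the path length
    equals `target`, via Newton's method on f(s) = L(s) - target.
    Returns (vertex, s), or None if the iteration fails."""
    s = 0.0
    for _ in range(max_iter):
        v = tuple(p + s * d for p, d in zip(v0, direction))
        f = path_length(src, v, dst) - target
        if abs(f) < tol:
            return v, s
        a, b = sub(v, src), sub(v, dst)
        # dL/ds: each segment length changes by (unit segment vector) . direction
        df = dot(a, direction) / norm(a) + dot(b, direction) / norm(b)
        if df == 0.0:
            return None  # stationary point, Newton step undefined
        s -= f / df
    return None


src, dst = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)   # e.g. emitter and sensor
target = 2.0 * math.sqrt(5.0)                   # gate length = c * gate time
result = shift_to_gate(src, (1.0, 1.0, 0.0), (0.0, 1.0, 0.0), dst, target)
```

Here the vertex starts at (1, 1, 0) with length 2*sqrt(2) and converges in a few iterations to about (1, 2, 0), whose length matches the gate. A reused path that simply kept its original vertex would fall outside the gate and contribute nothing.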

2026-05-12T05:02:17Z Juhyeon Kim Wojciech Jarosz Adithya Pediredla 10.1145/3811299

http://arxiv.org/abs/2605.11489v1 3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering 2026-05-12T04:05:36Z

3D Gaussian Splatting (3DGS) enables high-quality real-time 3D rendering but faces challenges in scaling efficiently to ultra-dense scenes and high resolutions, owing to computational bottlenecks that limit its use in latency-sensitive applications. Instead of optimizing the splatting pipeline itself, we propose 3DGS$^3$, a unified post-rendering framework that jointly performs super sampling and frame interpolation through differentiable processing of low-resolution outputs to achieve both high-resolution and high-frame-rate rendering. Our Gradient-Aware Super Sampling (GASS) module leverages the continuous differentiability of 3DGS to extract image gradients that guide a GRU-based refinement network, enabling high-fidelity super sampling. Furthermore, a Lightweight Temporal Frame Interpolation (LTFI) module based on a compact U-Net-like backbone fuses temporal and differentiable spatial cues from consecutive frames to synthesize temporally coherent intermediate frames. Experiments on public datasets demonstrate that 3DGS$^3$ achieves superior rendering efficiency and visual quality compared with state-of-the-art methods and remains compatible with existing 3DGS acceleration techniques. The code will be publicly released upon acceptance.

2026-05-12T04:05:36Z Yibo Zhao Fan Gao Youcheng Cai Ligang Liu

http://arxiv.org/abs/2605.11266v1 PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives 2026-05-11T21:43:43Z

Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures that satisfy physical functionality objectives. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS-generated structures than for an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.

2026-05-11T21:43:43Z Submitted to Artificial Intelligence. 52 pages Zachary Lee Maxwell Jacobson Yexiang Xue

http://arxiv.org/abs/2605.11115v1 LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR 2026-05-11T18:24:04Z

High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusion-based approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent-to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both text- and image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.
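An exposure stack only becomes an HDR image once it is fused into a single radiance estimate. Below is a minimal sketch of one classic fusion rule, a hat-weighted average of pixel value divided by exposure time. It assumes linear (not tone-mapped) images with values in [0, 1] and known relative exposure times. This illustrates the generic merge step, not LatentHDR's own pipeline, which the abstract describes only at the level of latent mappings. The radiance values in the example are synthetic.

```python
def hat_weight(z: float) -> float:
    """Triangle weight: trust mid-tones, ignore values at 0 (black) or 1 (clipped)."""
    return 1.0 - abs(2.0 * z - 1.0)


def merge_exposures(stack, exposure_times, eps=1e-8):
    """Merge linear LDR images (flat lists of pixel values in [0, 1]) taken at
    the given relative exposure times into one radiance estimate per pixel."""
    if len(stack) != len(exposure_times):
        raise ValueError("one exposure time per image")
    radiance = []
    for p in range(len(stack[0])):
        num = den = 0.0
        for img, t in zip(stack, exposure_times):
            w = hat_weight(img[p])
            num += w * img[p] / t
            den += w
        if den < eps:
            # Every exposure is black or clipped: fall back to the middle one.
            mid = len(stack) // 2
            radiance.append(stack[mid][p] / exposure_times[mid])
        else:
            radiance.append(num / den)
    return radiance


true_radiance = [0.1, 0.5, 2.0]
times = [0.25, 1.0, 4.0]
stack = [[min(1.0, r * t) for r in true_radiance] for t in times]
recovered = merge_exposures(stack, times)
```

The brightest pixel (radiance 2.0) is clipped in two of the three exposures, and the hat weight gives those clipped samples zero weight. This only works if the exposures are pixel-aligned, which is why cross-exposure structural consistency matters.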

2026-05-11T18:24:04Z Pedram Fekri WenChen Li William Chen Peter Altamirano

http://arxiv.org/abs/2212.08790v2 Unphased Wrinkles: Estimating cloth elasticity parameters using a frequency-based loss 2026-05-11T17:12:09Z

Generating realistic clothing for virtual applications like online retail and digital avatars is crucial but requires expert knowledge of 3D tools to generate believable simulations. Recently, a number of works have proposed estimating cloth material properties from specialized capture setups. However, these systems tend to be monolithic, complex and expensive. We propose a simplified method for automatically determining parameters based on easily captured real-world fabrics. While existing methods carefully design experiments to isolate stretch parameters from bending modes, we embrace that stretching fabrics causes wrinkling and propose a novel specialized loss for comparing wrinkled fabrics. We designed our objective function to capture material-specific behavior, resulting in similar values for different wrinkle configurations of the same material. We estimate bending first, given that membrane stiffness has little effect on bending. We use differentiable simulation to find an optimal set of parameters that minimizes the difference between simulated cloth and deformed target cloth. Furthermore, our pipeline decouples the capture method from the optimization by registering a template mesh to the scanned data. These choices simplify the capture system and allow for wrinkles in scanned fabrics. We demonstrate our method on captured data of three different real-world fabrics and on three digital fabrics produced by a third-party simulator.
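A frequency-based loss can give similar values for different wrinkle configurations of the same material because the magnitude spectrum ignores where the wrinkles sit (their phase) and keeps only their wavelengths and amplitudes. Below is a minimal 1D sketch of that property, comparing DFT magnitudes of height profiles. The function names, the mean removal, and the 1D setting are illustrative assumptions. The abstract does not give the paper's exact loss, which operates on registered 3D cloth.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum of a real 1D signal via a direct O(n^2) DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectral_loss(height_a, height_b):
    """Phase-invariant loss: mean squared difference of DFT magnitudes, with
    each profile's mean removed so a vertical offset of the cloth is ignored."""
    if len(height_a) != len(height_b):
        raise ValueError("profiles must have the same length")
    ma = dft_magnitudes([h - sum(height_a) / len(height_a) for h in height_a])
    mb = dft_magnitudes([h - sum(height_b) / len(height_b) for h in height_b])
    return sum((a - b) ** 2 for a, b in zip(ma, mb)) / len(ma)


n = 32
wrinkles = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
shifted = wrinkles[5:] + wrinkles[:5]                          # same wrinkles, different phase
other = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]  # different wavelength
pointwise = sum((a - b) ** 2 for a, b in zip(wrinkles, shifted)) / n
```

A pointwise loss would heavily penalise the shifted profile (about 1.98 here) even though it describes the same material. The spectral loss is near zero for the shifted profile and large for the different wavelength, which is the behavior a material-parameter fit needs.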

2022-12-17T03:43:45Z Egor Larionov Marie-Lena Eckert Katja Wolff Tuur Stuyck