http://arxiv.org/abs/2505.20129v3 Agentic 3D Scene Generation with Spatially Contextualized VLMs 2025-07-04T15:28:37Z

Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications. Project page: https://spatctxvlm.github.io/project_page/.
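
The scene hypergraph described above, with unary, binary, and higher-order constraints, can be pictured with a small data-structure sketch. This is only an illustration under assumptions: the class names, method names, and relation labels (`against_wall`, `on_top_of`, `evenly_spaced`) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    relation: str   # hypothetical label, e.g. "against_wall" or "on_top_of"
    objects: tuple  # one object = unary, two = binary, three or more = higher-order


@dataclass
class SceneHypergraph:
    objects: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_constraint(self, relation, *objects):
        """Register a constraint over any number of scene objects."""
        if not objects:
            raise ValueError("a constraint needs at least one object")
        self.objects.update(objects)
        self.edges.append(Hyperedge(relation, tuple(objects)))

    def constraints_on(self, obj):
        """Return every constraint that mentions the given object."""
        return [e for e in self.edges if obj in e.objects]

    def arity_counts(self):
        """Count constraints by arity: unary, binary, higher-order."""
        counts = {"unary": 0, "binary": 0, "higher_order": 0}
        for e in self.edges:
            key = {1: "unary", 2: "binary"}.get(len(e.objects), "higher_order")
            counts[key] += 1
        return counts
```

A hyperedge over a single object plays the role of a unary constraint, so one edge type covers all three cases; an agent loop could call `constraints_on` to find what must be re-checked after moving an object.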

2025-05-26T15:28:17Z Project page: https://spatctxvlm.github.io/project_page/ Xinhang Liu Yu-Wing Tai Chi-Keung Tang http://arxiv.org/abs/2505.06227v2 Anymate: A Dataset and Baselines for Learning 3D Object Rigging 2025-07-04T02:11:50Z

Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information -- 70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at https://anymate3d.github.io/.

2025-05-09T17:59:33Z SIGGRAPH 2025. Project page: https://anymate3d.github.io/ Yufan Deng Yuhao Zhang Chen Geng Shangzhe Wu Jiajun Wu http://arxiv.org/abs/2411.12089v3 FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting 2025-07-03T22:14:28Z

In the real world, objects reveal internal textures when sliced or cut, yet this behavior is not well-studied in 3D generation tasks today. For example, slicing a virtual 3D watermelon should reveal flesh and seeds. Given that no available dataset captures an object's full internal structure and collecting data from all slices is impractical, generative methods become the obvious approach. However, current 3D generation and inpainting methods often focus on visible appearance and overlook internal textures. To bridge this gap, we introduce FruitNinja, the first method to generate internal textures for 3D objects undergoing geometric and topological changes. Our approach produces objects via 3D Gaussian Splatting (3DGS) with both surface and interior textures synthesized, enabling real-time slicing and rendering without additional optimization. FruitNinja leverages a pre-trained diffusion model to progressively inpaint cross-sectional views and applies voxel-grid-based smoothing to achieve cohesive textures throughout the object. Our OpaqueAtom GS strategy overcomes 3DGS limitations by employing densely distributed opaque Gaussians, avoiding biases toward larger particles that destabilize training and sharp color transitions for fine-grained textures. Experimental results show that FruitNinja substantially outperforms existing approaches, showcasing unmatched visual quality in real-time rendered internal views across arbitrary geometry manipulations.

2024-11-18T22:00:19Z accepted in CVPR 2025, project page https://fanguw.github.io/FruitNinja3D Fangyu Wu Yuhao Chen http://arxiv.org/abs/2507.03170v1 ASCRIBE-XR: Virtual Reality for Visualization of Scientific Imagery 2025-07-03T20:52:40Z

ASCRIBE-XR, a novel computational platform designed to facilitate the visualization and exploration of 3D volumetric data and mesh data in the context of synchrotron experiments, is described. Using Godot and PC-VR technologies, the platform enables users to dynamically load and manipulate 3D data sets to gain deeper insights into their research. The program's multi-user capabilities, enabled through WebRTC and MQTT, allow multiple users to share data and visualize it together in real time, promoting a more interactive and engaging research experience. We describe the design and implementation of ASCRIBE-XR, highlighting its key features and capabilities. We also discuss its utility in the context of synchrotron research, including examples of its application and potential benefits for the scientific community.

2025-07-03T20:52:40Z Ronald J. Pandolfi Jeffrey J. Donatelli Julian Todd Daniela Ushizima http://arxiv.org/abs/2507.03166v1 Image-driven Robot Drawing with Rapid Lognormal Movements 2025-07-03T20:51:27Z

Large image generation and vision models, combined with differentiable rendering technologies, have become powerful tools for generating paths that can be drawn or painted by a robot. However, these tools often overlook the intrinsic physicality of the human drawing/writing act, which is usually executed with skillful hand/arm gestures. Taking this into account is important for the visual aesthetics of the results and for the development of closer and more intuitive artist-robot collaboration scenarios. We present a method that bridges this gap by enabling gradient-based optimization of natural human-like motions guided by cost functions defined in image space. To this end, we use the sigma-lognormal model of human hand/arm movements, with an adaptation that enables its use in conjunction with a differentiable vector graphics (DiffVG) renderer. We demonstrate how this pipeline can be used to generate feasible trajectories for a robot by combining image-driven objectives with a minimum-time smoothing criterion. We demonstrate applications with generation and robotic reproduction of synthetic graffiti as well as image abstraction.
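
As a rough illustration of the sigma-lognormal model mentioned above, the sketch below evaluates the lognormal speed profile of a single stroke in its standard textbook form and checks numerically that the speed integrates to the stroke amplitude. The function names and parameter values (`D`, `t0`, `mu`, `sigma`) are illustrative assumptions, and the sketch does not include the adaptation for DiffVG that the abstract describes.

```python
import math


def lognormal_speed(t, D, t0, mu, sigma):
    """Speed of one stroke at time t under a lognormal profile.

    D: stroke amplitude (distance covered), t0: onset time,
    mu, sigma: location and spread of the profile in log-time.
    """
    if t <= t0:
        return 0.0
    x = t - t0
    return D / (sigma * math.sqrt(2.0 * math.pi) * x) * math.exp(
        -((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)
    )


def stroke_length(D, t0, mu, sigma, t_end=5.0, steps=20000):
    """Integrate the speed with the trapezoidal rule; approaches D for large t_end."""
    dt = (t_end - t0) / steps
    total = 0.0
    prev = lognormal_speed(t0, D, t0, mu, sigma)
    for i in range(1, steps + 1):
        cur = lognormal_speed(t0 + i * dt, D, t0, mu, sigma)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total
```

Because the profile is a smooth closed-form function of its parameters, it lends itself to the gradient-based optimization the abstract refers to; a full trajectory would superimpose several such strokes, each with its own direction.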

2025-07-03T20:51:27Z Accepted at IEEE RO-MAN 2025 Daniel Berio Guillaume Clivaz Michael Stroh Oliver Deussen Réjean Plamondon Sylvain Calinon Frederic Fol Leymarie http://arxiv.org/abs/2501.04011v2 Adaptive Algebraic Reuse of Reordering in Cholesky Factorization with Dynamic Sparsity Pattern 2025-07-03T16:09:51Z

Cholesky linear solvers are a critical bottleneck in challenging applications within computer graphics and scientific computing. These applications include but are not limited to elastodynamic barrier methods such as Incremental Potential Contact (IPC), and geometric operations such as remeshing and morphology. In these contexts, the sparsity patterns of the linear systems frequently change across successive calls to the Cholesky solver, necessitating repeated symbolic analyses that dominate the overall solver runtime. To address this bottleneck, we evaluate our method on over 150,000 linear systems generated from diverse nonlinear problems with dynamic sparsity changes in Incremental Potential Contact (IPC) and patch remeshing on a wide range of triangular meshes of various sizes. Our analysis using three leading sparse Cholesky libraries, Intel MKL Pardiso, SuiteSparse CHOLMOD, and Apple Accelerate, reveals that the primary performance constraint lies in the symbolic re-ordering phase of the solver. Recognizing this, we introduce Parth, an innovative re-ordering method designed to adaptively update ordering vectors only where local connectivity changes occur. Parth employs a novel hierarchical graph decomposition algorithm to break down the dual graph of the input matrix into fine-grained subgraphs, facilitating the selective reuse of fill-reducing orderings when sparsity patterns exhibit temporal coherence. Our extensive evaluation demonstrates that Parth achieves up to a 255x and 13x speedup in fill-reducing ordering for our IPC and remeshing benchmarks, respectively, and a 6.85x and 10.7x acceleration in symbolic analysis. These enhancements translate to up to 2.95x and 5.89x reduction in overall solver runtime. Additionally, Parth's integration requires only three lines of code, resulting in significant computational savings without requiring changes to the computational stack.

2024-12-16T23:04:30Z Behrooz Zarebavani Danny M. Kaufman David I. W. Levin Maryam Mehri Dehnavi 10.1145/3731179 http://arxiv.org/abs/2503.05511v5 Free Your Hands: Lightweight Turntable-Based Object Capture Pipeline 2025-07-03T15:08:46Z

Novel view synthesis (NVS) from multiple captured photos of an object is a widely studied problem. Achieving high quality typically requires dense sampling of input views, which can lead to frustrating manual labor. Manually positioning cameras to maintain an optimal desired distribution can be difficult for humans, and if a good distribution is found, it is not easy to replicate. Additionally, the captured data can suffer from motion blur and defocus due to human error. In this paper, we use a lightweight object capture pipeline to reduce the manual workload and standardize the acquisition setup, with a consumer turntable to carry the object and a tripod to hold the camera. Of course, turntables and gantry systems have been frequently used to automatically capture dense samples under various views and lighting conditions; the key difference is that we use a turntable under natural environment lighting. This way, we can easily capture hundreds of valid images in several minutes without hands-on effort. However, in the object reference frame, the light conditions vary (rotate); this does not match the assumptions of standard NVS methods like 3D Gaussian splatting (3DGS). We design a neural radiance representation conditioned on light rotations, which addresses this issue and allows rendering with novel light rotations as an additional benefit. We further study the behavior of rotations and find optimal capturing strategies. We demonstrate our pipeline using 3DGS as the underlying framework, achieving higher quality and showcasing the method's potential for novel lighting and harmonization tasks.

2025-03-07T15:27:44Z Jiahui Fan Fujun Luan Jian Yang Miloš Hašan Beibei Wang http://arxiv.org/abs/2507.02674v1 Real-time Image-based Lighting of Glints 2025-07-03T14:38:37Z

Image-based lighting is a widely used technique to reproduce shading under real-world lighting conditions, especially in real-time rendering applications. A particularly challenging scenario involves materials exhibiting a sparkling or glittering appearance, caused by discrete microfacets scattered across their surface. In this paper, we propose an efficient approximation for image-based lighting of glints, enabling fully dynamic material properties and environment maps. Our novel approach is grounded in real-time glint rendering under area light illumination and employs standard environment map filtering techniques. Crucially, our environment map filtering process is sufficiently fast to be executed on a per-frame basis. Our method assumes that the environment map is partitioned into few homogeneous regions of constant radiance. By filtering the corresponding indicator functions with the normal distribution function, we obtain the probabilities for individual microfacets to reflect light from each region. During shading, these probabilities are utilized to hierarchically sample a multinomial distribution, facilitated by our novel dual-gated Gaussian approximation of binomial distributions. We validate that our real-time approximation is close to ground-truth renderings for a range of material properties and lighting conditions, and demonstrate robust and stable performance, with little overhead over rendering glints from a single directional light. Compared to rendering smooth materials without glints, our approach requires twice as much memory to store the prefiltered environment map.
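
The hierarchical sampling step described above can be sketched as a recursive split of a microfacet count across regions, where each split is a binomial draw. This is a minimal sketch under stated assumptions: it uses a plain clamped normal approximation of the binomial as a stand-in, not the dual-gated Gaussian approximation the abstract refers to, and the function names are hypothetical.

```python
import random


def approx_binomial(n, p, rng):
    """Draw from Binomial(n, p) using a clamped, rounded normal approximation."""
    if n == 0 or p <= 0.0:
        return 0
    if p >= 1.0:
        return n
    mean = n * p
    std = (n * p * (1.0 - p)) ** 0.5
    k = int(round(rng.gauss(mean, std)))
    return max(0, min(n, k))


def sample_multinomial(n, probs, rng):
    """Split n microfacets over regions by recursive binary splits.

    probs: per-region reflection probabilities (need not be normalized).
    Returns one count per region; the counts always sum to n.
    """
    if len(probs) == 1:
        return [n]
    mid = len(probs) // 2
    total = sum(probs)
    # Probability that a microfacet falls in the left half of the regions.
    p_left = sum(probs[:mid]) / total if total > 0.0 else 0.0
    n_left = approx_binomial(n, p_left, rng)
    return (sample_multinomial(n_left, probs[:mid], rng)
            + sample_multinomial(n - n_left, probs[mid:], rng))
```

Each level of the recursion needs only one binomial draw, so the cost grows with the number of regions rather than with the number of microfacets, which is why a cheap approximation of the binomial matters for real-time use.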

2025-07-03T14:38:37Z Tom Kneiphof Reinhard Klein 10.1111/cgf.70175 http://arxiv.org/abs/2503.05020v2 GRIP: A General Robotic Incremental Potential Contact Simulation Dataset for Unified Deformable-Rigid Coupled Grasping 2025-07-03T12:20:11Z

Grasping is fundamental to robotic manipulation, and recent advances in large-scale grasping datasets have provided essential training data and evaluation benchmarks, accelerating the development of learning-based methods for robust object grasping. However, most existing datasets exclude deformable bodies due to the lack of scalable, robust simulation pipelines, limiting the development of generalizable models for compliant grippers and soft manipulands. To address these challenges, we present GRIP, a General Robotic Incremental Potential contact simulation dataset for universal grasping. GRIP leverages an optimized Incremental Potential Contact (IPC)-based simulator for multi-environment data generation, achieving up to 48x speedup while ensuring efficient, intersection- and inversion-free simulations for compliant grippers and deformable objects. Our fully automated pipeline generates and evaluates diverse grasp interactions across 1,200 objects and 100,000 grasp poses, incorporating both soft and rigid grippers. The GRIP dataset enables applications such as neural grasp generation and stress field prediction.

2025-03-06T22:46:13Z We release GRIP to advance research in robotic manipulation, soft-gripper control, and physics-driven simulation at: https://bell0o.github.io/GRIP/ Siyu Ma Wenxin Du Chang Yu Ying Jiang Zeshun Zong Tianyi Xie Yunuo Chen Yin Yang Xuchen Han Chenfanfu Jiang http://arxiv.org/abs/2503.08061v4 ForceGrip: Reference-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation 2025-07-03T08:24:20Z

Realistic hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on kinematic approaches or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users' intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user's grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios (randomizing object shapes, wrist movements, and trigger input flows) to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods. Demo videos are available as supplementary material and the code is provided at https://han-dongheun.github.io/ForceGrip.

2025-03-11T05:39:07Z 11 pages, 11 figures. Accepted to SIGGRAPH Conference Papers '25. Project page: https://han-dongheun.github.io/ForceGrip SIGGRAPH Conference Papers '25, August 10-14, 2025, Vancouver, BC, Canada DongHeun Han Byungmin Kim RoUn Lee KyeongMin Kim Hyoseok Hwang HyeongYeop Kang 10.1145/3721238.3730738 http://arxiv.org/abs/2507.02393v1 PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection 2025-07-03T07:46:39Z

Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.

2025-07-03T07:46:39Z 18 pages, 16 figures Seokyeong Lee Sithu Aung Junyong Choi Seungryong Kim Ig-Jae Kim Junghyun Cho http://arxiv.org/abs/2507.02257v1 Gbake: Baking 3D Gaussian Splats into Reflection Probes 2025-07-03T03:09:19Z

The growing popularity of 3D Gaussian Splatting has created the need to integrate traditional computer graphics techniques and assets in splatted environments. Since 3D Gaussian primitives encode lighting and geometry jointly as appearance, meshes are relit improperly when inserted directly in a mixture of 3D Gaussians and thus appear noticeably out of place. We introduce GBake, a specialized tool for baking reflection probes from Gaussian-splatted scenes that enables realistic reflection mapping of traditional 3D meshes in the Unity game engine.

2025-07-03T03:09:19Z SIGGRAPH 2025 Posters Stephen Pasch Joel K. Salzman Changxi Zheng 10.1145/3721250.3742978 http://arxiv.org/abs/2506.17301v2 FramePrompt: In-context Controllable Animation with Zero Structural Changes 2025-07-02T16:33:38Z

Generating controllable character animation from a reference image and motion guidance remains a challenging task due to the inherent difficulty of injecting appearance and motion cues into video diffusion models. Prior works often rely on complex architectures, explicit guider modules, or multi-stage processing pipelines, which increase structural overhead and hinder deployment. Inspired by the strong visual context modeling capacity of pre-trained video diffusion transformers, we propose FramePrompt, a minimalist yet powerful framework that treats reference images, skeleton-guided motion, and target video clips as a unified visual sequence. By reformulating animation as a conditional future prediction task, we bypass the need for guider networks and structural modifications. Experiments demonstrate that our method significantly outperforms representative baselines across various evaluation metrics while also simplifying training. Our findings highlight the effectiveness of sequence-level visual conditioning and demonstrate the potential of pre-trained models for controllable animation without architectural changes.

2025-06-17T22:06:20Z Project page: https://frameprompt.github.io/ Guian Fang Yuchao Gu Mike Zheng Shou http://arxiv.org/abs/2504.05750v3 Radiative Backpropagation with Non-Static Geometry 2025-07-02T11:18:21Z

Radiative backpropagation-based (RB) methods efficiently compute reverse-mode derivatives in physically-based differentiable rendering by simulating the propagation of differential radiance. A key assumption is that differential radiance is transported like normal radiance. We observe that this holds only when scene geometry is static and demonstrate that current implementations of radiative backpropagation produce biased gradients when scene parameters change geometry. In this work, we derive the differential transport equation without assuming static geometry. An immediate consequence is that the parameterization matters when the sampling process is not differentiated: only surface integrals allow a local formulation of the derivatives, i.e., one in which moving surfaces do not affect the entire path geometry. While considerable effort has been devoted to handling discontinuities resulting from moving geometry, we show that a biased interior derivative compromises even the simplest inverse rendering tasks, regardless of discontinuities. An implementation based on our derivation leads to systematic convergence to the reference solution in the same setting and provides unbiased RB interior derivatives for path-space differentiable rendering.

2025-04-08T07:26:50Z EGSR 2025 Eurographics Symposium on Rendering (2025) Markus Worchel Ugo Finnendahl Marc Alexa 10.2312/sr.20251198 http://arxiv.org/abs/2408.12601v2 DreamCinema: Cinematic Transfer with Free Camera and 3D Character 2025-07-02T06:39:01Z

We are living in a flourishing era of digital media, where everyone has the potential to become a personal filmmaker. Current research on video generation suggests a promising avenue for controllable film creation in pixel space using diffusion models. However, the reliance on overly verbose prompts and insufficient focus on cinematic elements (e.g., camera movement) results in videos that lack cinematic quality. Furthermore, the absence of 3D modeling often leads to failures in video generation, such as inconsistent character models across frames, ultimately hindering the immersive experience for viewers. In this paper, we propose a new framework for film creation, DreamCinema, which is designed for user-friendly, 3D space-based film creation with generative models. Specifically, we decompose 3D film creation into four key elements: 3D character, driven motion, camera movement, and environment. We extract the latter three elements from user-specified film shots and generate the 3D character using a generative model based on a provided image. To seamlessly recombine these elements and ensure smooth film creation, we propose structure-guided character animation, shape-aware camera movement optimization, and environment-aware generative refinement. Extensive experiments demonstrate the effectiveness of our method in generating high-quality films with free camera and 3D characters.

2024-08-22T17:59:44Z Project page: https://liuff19.github.io/DreamCinema Weiliang Chen Fangfu Liu Diankun Wu Haowen Sun Jiwen Lu Yueqi Duan