http://arxiv.org/abs/2505.20129v3
Agentic 3D Scene Generation with Spatially Contextualized VLMs
Updated: 2025-07-04T15:28:37Z
Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.
Project page: https://spatctxvlm.github.io/project_page/.
Published: 2025-05-26T15:28:17Z
Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

http://arxiv.org/abs/2505.06227v2
Anymate: A Dataset and Baselines for Learning 3D Object Rigging
Updated: 2025-07-04T02:11:50Z
Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information -- 70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at https://anymate3d.github.io/.
Published: 2025-05-09T17:59:33Z
Comment: SIGGRAPH 2025. Project page: https://anymate3d.github.io/
Authors: Yufan Deng, Yuhao Zhang, Chen Geng, Shangzhe Wu, Jiajun Wu

http://arxiv.org/abs/2411.12089v3
FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting
Updated: 2025-07-03T22:14:28Z
In the real world, objects reveal internal textures when sliced or cut, yet this behavior is not well-studied in 3D generation tasks today. For example, slicing a virtual 3D watermelon should reveal flesh and seeds.
Given that no available dataset captures an object's full internal structure and collecting data from all slices is impractical, generative methods become the obvious approach. However, current 3D generation and inpainting methods often focus on visible appearance and overlook internal textures. To bridge this gap, we introduce FruitNinja, the first method to generate internal textures for 3D objects undergoing geometric and topological changes. Our approach produces objects via 3D Gaussian Splatting (3DGS) with both surface and interior textures synthesized, enabling real-time slicing and rendering without additional optimization. FruitNinja leverages a pre-trained diffusion model to progressively inpaint cross-sectional views and applies voxel-grid-based smoothing to achieve cohesive textures throughout the object. Our OpaqueAtom GS strategy overcomes 3DGS limitations by employing densely distributed opaque Gaussians, avoiding biases toward larger particles that destabilize training and sharp color transitions for fine-grained textures. Experimental results show that FruitNinja substantially outperforms existing approaches, showcasing unmatched visual quality in real-time rendered internal views across arbitrary geometry manipulations.
Published: 2024-11-18T22:00:19Z
Comment: Accepted to CVPR 2025. Project page: https://fanguw.github.io/FruitNinja3D
Authors: Fangyu Wu, Yuhao Chen

http://arxiv.org/abs/2507.03170v1
ASCRIBE-XR: Virtual Reality for Visualization of Scientific Imagery
Updated: 2025-07-03T20:52:40Z
ASCRIBE-XR, a novel computational platform designed to facilitate the visualization and exploration of 3D volumetric data and mesh data in the context of synchrotron experiments, is described. Using Godot and PC-VR technologies, the platform enables users to dynamically load and manipulate 3D data sets to gain deeper insights into their research.
The program's multi-user capabilities, enabled through WebRTC and MQTT, allow multiple users to share data and visualize together in real time, promoting a more interactive and engaging research experience. We describe the design and implementation of ASCRIBE-XR, highlighting its key features and capabilities. We also discuss its utility in the context of synchrotron research, including examples of its application and potential benefits for the scientific community.
Published: 2025-07-03T20:52:40Z
Authors: Ronald J. Pandolfi, Jeffrey J. Donatelli, Julian Todd, Daniela Ushizima

http://arxiv.org/abs/2507.03166v1
Image-driven Robot Drawing with Rapid Lognormal Movements
Updated: 2025-07-03T20:51:27Z
Large image generation and vision models, combined with differentiable rendering technologies, have become powerful tools for generating paths that can be drawn or painted by a robot. However, these tools often overlook the intrinsic physicality of the human drawing/writing act, which is usually executed with skillful hand/arm gestures. Taking this into account is important for the visual aesthetics of the results and for the development of closer and more intuitive artist-robot collaboration scenarios. We present a method that bridges this gap by enabling gradient-based optimization of natural human-like motions guided by cost functions defined in image space. To this end, we use the sigma-lognormal model of human hand/arm movements, with an adaptation that enables its use in conjunction with a differentiable vector graphics (DiffVG) renderer. We demonstrate how this pipeline can be used to generate feasible trajectories for a robot by combining image-driven objectives with a minimum-time smoothing criterion.
We demonstrate applications with generation and robotic reproduction of synthetic graffiti as well as image abstraction.
Published: 2025-07-03T20:51:27Z
Comment: Accepted at IEEE RO-MAN 2025
Authors: Daniel Berio, Guillaume Clivaz, Michael Stroh, Oliver Deussen, Réjean Plamondon, Sylvain Calinon, Frederic Fol Leymarie

http://arxiv.org/abs/2501.04011v2
Adaptive Algebraic Reuse of Reordering in Cholesky Factorization with Dynamic Sparsity Pattern
Updated: 2025-07-03T16:09:51Z
Cholesky linear solvers are a critical bottleneck in challenging applications within computer graphics and scientific computing. These applications include but are not limited to elastodynamic barrier methods such as Incremental Potential Contact (IPC), and geometric operations such as remeshing and morphology. In these contexts, the sparsity patterns of the linear systems frequently change across successive calls to the Cholesky solver, necessitating repeated symbolic analyses that dominate the overall solver runtime.
To address this bottleneck, we evaluate our method on over 150,000 linear systems generated from diverse nonlinear problems with dynamic sparsity changes in Incremental Potential Contact (IPC) and patch remeshing on a wide range of triangular meshes of various sizes. Our analysis using three leading sparse Cholesky libraries, Intel MKL Pardiso, SuiteSparse CHOLMOD, and Apple Accelerate, reveals that the primary performance constraint lies in the symbolic re-ordering phase of the solver. Recognizing this, we introduce Parth, an innovative re-ordering method designed to adaptively update ordering vectors only where local connectivity changes occur. Parth employs a novel hierarchical graph decomposition algorithm to break down the dual graph of the input matrix into fine-grained subgraphs, facilitating the selective reuse of fill-reducing orderings when sparsity patterns exhibit temporal coherence.
Our extensive evaluation demonstrates that Parth achieves up to a 255x and 13x speedup in fill-reducing ordering for our IPC and remeshing benchmarks, respectively, and a 6.85x and 10.7x acceleration in symbolic analysis. These enhancements translate to up to 2.95x and 5.89x reduction in overall solver runtime. Additionally, Parth's integration requires only three lines of code, resulting in significant computational savings without requiring changes to the computational stack.
Published: 2024-12-16T23:04:30Z
Authors: Behrooz Zarebavani, Danny M. Kaufman, David I. W. Levin, Maryam Mehri Dehnavi
DOI: 10.1145/3731179

http://arxiv.org/abs/2503.05511v5
Free Your Hands: Lightweight Turntable-Based Object Capture Pipeline
Updated: 2025-07-03T15:08:46Z
Novel view synthesis (NVS) from multiple captured photos of an object is a widely studied problem. Achieving high quality typically requires dense sampling of input views, which can lead to frustrating manual labor. Manually positioning cameras to maintain an optimal desired distribution can be difficult for humans, and if a good distribution is found, it is not easy to replicate. Additionally, the captured data can suffer from motion blur and defocus due to human error. In this paper, we use a lightweight object capture pipeline to reduce the manual workload and standardize the acquisition setup, with a consumer turntable to carry the object and a tripod to hold the camera. Of course, turntables and gantry systems have been frequently used to automatically capture dense samples under various views and lighting conditions; the key difference is that we use a turntable under natural environment lighting. This way, we can easily capture hundreds of valid images in several minutes without hands-on effort. However, in the object reference frame, the light conditions vary (rotate); this does not match the assumptions of standard NVS methods like 3D Gaussian splatting (3DGS).
We design a neural radiance representation conditioned on light rotations, which addresses this issue and allows rendering with novel light rotations as an additional benefit. We further study the behavior of rotations and find optimal capturing strategies. We demonstrate our pipeline using 3DGS as the underlying framework, achieving higher quality and showcasing the method's potential for novel lighting and harmonization tasks.
Published: 2025-03-07T15:27:44Z
Authors: Jiahui Fan, Fujun Luan, Jian Yang, Miloš Hašan, Beibei Wang

http://arxiv.org/abs/2507.02674v1
Real-time Image-based Lighting of Glints
Updated: 2025-07-03T14:38:37Z
Image-based lighting is a widely used technique to reproduce shading under real-world lighting conditions, especially in real-time rendering applications. A particularly challenging scenario involves materials exhibiting a sparkling or glittering appearance, caused by discrete microfacets scattered across their surface. In this paper, we propose an efficient approximation for image-based lighting of glints, enabling fully dynamic material properties and environment maps. Our novel approach is grounded in real-time glint rendering under area light illumination and employs standard environment map filtering techniques. Crucially, our environment map filtering process is sufficiently fast to be executed on a per-frame basis. Our method assumes that the environment map is partitioned into few homogeneous regions of constant radiance. By filtering the corresponding indicator functions with the normal distribution function, we obtain the probabilities for individual microfacets to reflect light from each region. During shading, these probabilities are utilized to hierarchically sample a multinomial distribution, facilitated by our novel dual-gated Gaussian approximation of binomial distributions.
We validate that our real-time approximation is close to ground-truth renderings for a range of material properties and lighting conditions, and demonstrate robust and stable performance, with little overhead over rendering glints from a single directional light. Compared to rendering smooth materials without glints, our approach requires twice as much memory to store the prefiltered environment map.
Published: 2025-07-03T14:38:37Z
Authors: Tom Kneiphof, Reinhard Klein
DOI: 10.1111/cgf.70175

http://arxiv.org/abs/2503.05020v2
GRIP: A General Robotic Incremental Potential Contact Simulation Dataset for Unified Deformable-Rigid Coupled Grasping
Updated: 2025-07-03T12:20:11Z
Grasping is fundamental to robotic manipulation, and recent advances in large-scale grasping datasets have provided essential training data and evaluation benchmarks, accelerating the development of learning-based methods for robust object grasping. However, most existing datasets exclude deformable bodies due to the lack of scalable, robust simulation pipelines, limiting the development of generalizable models for compliant grippers and soft manipulands. To address these challenges, we present GRIP, a General Robotic Incremental Potential contact simulation dataset for universal grasping. GRIP leverages an optimized Incremental Potential Contact (IPC)-based simulator for multi-environment data generation, achieving up to 48x speedup while ensuring efficient, intersection- and inversion-free simulations for compliant grippers and deformable objects. Our fully automated pipeline generates and evaluates diverse grasp interactions across 1,200 objects and 100,000 grasp poses, incorporating both soft and rigid grippers.
The GRIP dataset enables applications such as neural grasp generation and stress field prediction.
Published: 2025-03-06T22:46:13Z
Comment: We release GRIP to advance research in robotic manipulation, soft-gripper control, and physics-driven simulation at: https://bell0o.github.io/GRIP/
Authors: Siyu Ma, Wenxin Du, Chang Yu, Ying Jiang, Zeshun Zong, Tianyi Xie, Yunuo Chen, Yin Yang, Xuchen Han, Chenfanfu Jiang

http://arxiv.org/abs/2503.08061v4
ForceGrip: Reference-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation
Updated: 2025-07-03T08:24:20Z
Realistic hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on kinematic approaches or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users' intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user's grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios (randomizing object shapes, wrist movements, and trigger input flows) to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods.
Demo videos are available as supplementary material and the code is provided at https://han-dongheun.github.io/ForceGrip.
Published: 2025-03-11T05:39:07Z
Comment: 11 pages, 11 figures. Accepted to SIGGRAPH Conference Papers '25. Project page: https://han-dongheun.github.io/ForceGrip
Journal reference: SIGGRAPH Conference Papers '25, August 10-14, 2025, Vancouver, BC, Canada
Authors: DongHeun Han, Byungmin Kim, RoUn Lee, KyeongMin Kim, Hyoseok Hwang, HyeongYeop Kang
DOI: 10.1145/3721238.3730738

http://arxiv.org/abs/2507.02393v1
PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection
Updated: 2025-07-03T07:46:39Z
Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
Published: 2025-07-03T07:46:39Z
Comment: 18 pages, 16 figures
Authors: Seokyeong Lee, Sithu Aung, Junyong Choi, Seungryong Kim, Ig-Jae Kim, Junghyun Cho

http://arxiv.org/abs/2507.02257v1
Gbake: Baking 3D Gaussian Splats into Reflection Probes
Updated: 2025-07-03T03:09:19Z
The growing popularity of 3D Gaussian Splatting has created the need to integrate traditional computer graphics techniques and assets in splatted environments.
Since 3D Gaussian primitives encode lighting and geometry jointly as appearance, meshes are relit improperly when inserted directly in a mixture of 3D Gaussians and thus appear noticeably out of place. We introduce GBake, a specialized tool for baking reflection probes from Gaussian-splatted scenes that enables realistic reflection mapping of traditional 3D meshes in the Unity game engine.
Published: 2025-07-03T03:09:19Z
Comment: SIGGRAPH 2025 Posters
Authors: Stephen Pasch, Joel K. Salzman, Changxi Zheng
DOI: 10.1145/3721250.3742978

http://arxiv.org/abs/2506.17301v2
FramePrompt: In-context Controllable Animation with Zero Structural Changes
Updated: 2025-07-02T16:33:38Z
Generating controllable character animation from a reference image and motion guidance remains a challenging task due to the inherent difficulty of injecting appearance and motion cues into video diffusion models. Prior works often rely on complex architectures, explicit guider modules, or multi-stage processing pipelines, which increase structural overhead and hinder deployment. Inspired by the strong visual context modeling capacity of pre-trained video diffusion transformers, we propose FramePrompt, a minimalist yet powerful framework that treats reference images, skeleton-guided motion, and target video clips as a unified visual sequence. By reformulating animation as a conditional future prediction task, we bypass the need for guider networks and structural modifications. Experiments demonstrate that our method significantly outperforms representative baselines across various evaluation metrics while also simplifying training.
Our findings highlight the effectiveness of sequence-level visual conditioning and demonstrate the potential of pre-trained models for controllable animation without architectural changes.
Published: 2025-06-17T22:06:20Z
Comment: Project page: https://frameprompt.github.io/
Authors: Guian Fang, Yuchao Gu, Mike Zheng Shou

http://arxiv.org/abs/2504.05750v3
Radiative Backpropagation with Non-Static Geometry
Updated: 2025-07-02T11:18:21Z
Radiative backpropagation-based (RB) methods efficiently compute reverse-mode derivatives in physically-based differentiable rendering by simulating the propagation of differential radiance. A key assumption is that differential radiance is transported like normal radiance. We observe that this holds only when scene geometry is static and demonstrate that current implementations of radiative backpropagation produce biased gradients when scene parameters change geometry. In this work, we derive the differential transport equation without assuming static geometry. An immediate consequence is that the parameterization matters when the sampling process is not differentiated: only surface integrals allow a local formulation of the derivatives, i.e., one in which moving surfaces do not affect the entire path geometry. While considerable effort has been devoted to handling discontinuities resulting from moving geometry, we show that a biased interior derivative compromises even the simplest inverse rendering tasks, regardless of discontinuities.
An implementation based on our derivation leads to systematic convergence to the reference solution in the same setting and provides unbiased RB interior derivatives for path-space differentiable rendering.
Published: 2025-04-08T07:26:50Z
Comment: EGSR 2025
Journal reference: Eurographics Symposium on Rendering (2025)
Authors: Markus Worchel, Ugo Finnendahl, Marc Alexa
DOI: 10.2312/sr.20251198

http://arxiv.org/abs/2408.12601v2
DreamCinema: Cinematic Transfer with Free Camera and 3D Character
Updated: 2025-07-02T06:39:01Z
We are living in a flourishing era of digital media, where everyone has the potential to become a personal filmmaker. Current research on video generation suggests a promising avenue for controllable film creation in pixel space using diffusion models. However, the reliance on overly verbose prompts and insufficient focus on cinematic elements (e.g., camera movement) results in videos that lack cinematic quality. Furthermore, the absence of 3D modeling often leads to failures in video generation, such as inconsistent character models across frames, ultimately hindering the immersive experience for viewers. In this paper, we propose a new framework for film creation, DreamCinema, which is designed for user-friendly, 3D space-based film creation with generative models. Specifically, we decompose 3D film creation into four key elements: 3D character, driven motion, camera movement, and environment. We extract the latter three elements from user-specified film shots and generate the 3D character using a generative model based on a provided image. To seamlessly recombine these elements and ensure smooth film creation, we propose structure-guided character animation, shape-aware camera movement optimization, and environment-aware generative refinement. Extensive experiments demonstrate the effectiveness of our method in generating high-quality films with free camera and 3D characters.
Published: 2024-08-22T17:59:44Z
Comment: Project page: https://liuff19.github.io/DreamCinema
Authors: Weiliang Chen, Fangfu Liu, Diankun Wu, Haowen Sun, Jiwen Lu, Yueqi Duan