http://arxiv.org/abs/2505.22400v1
STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
2025-05-28T14:26:41Z
Although dynamic scene reconstruction has long been a fundamental challenge in 3D vision, the recent emergence of 3D Gaussian Splatting (3DGS) offers a promising direction by enabling high-quality, real-time rendering through explicit Gaussian primitives. However, existing 3DGS-based methods for dynamic reconstruction often suffer from \textit{spatio-temporal incoherence} during initialization, where canonical Gaussians are constructed by aggregating observations from multiple frames without temporal distinction. This results in spatio-temporally entangled representations, making it difficult to model dynamic motion accurately. To overcome this limitation, we propose \textbf{STDR} (Spatio-Temporal Decoupling for Real-time rendering), a plug-and-play module that learns spatio-temporal probability distributions for each Gaussian. STDR introduces a spatio-temporal mask, a separated deformation field, and a consistency regularization to jointly disentangle spatial and temporal patterns. Extensive experiments demonstrate that incorporating our module into existing 3DGS-based dynamic scene reconstruction frameworks leads to notable improvements in both reconstruction quality and spatio-temporal consistency across synthetic and real-world benchmarks.
2025-05-28T14:26:41Z
Zehao Li, Hao Jiang, Yujun Cai, Jianing Chen, Baolong Bi, Shuqin Gao, Honglong Zhao, Yiwei Wang, Tianlu Mao, Zhaoqi Wang

http://arxiv.org/abs/2506.00222v1
Power-Linear Polar Directional Fields
2025-05-28T11:01:06Z
We introduce a novel method for directional-field design on meshes, enabling users to specify singularities at any location on a mesh. Our method uses a piecewise power-linear representation for phase and scale, offering precise control over field topology.
The resulting fields are smooth and accommodate any singularity index and field symmetry. With this representation, we mitigate the artifacts caused by coarse or uneven meshes. We showcase our approach on meshes with diverse topologies and triangle qualities.
2025-05-28T11:01:06Z
Accepted to SIGGRAPH 2025 Conference Track
Jiabao Brad Wang, Amir Vaxman

http://arxiv.org/abs/2505.21946v1
Fluid Simulation on Vortex Particle Flow Maps
2025-05-28T03:56:38Z
We propose the Vortex Particle Flow Map (VPFM) method to simulate incompressible flow with complex vortical evolution in the presence of dynamic solid boundaries. The core insight of our approach is that vorticity is an ideal quantity for evolution on particle flow maps, enabling significantly longer flow map distances compared to other fluid quantities like velocity or impulse. To achieve this goal, we developed a hybrid Eulerian-Lagrangian representation that evolves vorticity and flow map quantities on vortex particles, while reconstructing velocity on a background grid. The method integrates three key components: (1) a vorticity-based particle flow map framework, (2) an accurate Hessian evolution scheme on particles, and (3) a solid boundary treatment for no-through and no-slip conditions in VPFM. These components collectively allow a substantially longer flow map length (3-12 times longer) than the state-of-the-art, enhancing vorticity preservation over extended spatiotemporal domains.
We validated the performance of VPFM through diverse simulations, demonstrating its effectiveness in capturing complex vortex dynamics and turbulence phenomena.
2025-05-28T03:56:38Z
ACM Transactions on Graphics (SIGGRAPH 2025), 24 pages
Sinan Wang, Junwei Zhou, Fan Feng, Zhiqi Li, Yuchen Sun, Duowen Chen, Greg Turk, Bo Zhu
10.1145/3731198

http://arxiv.org/abs/2505.21925v1
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
2025-05-28T03:20:46Z
We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
2025-05-28T03:20:46Z
Accepted to SIGGRAPH 2025. Project page: https://microsoft.github.io/renderformer
ACM SIGGRAPH 2025 Conference Papers
Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
10.1145/3721238.3730595

http://arxiv.org/abs/2505.21488v1
Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
2025-05-27T17:54:24Z
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models.
Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
2025-05-27T17:54:24Z
SIGGRAPH 2025. Project page: https://omer11a.github.io/be-decisive/
Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or

http://arxiv.org/abs/2505.21437v1
CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
2025-05-27T17:11:50Z
Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation.
Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
2025-05-27T17:11:50Z
Project page: https://phj128.github.io/page/CoDA/index.html
Huaijin Pi, Zhi Cen, Zhiyang Dou, Taku Komura

http://arxiv.org/abs/2505.21335v1
Structure from Collision
2025-05-27T15:30:01Z
Recent advancements in neural 3D representations, such as neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS), have enabled the accurate estimation of 3D structures from multiview images.
However, this capability is limited to estimating the visible external structure, and identifying the invisible internal structure hidden behind the surface is difficult. To overcome this limitation, we address a new task called Structure from Collision (SfC), which aims to estimate the structure (including the invisible internal structure) of an object from appearance changes during collision. To solve this problem, we propose a novel model called SfC-NeRF that optimizes the invisible internal structure of an object through a video sequence under physical, appearance (i.e., visible external structure)-preserving, and keyframe constraints. In particular, to avoid falling into undesirable local optima owing to its ill-posed nature, we propose volume annealing; that is, searching for global optima by repeatedly reducing and expanding the volume. Extensive experiments on 115 objects involving diverse structures (i.e., various cavity shapes, locations, and sizes) and material properties revealed the properties of SfC and demonstrated the effectiveness of the proposed SfC-NeRF.
2025-05-27T15:30:01Z
Accepted to CVPR 2025 (Highlight). Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/sfc/
Takuhiro Kaneko

http://arxiv.org/abs/2505.21319v1
efunc: An Efficient Function Representation without Neural Networks
2025-05-27T15:16:56Z
Function fitting/approximation plays a fundamental role in computer graphics and other engineering applications. While recent advances have explored neural networks to address this task, these methods often rely on architectures with many parameters, limiting their practical applicability. In contrast, we pursue high-quality function approximation using parameter-efficient representations that eliminate the dependency on neural networks entirely. We first propose a novel framework for continuous function modeling. Most existing works can be formulated using this framework.
We then introduce a compact function representation, which is based on polynomials interpolated using radial basis functions, bypassing both neural networks and complex/hierarchical data structures. We also develop memory-efficient CUDA-optimized algorithms that reduce computational time and memory consumption to less than 10% compared to conventional automatic differentiation frameworks. Finally, we validate our representation and optimization pipeline through extensive experiments on 3D signed distance functions (SDFs). The proposed representation achieves comparable or superior performance to state-of-the-art techniques (e.g., octree/hash-grid techniques) with significantly fewer parameters.
2025-05-27T15:16:56Z
Project website: https://efunc.github.io/efunc/
Biao Zhang, Peter Wonka

http://arxiv.org/abs/2505.21252v1
Hand Shadow Art: A Differentiable Rendering Perspective
2025-05-27T14:32:42Z
Shadow art is an exciting form of sculptural art that produces captivating artistic effects through the 2D shadows cast by 3D shapes. Hand shadows, also known as shadow puppetry or shadowgraphy, involve creating various shapes and figures using your hands and fingers to cast meaningful shadows on a wall. In this work, we propose a differentiable rendering-based approach to deform hand models such that they cast a shadow consistent with a desired target image and the associated lighting configuration. We showcase the results of shadows cast by a pair of hands and the interpolation of hand poses between two desired shadow images.
We believe that this work will be a useful tool for the graphics community.
2025-05-27T14:32:42Z
Published in Pacific Graphics 2023
Aalok Gangopadhyay, Prajwal Singh, Ashish Tiwari, Shanmuganathan Raman
10.2312/pg.20231279

http://arxiv.org/abs/2505.21146v1
IKMo: Image-Keyframed Motion Generation with Trajectory-Pose Conditioned Motion Diffusion Model
2025-05-27T12:57:37Z
Existing human motion generation methods with trajectory and pose inputs apply global processing to both modalities, leading to suboptimal outputs. In this paper, we propose IKMo, an image-keyframed motion generation method based on the diffusion model with trajectory and pose being decoupled. The trajectory and pose inputs go through a two-stage conditioning framework. In the first stage, a dedicated optimization module is applied to refine the inputs. In the second stage, trajectory and pose are encoded via a Trajectory Encoder and a Pose Encoder in parallel. Then, motion with high spatial and semantic fidelity is guided by a motion ControlNet, which processes the fused trajectory and pose data. Experimental results on the HumanML3D and KIT-ML datasets demonstrate that the proposed method outperforms the state of the art on all metrics under trajectory-keyframe constraints. In addition, MLLM-based agents are implemented to pre-process model inputs. Given texts and keyframe images from users, the agents extract motion descriptions, keyframe poses, and trajectories as the optimized inputs into the motion generation model. We conduct a user study with 10 participants. The experimental results show that the MLLM-based agents' pre-processing makes the generated motion more in line with users' expectations.
We believe that the proposed method improves both the fidelity and controllability of motion generation by the diffusion model.
2025-05-27T12:57:37Z
Yang Zhao, Yan Zhang, Xubo Yang

http://arxiv.org/abs/2505.07843v2
PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation
2025-05-27T02:41:23Z
In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on the given image. With limited training data, existing work focused on image-centric enhancement. However, this neglects the diversity of layouts and fails to cope with shape-variant elements or diverse design intents in generalized settings. To this end, we propose a layout-centric approach that leverages layout knowledge implicit in large language models (LLMs) to create posters for omnifarious purposes, hence the name PosterO. Specifically, it structures layouts from datasets as trees in SVG language by universal shape, design intent vectorization, and hierarchical node representation. Then, it applies LLMs during inference to predict new layout trees by in-context learning with intent-aligned example selection. After layout trees are generated, we can seamlessly realize them into poster designs by editing the chat with LLMs. Extensive experimental results have demonstrated that PosterO can generate visually appealing layouts for given images, achieving new state-of-the-art performance across various benchmarks. To further explore PosterO's abilities under the generalized settings, we built PStylish7, the first dataset with multi-purpose posters and various-shaped elements, further offering a challenging test for advanced research.
2025-05-06T18:42:24Z
Accepted to CVPR 2025. Minor editing issue fixed.
Code and dataset are available at https://thekinsley.github.io/PosterO/
HsiaoYuan Hsu, Yuxin Peng

http://arxiv.org/abs/2505.20473v1
Stochastic Preconditioning for Neural Field Optimization
2025-05-26T19:13:41Z
Neural fields are a highly effective representation across visual computing. This work observes that fitting these fields is greatly improved by incorporating spatial stochasticity during training, and that this simple technique can replace or even outperform custom-designed hierarchies and frequency space constructions. The approach is formalized as implicitly operating on a blurred version of the field, evaluated in-expectation by sampling with Gaussian-distributed offsets. Querying the blurred field during optimization greatly improves convergence and robustness, akin to the role of preconditioners in numerical linear algebra. This implicit, sampling-based perspective fits naturally into the neural field paradigm, comes at no additional cost, and is extremely simple to implement. We describe the basic theory of this technique, including details such as handling boundary conditions, and extending to a spatially-varying blur. Experiments demonstrate this approach on representations including coordinate MLPs, neural hashgrids, triplanes, and more, across tasks including surface reconstruction and radiance fields. In settings where custom-designed hierarchies have already been developed, stochastic preconditioning nearly matches or improves their performance with a simple and unified approach; in settings without existing hierarchies it provides an immediate boost to quality and robustness.
2025-05-26T19:13:41Z
15 pages, 11 figures, SIGGRAPH 2025 (Journal track)
Selena Ling, Merlin Nimier-David, Alec Jacobson, Nicholas Sharp
10.1145/3731161

http://arxiv.org/abs/2505.20434v1
SZ Sequences: Binary-Based $(0, 2^q)$-Sequences
2025-05-26T18:30:40Z
Low-discrepancy sequences have seen widespread adoption in computer graphics thanks to their superior convergence rates.
Since rendering integrals often comprise products of lower-dimensional integrals, recent work has focused on developing sequences that are also well-distributed in lower-dimensional projections. To this end, we introduce a novel construction of binary-based (0, 4)-sequences; that is, progressive fully multi-stratified sequences of 4D points, and extend the idea to higher power-of-two dimensions. We further show that not only is it possible to nest lower-dimensional sequences in higher-dimensional ones -- for example, embedding a (0, 2)-sequence within our (0, 4)-sequence -- but that we can ensemble two (0, 2)-sequences into a (0, 4)-sequence, four (0, 4)-sequences into a (0, 16)-sequence, and so on. Such sequences can provide excellent convergence rates when integrals include lower-dimensional integration problems in 2, 4, 16, ... dimensions. Our construction is based on using 2$\times$2 block matrices as symbols to construct larger matrices that potentially generate a sequence with the target (0, s)-sequence in base $s$ property. We describe how to search for suitable alphabets and identify two distinct, cross-related alphabets of block symbols, which we call S and Z, hence \emph{SZ} for the resulting family of sequences. Given the alphabets, we construct candidate generator matrices and search for valid sets of matrices. We then infer a formula to construct full-resolution (64-bit) matrices. Our binary generator matrices allow highly efficient implementation using bitwise operations, and can be used as a drop-in replacement for Sobol matrices in existing applications. We compare SZ sequences to state-of-the-art low discrepancy sequences, and demonstrate mean relative squared error improvements up to $1.93\times$ in common rendering applications.
2025-05-26T18:30:40Z
Abdalla G. M. Ahmed, Matt Pharr, Victor Ostromoukhov, Hui Huang

http://arxiv.org/abs/2505.20421v1
Precise Gradient Discontinuities in Neural Fields for Subspace Physics
2025-05-26T18:15:04Z
Discontinuities in spatial derivatives appear in a wide range of physical systems, from creased thin sheets to materials with sharp stiffness transitions. Accurately modeling these features is essential for simulation but remains challenging for traditional mesh-based methods, which require discontinuity-aligned remeshing -- entangling geometry with simulation and hindering generalization across shape families.
Neural fields offer an appealing alternative by encoding basis functions as smooth, continuous functions over space, enabling simulation across varying shapes. However, their smoothness makes them poorly suited for representing gradient discontinuities. Prior work addresses discontinuities in function values, but capturing sharp changes in spatial derivatives while maintaining function continuity has received little attention.
We introduce a neural field construction that captures gradient discontinuities without baking their location into the network weights. By augmenting input coordinates with a smoothly clamped distance function in a lifting framework, we enable encoding of gradient jumps at evolving interfaces.
This design supports discretization-agnostic simulation of parametrized shape families with heterogeneous materials and evolving creases, enabling new reduced-order capabilities such as shape morphing, interactive crease editing, and simulation of soft-rigid hybrid structures. We further demonstrate that our method can be combined with previous lifting techniques to jointly capture both gradient and value discontinuities, supporting simultaneous cuts and creases within a unified model.
2025-05-26T18:15:04Z
Mengfei Liu, Yue Chang, Zhecheng Wang, Peter Yichen Chen, Eitan Grinspun

http://arxiv.org/abs/2505.20271v1
In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation
2025-05-26T17:49:10Z
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user's intent through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with a subject that aligns with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head "attention reweighting" across different heads that amplifies prompt controllability through differential attention prioritization.
Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.
2025-05-26T17:49:10Z
Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee