http://arxiv.org/abs/2605.30925v1 MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance 2026-05-29T07:14:48Z

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.
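
The attention-amplification idea can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the single attention row, the hand-picked target token, and the fixed gain are all assumptions made for illustration, whereas MultiAct selects tokens and layers with its auxiliary decision scheme. Adding log(gain) to a token's logit multiplies its unnormalized weight by gain before the softmax renormalizes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_attention(logits, target_tokens, gain):
    """Boost selected prompt tokens in one cross-attention row.

    Hypothetical helper: adds log(gain) to the logits of target_tokens,
    then renormalizes with softmax.
    """
    boost = math.log(gain)
    boosted = [x + boost if i in target_tokens else x
               for i, x in enumerate(logits)]
    return softmax(boosted)


# Toy attention row over three prompt tokens; token 0 dominates.
logits = [2.0, 0.5, 0.1]
base = softmax(logits)
boosted = amplify_attention(logits, target_tokens={1}, gain=3.0)
```

Because the row is renormalized, raising the weight of the underrepresented token necessarily lowers the share of the dominant one, which is the trade-off a real system would have to balance against motion realism.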

2026-05-29T07:14:48Z Accepted to SIGGRAPH 2026 conference. Project page: https://natsala13.github.io/multiact.github.io Nathan Sala Ofir Abramovich Ariel Shamir Daniel Cohen-Or Andreas Aristidou Sigal Raab http://arxiv.org/abs/2605.30863v1 DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction 2026-05-29T05:38:00Z

Dynamic scene reconstruction and novel view synthesis are fundamental to next-generation visual intelligence applications such as virtual reality, robotics, and digital twins. However, high-fidelity reconstruction of complex, time-varying scenes from arbitrary viewpoints remains a significant challenge. Existing dynamic 3DGS methods suffer from computational inefficiency, since they model all Gaussians as dynamic components. While recent decomposition-based approaches address this issue, they still struggle with degraded reconstruction quality and prolonged training time. To mitigate these limitations, we propose a novel dynamic reconstruction framework built upon an efficient static-dynamic decomposition strategy using a Feed-Forward Gaussian Splatting encoder and an optical flow model. By eliminating redundant computations on static regions, our method achieves state-of-the-art performance, outperforming existing baselines across rendering quality, training and rendering speed, and storage efficiency. Notably, on the Neural 3D dataset, our framework requires only 10 minutes for training and achieves a rendering speed of over 700 FPS on a single NVIDIA RTX 5090 GPU at a resolution of 1352x1014. Furthermore, our decomposition strategy eliminates the need for COLMAP preprocessing and enables deterministic initialization, thereby enhancing both efficiency and reproducibility.
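
One simple way to picture a flow-based static-dynamic split is to threshold the optical flow magnitude per pixel. The sketch below is an assumption for illustration only: the abstract does not state how the flow model's output is used, and the function names, the threshold, and the toy flow field are hypothetical.

```python
import math


def dynamic_mask(flow, threshold):
    """Mark a pixel as dynamic when its flow magnitude exceeds threshold.

    flow is a 2D grid of (dx, dy) displacements in pixels (assumed layout).
    """
    return [[math.hypot(dx, dy) > threshold for (dx, dy) in row]
            for row in flow]


def static_fraction(mask):
    """Fraction of pixels that would be treated as static."""
    cells = [is_dynamic for row in mask for is_dynamic in row]
    return cells.count(False) / len(cells)


# Toy 2x3 flow field: only the right column moves noticeably.
flow = [
    [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0)],
    [(0.0, 0.2), (0.0, 0.0), (2.0, 0.0)],
]
mask = dynamic_mask(flow, threshold=1.0)
```

In this toy field two thirds of the pixels are static, so a method that skips dynamic modeling for them would avoid most of the per-frame work.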

2026-05-29T05:38:00Z 23 pages, 9 figures, 7 tables Youngtae Han Sung-hwan Han Youngmin Yi http://arxiv.org/abs/2605.30819v1 Function2Scene: 3D Indoor Scene Layout from Functional Specifications 2026-05-29T04:10:15Z

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.

2026-05-29T04:10:15Z project page: https://function2scene.github.io/ Ruiqi Wang Qimin Chen Daniel Ritchie Angel X. Chang Manolis Savva Kai Wang Hao Zhang http://arxiv.org/abs/2605.29655v2 SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation 2026-05-29T03:58:01Z

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10x speedup over prior methods.
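
Saliency-guided centroidal Voronoi tessellation can be sketched in one dimension with Lloyd's algorithm, where each site moves to the saliency-weighted centroid of its cell. This is a toy under stated assumptions, not the paper's 3D pipeline: the saliency function, the sample grid, the site count, and the iteration budget are all hypothetical.

```python
import math


def saliency(x):
    """Hypothetical saliency: a bump centered at x = 0.2 over a flat floor."""
    return 1.0 + 9.0 * math.exp(-((x - 0.2) / 0.1) ** 2)


def lloyd_step(sites, samples):
    """One Lloyd iteration on [0, 1].

    Assign every sample to its nearest site, then move each site to the
    saliency-weighted centroid of the samples assigned to it.
    """
    weight_sum = [0.0] * len(sites)
    moment_sum = [0.0] * len(sites)
    for x in samples:
        k = min(range(len(sites)), key=lambda j: abs(x - sites[j]))
        w = saliency(x)
        weight_sum[k] += w
        moment_sum[k] += w * x
    # A site with an empty cell stays where it is.
    return [m / w if w > 0.0 else s
            for s, w, m in zip(sites, weight_sum, moment_sum)]


samples = [(i + 0.5) / 200 for i in range(200)]
initial_sites = [(k + 0.5) / 8 for k in range(8)]

sites = initial_sites
for _ in range(100):
    sites = lloyd_step(sites, samples)

# Spacing between neighboring sites; smaller gaps mean finer cells.
gaps = [b - a for a, b in zip(sites, sites[1:])]
```

Sites drift toward the high-saliency region, so cells there become smaller than cells in the flat region, and sorting the sites by position gives a deterministic ordering of the cells.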

2026-05-28T09:17:11Z Yuan Li Congyi Zhang Xifeng Gao Xiaohu Guo http://arxiv.org/abs/2605.30744v1 BijectiveRemesh: Maintaining Bijective Mappings for Data Transfer Across Remeshed Manifolds 2026-05-29T02:21:35Z

We introduce BijectiveRemesh, a robust algorithm for maintaining a continuous, bijective mapping across complex remeshing sequences on both 2D triangle surfaces and 3D tetrahedral meshes. Unlike traditional data transfer methods that rely on interpolation or projection, our approach constructs a mathematically rigorous composite map from the input mesh to the output mesh by chaining local bijective atlases defined for each primitive remeshing operation. Building upon successive self-parameterization, we introduce a Shared Scaffold structure for 2D triangle meshes that enforces global bijectivity through local orientation preservation. We extend this approach to handle edge splits, edge swaps, and vertex smoothing in addition to the original edge collapses. For 3D tetrahedral meshes, we generalize the local atlas construction using Steinitz's Theorem and Maxwell-Cremona lifting to ensure valid embeddings. This enables exact tracking of geometric entities, including points, curves, and surfaces, across remeshing, with applications from texture transfer to volumetric simulations.
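
The chaining of local bijections into one composite map can be sketched as follows. The `LocalMap` class is a hypothetical stand-in: it uses 1D affine maps with exact rational arithmetic, whereas the paper's atlases are maps on mesh patches. What the sketch shows is only the composition rule: apply the forward maps in operation order, and undo them by applying the inverses in reverse order.

```python
from fractions import Fraction


class LocalMap:
    """A bijection on 1D coordinates with an explicit inverse.

    Hypothetical stand-in for the local atlas of one remeshing operation.
    """

    def __init__(self, scale, shift):
        if scale == 0:
            raise ValueError("scale must be nonzero for the map to be bijective")
        self.scale = Fraction(scale)
        self.shift = Fraction(shift)

    def forward(self, p):
        return self.scale * p + self.shift

    def inverse(self, q):
        return (q - self.shift) / self.scale


def map_forward(chain, p):
    """Push a point from the input domain through every operation in order."""
    for local_map in chain:
        p = local_map.forward(p)
    return p


def map_backward(chain, q):
    """Pull a point back by applying the inverses in reverse order."""
    for local_map in reversed(chain):
        q = local_map.inverse(q)
    return q


chain = [LocalMap(2, 1), LocalMap(Fraction(1, 3), -4), LocalMap(-5, 0)]
p = Fraction(7, 10)
q = map_forward(chain, p)
```

With exact arithmetic the round trip `map_backward(chain, map_forward(chain, p))` returns `p` exactly, which mirrors the claim that a composition of bijections is itself a bijection.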

2026-05-29T02:21:35Z Leyi Zhu Michael Tao Yixin Hu Daniele Panozzo Denis Zorin http://arxiv.org/abs/2605.24700v3 SRUG: Shadow-Guided Relightable Urban Scene with Generation Model 2026-05-29T01:32:54Z

Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene's material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.

2026-05-23T18:37:46Z Yonghao Zhao Zexin Yin Jian Yang Beibei Wang Jin Xie http://arxiv.org/abs/2606.00137v1 Advances in Neural 3D Mesh Texturing: A Survey 2026-05-28T23:18:08Z

Texturing 3D meshes plays a vital role in determining the visual realism of digital objects and scenes. Although recent generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines. Neural 3D mesh texturing therefore continues to be an essential and active area of research. In this survey, we present a comprehensive review of recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. We first summarize key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, and then organize the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines. We further analyze common architectures and supervision strategies, review datasets and evaluation protocols, and discuss emerging applications, practical/commercial systems, and open challenges. Together, these insights provide a structured perspective on the current landscape and help guide future developments in learning-based 3D mesh texturing.

2026-05-28T23:18:08Z Eurographics STAR (Computer Graphics Forum), 2026. Project Page: https://sairajk.github.io/neural-mesh-texturing/ Eurographics STAR (State of The Art Report), Computer Graphics Forum, Volume 45, Number 2, 2026 Sai Raj Kishore Perla Hao Zhang Ali Mahdavi-Amiri 10.1111/cgf.70392 http://arxiv.org/abs/2605.30347v1 NeuROK: Generative 4D Neural Object Kinematics 2026-05-28T17:59:53Z

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics -- realistic temporal deformations of static objects under various physical conditions -- remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics, since we only need to consider the dynamics within a low-dimensional latent space, viewed from the perspective of Lagrangian mechanics in classical physics. We demonstrate the effectiveness and generality of this neural simulation framework across diverse dynamic object types, showing clear advantages over prior works. Project page: https://chen-geng.com/neurok

2026-05-28T17:59:53Z CVPR 2026 Chen Geng Guangzhao He Yue Gao Yunzhi Zhang Shangzhe Wu Jiajun Wu http://arxiv.org/abs/2605.30318v1 Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes 2026-05-28T17:55:09Z

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

2026-05-28T17:55:09Z Ruixiang Jiang Chang Wen Chen http://arxiv.org/abs/2605.30310v1 City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images 2026-05-28T17:53:26Z

City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar representations often fail to recover 3D meshes ready for simulation due to incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by a distributed 3D dense reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. These partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As demonstrated by our qualitative and quantitative results, our proposed method yields high-fidelity watertight 3D meshes with regular geometry, capturing fine surface details, and is suitable for scaling to arbitrarily large scenes owing to the end-to-end processing in a distributed setting.

2026-05-28T17:53:26Z Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: https://citymesh3r.github.io/ Sayan Paul Sourav Ghosh Siddharth Katageri Soumyadip Maity Sanjana Sinha Brojeshwar Bhowmick http://arxiv.org/abs/2605.30294v1 RAFI -- A Ray/Work Forwarding Infrastructure for Data Parallel Multi-Node/Multi-GPU Computing 2026-05-28T17:42:03Z

We present RaFI, a CUDA and MPI based software framework that simplifies the task of building GPU-enabled data-parallel software where rays or similar work items need to migrate between different GPUs. RaFI provides a simple interface for CUDA kernels to forward such work items to other GPUs, while under the hood managing all the CUDA and MPI related work required to make this happen. We describe RaFI's motivation and implementation, and show its potential in several example applications.

2026-05-28T17:42:03Z Ingo Wald Serkan Demirci Alper Sahistan Stefan Zellmann Andrea Paris Patrick Moran Milan Jaros Tatiana von Landesberger Ugur Gudukbay Valerio Pascucci http://arxiv.org/abs/2512.12151v3 Robust and Efficient Penetration-Free Elastodynamics without Barriers 2026-05-28T17:21:03Z

We introduce a barrier-free optimization framework for non-penetration elastodynamic simulation that matches the robustness of Incremental Potential Contact (IPC) while overcoming its two primary efficiency bottlenecks: (1) reliance on logarithmic barrier functions to enforce non-penetration constraints, which leads to ill-conditioned systems and significantly slows down the convergence of iterative linear solvers; and (2) the time-of-impact (TOI) locking issue, which restricts active-set exploration in collision-intensive scenes and requires a large number of Newton iterations. We propose a novel second-order constrained optimization framework featuring a custom augmented Lagrangian solver that avoids TOI locking by immediately incorporating all requisite contact pairs detected via CCD, enabling more efficient active-set exploration and leading to significantly fewer Newton iterations. By adaptively updating Lagrange multipliers rather than increasing penalty stiffness, our method prevents stagnation at zero TOI while maintaining a well-conditioned system. We further introduce a constraint filtering and decay mechanism to keep the active set compact and stable, along with a theoretical justification of our method's finite-step termination and first-order time integration accuracy under a cumulative TOI-based termination criterion. A comprehensive set of experiments demonstrates the efficiency, robustness, and accuracy of our method. With a GPU-optimized simulator design, our method achieves up to a 103x speedup over GIPC on challenging, contact-rich benchmarks, scenarios that were previously tractable only with barrier-based methods. Our code and data are open-sourced at https://simulation-intelligence.github.io/barrier-free .
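
The idea of updating Lagrange multipliers while holding the penalty stiffness fixed can be seen in a textbook augmented Lagrangian loop on a one-dimensional toy problem: minimize 0.5*(x - target)^2 subject to x >= 0, where x = 0 stands for a floor. This is a generic sketch and not the paper's solver: it has no CCD, no constraint filtering, and no penetration-free guarantee on intermediate iterates, and the function names and the value of mu are assumptions.

```python
def minimize_inner(target, lam, mu):
    """Closed-form minimizer over x of
    0.5*(x - target)**2 + 0.5*mu*max(0, lam/mu - x)**2.
    """
    # Stationary point when the penalty term is active (x < lam/mu).
    x_active = (target + lam) / (1.0 + mu)
    if x_active < lam / mu:
        return x_active
    # Otherwise the constraint term vanishes and the objective alone decides.
    return target


def augmented_lagrangian(target, mu=10.0, iterations=20):
    """Solve the toy contact problem with a fixed penalty stiffness mu."""
    lam = 0.0
    x = target
    for _ in range(iterations):
        x = minimize_inner(target, lam, mu)
        # Multiplier update for the inequality x >= 0; mu is never increased.
        lam = max(0.0, lam - mu * x)
    return x, lam


# The rest position lies below the floor, so the contact must become active.
x_star, lam_star = augmented_lagrangian(target=-1.0)
```

For target = -1 the iterates approach x = 0 with multiplier 1, the contact force that balances the pull toward the target. The multiplier error shrinks by a factor of 1/(1 + mu) per outer iteration, so convergence does not require driving mu to large, ill-conditioned values.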

2025-12-13T03:17:12Z ACM Transactions on Graphics, 2026 (presentation at SIGGRAPH 2026) Juntian Zheng Zhaofeng Luo Minchen Li 10.1145/3811035 http://arxiv.org/abs/2605.30250v1 Ambient-robust Inverse Rendering using Active RGB-NIR Imaging 2026-05-28T17:13:41Z

Inverse rendering aims to reconstruct geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we introduce an ambient-robust inverse rendering method enabled by active RGB-NIR imaging. Our key insight is to leverage near-infrared (NIR) flash illumination, which is imperceptible to human observers, to obtain stable point-light shading that is largely invariant to ambient illumination. By using multi-view RGB images illuminated by ambient light and NIR images acquired with active NIR flash illumination, we reconstruct accurate geometry and reflectance by exploiting the complementary benefits of RGB and NIR images via a three-stage inverse rendering method. To enable dense multi-view acquisition, we develop an active imaging system equipped with an RGB-NIR camera and an NIR flash mounted on a mobile base. Using this system, we collect the first multi-view RGB-NIR inverse rendering dataset captured under multiple ambient illumination conditions. Experiments demonstrate that our method outperforms prior approaches, achieving accurate geometry and reflectance estimation across multiple ambient lighting scenarios.

2026-05-28T17:13:41Z 11 pages Hoon-Gyu Chung Jinnyeong Kim Hyunwoo Kang Seung-Hwan Baek http://arxiv.org/abs/2605.30396v1 Smaller and Faster 3DGS via Post-Training Dictionary Learning 2026-05-28T14:14:40Z

3D Gaussian Splatting (3DGS) is a promising neural scene representation for real-time rendering, but trained models often suffer from large memory footprints, limiting deployment on less powerful devices. Existing compression techniques often lead to architectures with several additional trainable parameters. While achieving outstanding compression ratios, they introduce noticeable drops in image quality. In this work, we introduce the first dictionary-learning-based compression framework for 3DGS. The proposed post-training compression pipeline can be deployed in virtually any 3DGS model without the need for re-training or modifications to existing 3DGS models. Our compression framework is straightforward to implement, yet provides significant compression capabilities, preserves image quality, and improves real-time rendering performance. Across 13 benchmark scenes, our approach achieves an average compression ratio of 3.95x, 3.10x, and 4.55x when applied to 3DGS, 3DGS-MCMC, and PixelGS, respectively. This yields consistent rendering speedups of 23.3%, 24.3%, and 25.3%, while maintaining image quality.
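
Why a dictionary representation can shrink a model is easiest to see by counting stored values. The sketch below is a generic accounting for sparse dictionary coding, not the paper's pipeline: the storage layout (one coefficient plus one atom index per nonzero) and every number in it are made-up assumptions, unrelated to the ratios reported above.

```python
def dense_values(num_items, dim):
    """Values stored when every item keeps its full attribute vector."""
    return num_items * dim


def dictionary_values(num_items, dim, num_atoms, nonzeros_per_item):
    """Values stored for a shared dictionary plus sparse codes.

    Assumes each nonzero costs two values: a coefficient and an atom index.
    """
    return num_atoms * dim + num_items * nonzeros_per_item * 2


def compression_ratio(num_items, dim, num_atoms, nonzeros_per_item):
    return dense_values(num_items, dim) / dictionary_values(
        num_items, dim, num_atoms, nonzeros_per_item)


# Hypothetical sizes: 1000 items, 32 attributes each, 64 atoms, 4 nonzeros.
stored = dictionary_values(num_items=1000, dim=32, num_atoms=64,
                           nonzeros_per_item=4)
ratio = compression_ratio(num_items=1000, dim=32, num_atoms=64,
                          nonzeros_per_item=4)
```

The dictionary cost is paid once, so the ratio improves as the number of items grows relative to the number of atoms, while the number of nonzeros per item sets the floor on per-item storage.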

2026-05-28T14:14:40Z Jiarong Gong Jonas Unger Ehsan Miandji http://arxiv.org/abs/2605.25975v2 F-RNG: Feed-Forward Relightable Neural Gaussians 2026-05-28T13:37:35Z

Capturing relightable 3D assets from real-world objects is a widely researched problem. Several per-scene optimization-based methods, based on 3D Gaussian splatting (3DGS), support relighting; however, they usually require dense input views, and their overfitting nature makes it difficult to generalize across scenes. Unlike per-scene optimization methods, generalized feed-forward models can directly reconstruct Gaussians from sparse input views. However, the resulting assets have baked-in illumination and cannot be easily used for relighting. In this paper, we present F-RNG, a feed-forward framework that directly generates relightable 3DGS assets from sparse-view inputs. Training such a model from scratch can require massive data and computing resources, and it is especially challenging to generate relightable assets in a feed-forward manner with acceptable cost. We develop F-RNG upon an existing large reconstruction model (LRM) to extract relightable representations, while also utilizing priors from an intrinsic decomposition model (IDM). Specifically, we first introduce a latent-interpolated fine-grained geometry synthesis to enhance the LRM's geometry representation. Second, we propose a prior-guided relightable appearance distillation to extract relightable neural representations by incorporating IDM priors. Finally, a universal neural renderer enables flexible and high-fidelity relighting. F-RNG requires neither re-training nor fine-tuning of the underlying LRMs, and can therefore automatically benefit from better LRMs and IDMs in the future. With only small networks that can be trained with affordable data and computational resources, F-RNG avoids the repetitive inference of large models under different light conditions. Compared with the state-of-the-art LRM-based relighting method, F-RNG achieves ~25x faster relighting, as well as superior quality (~+2.0 dB).

2026-05-25T15:48:57Z Guangming Fu Jiahui Fan Jian Yang Miloš Hašan Beibei Wang