http://arxiv.org/abs/2605.15597v1 CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage 2026-05-15T04:10:42Z

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and calibrated pose; COVER-produced indoor frames include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
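
The selection loop described above (score incremental coverage, penalize depth conflicts, pick greedily) can be illustrated with a minimal sketch. This is not the COVER implementation: the function name `greedy_select`, the representation of a view as a set of surface-cell ids, the per-view conflict score, and the linear penalty are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_select(
    candidates: Dict[str, Set[int]],
    conflicts: Dict[str, float],
    budget: int,
    penalty_weight: float = 1.0,
) -> List[str]:
    """Pick up to `budget` views by marginal coverage minus a conflict penalty.

    candidates: hypothetical map from view id to the surface-cell ids it sees.
    conflicts: hypothetical per-view depth-conflict score (higher is worse).
    """
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        best_id, best_score = None, 0.0
        for view_id in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = len(remaining[view_id] - covered)  # newly covered cells only
            score = gain - penalty_weight * conflicts.get(view_id, 0.0)
            if score > best_score:
                best_id, best_score = view_id, score
        if best_id is None:  # no view has a positive score, so stop early
            break
        covered |= remaining.pop(best_id)
        chosen.append(best_id)
    return chosen


# Toy example: view "d" sees every cell but carries a heavy conflict penalty.
views = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 2, 3, 4, 5, 6},
}
conflict = {"d": 5.0}
selected = greedy_select(views, conflict, budget=3)
print(selected)
```

In this toy setup the loop stops before exhausting the budget once no remaining view adds positively scored coverage, which mirrors the low-redundancy goal stated in the abstract.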

2026-05-15T04:10:42Z 35 pages including appendix. Code and dataset: https://github.com/Strange-animalss/CM-EVS Jiale Liu Jungang Li Jieming Yu Xinglin Yu Zihao Dongfang Zongjian Ding Kaifeng Ding Yi Yang Lidong Chen Yang Zou Shunwen Bai Jiahuan Zhang Haoran Huang Shan Huang Yudong Gao Mingjun Cheng http://arxiv.org/abs/2605.14716v2 AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Contro 2026-05-15T02:07:43Z

Sparse anchors provide a compact interface for human motion authoring: users specify a few root positions, planar trajectory samples, or body-point targets, while the system synthesizes the full-body motion that completes the under-specified intent. We present AnchorRoute, a sparse-anchor motion synthesis framework that uses anchors as a shared scaffold for both generation and refinement. Before generation, AnchorRoute converts sparse anchors into anchor-condition features and injects the resulting condition memory into a frozen Transition Masked Diffusion prior through AnchorKV and dual-context conditioning. This preserves the generation quality of the pretrained text-to-motion prior while learning sparse spatial control. After generation, the same anchors are evaluated as residuals: their timestamps define refinement intervals, and their residuals determine where correction should be concentrated. RouteSolver then refines the motion by projecting soft-token updates onto anchor-defined piecewise-affine interval bases. This couples generation-time anchor conditioning with residual-routed refinement under one anchor scaffold. AnchorRoute supports root-3D, planar-root, and body-point control within the same formulation. In benchmark evaluations, AnchorRoute outperforms prior sparse-control methods under the sparse keyjoint protocol and consistently improves anchor adherence across control families. The results show that the learned anchor-conditioned generator and RouteSolver refinement are complementary: the generator preserves text-motion quality, while RouteSolver provides a controllable path toward stronger anchor adherence.
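
To make the idea of anchor-defined piecewise-affine intervals concrete, here is a much-simplified sketch. It is not RouteSolver: instead of projecting soft-token updates onto interval bases, it only blends scalar anchor residuals linearly between neighbouring anchor timestamps. The function name and the scalar-residual setting are assumptions for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence


def piecewise_affine_correction(
    anchor_times: Sequence[int],
    anchor_residuals: Sequence[float],
    num_frames: int,
) -> List[float]:
    """Spread anchor residuals over frames with piecewise-affine weights.

    Frames between two anchors get a linear blend of the two residuals;
    frames outside the anchored range hold the nearest anchor's residual.
    """
    if len(anchor_times) != len(anchor_residuals) or not anchor_times:
        raise ValueError("need one residual per anchor and at least one anchor")
    if list(anchor_times) != sorted(set(anchor_times)):
        raise ValueError("anchor_times must be strictly increasing")
    correction: List[float] = []
    for t in range(num_frames):
        k = bisect_right(anchor_times, t)  # index of first anchor after frame t
        if k == 0:
            correction.append(anchor_residuals[0])
        elif k == len(anchor_times):
            correction.append(anchor_residuals[-1])
        else:
            t0, t1 = anchor_times[k - 1], anchor_times[k]
            w = (t - t0) / (t1 - t0)  # position inside the interval, in [0, 1)
            correction.append(
                (1 - w) * anchor_residuals[k - 1] + w * anchor_residuals[k]
            )
    return correction


# Three anchors at frames 0, 4, 8; only the middle anchor has a residual.
corr = piecewise_affine_correction([0, 4, 8], [0.0, 2.0, 0.0], num_frames=9)
print(corr)
```

The correction peaks at the anchor with the largest residual and decays linearly to zero at its neighbours, so the adjustment stays concentrated in the intervals adjacent to the poorly satisfied anchor.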

2026-05-14T11:36:18Z Pengcheng Fang Tengjiao Sun Dongjie Fu Xiaoyu Zhan Yanwen Guo Hansung Kim Xiaohao Cai http://arxiv.org/abs/2605.15398v1 3DEditSafe: Defending 3D Editing Pipelines from Unsafe Generation 2026-05-14T20:30:50Z

Recent advances in 3D generative editing, particularly pipelines based on 3D Gaussian Splatting (3DGS), have achieved high-fidelity, multi-view-consistent scene manipulation from text prompts. However, we find that these pipelines also introduce new safety risks when unsafe prompts produce edits that are propagated and optimized across views. In this work, we study unsafe generation in 3D editing pipelines and show that such behavior can lead to coherent, undesirable Not-Safe-For-Work (NSFW) content in the final 3D representation. To address this, we propose 3DEditSafe, a safety-regularized 3D editing framework that constrains unsafe semantic propagation during optimization. 3DEditSafe combines generation-stage safety guidance with rendered-view 3D safety regularization, safe semantic projection, residue suppression, and mask-aware preservation to steer optimization away from unsafe editing directions. We evaluate our approach on EditSplat scenes using an object-compatible unsafe prompt benchmark and show that 2D safety guidance alone is not consistently sufficient to prevent unsafe 3D edits. 3DEditSafe reduces unsafe semantic alignment and view-level attack success rates, while revealing a safety-quality tradeoff in which stronger unsafe suppression can introduce artifacts or reduce unsafe-prompt fidelity. To our knowledge, this work is the first attempt to study and defend against unsafe generation in text-driven 3D editing pipelines, highlighting the need for safety mechanisms that operate directly on optimized 3D representations.

2026-05-14T20:30:50Z Nicole Meng Zheyuan Liu Meng Jiang Yingjie Lao http://arxiv.org/abs/2605.15369v1 OffsetAxis: UDF Mesh Reconstruction via Offset-Volume Medial Axis Extraction 2026-05-14T19:49:15Z

Unsigned distance fields (UDFs) offer broader modeling capabilities than signed distance fields (SDFs), enabling the representation of shapes with open boundaries, non-manifold structures or mixed curve and surface parts. However, extracting coherent meshes from UDFs is fundamentally harder, as classical grid-based iso-surfacing techniques are not applicable since they require a way to distinguish the inside from the outside of the shape. We introduce OffsetAxis, a new UDF reconstruction pipeline that supports open, non-manifold, and curve-like geometries. Our key insight is that the 0-level set extraction problem can be restated as the extraction of the medial axis of the $α$-offset volume of the UDF. This formulation unlocks mature medial-axis machinery that naturally supports boundaries, non-manifold junctions and curves. To avoid the biases of grid-based techniques, we sample the $α$-offset surface using ray casting and optimize medial balls inside the offset volume with an efficient variant of Variational Medial Axis Sampling. The final mesh is recovered by taking the dual of the connectivity of the medial ball clusters, producing structurally coherent reconstructions for a wide range of topologies. The robustness and versatility of the approach allow it to handle imperfect distance fields, including neural UDFs trained on noisy inputs, the Quasi-Medial Distance Field (Q-MDF), as well as distances computed directly on triangle soups or point clouds. Extensive experiments demonstrate that our method produces more faithful mesh reconstruction and better alignment with the underlying shape structure than prior techniques.
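
The reformulation can be written compactly as follows. The symbols $d$, $S$, $V_\alpha$, and $\mathrm{MA}(\cdot)$ are notation assumed here for illustration and are not taken from the paper; the approximate equality is meant informally, for a sufficiently small offset.

```latex
% d : unsigned distance field,  S : target shape (zero level set of d)
S = \{\, x : d(x) = 0 \,\}, \qquad
V_\alpha = \{\, x : d(x) \le \alpha \,\}, \qquad
S \;\approx\; \mathrm{MA}(V_\alpha) \quad \text{for sufficiently small } \alpha > 0 .
```

As a simple 2D sanity check, if $S$ is a straight segment, then $V_\alpha$ is a stadium of radius $\alpha$ around it, and the medial axis of that stadium is the segment itself, open endpoints included.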

2026-05-14T19:49:15Z Qijia Huang Pierre Kraemer Dominique Bechmann http://arxiv.org/abs/2605.15368v1 Discretizing Group-Convolutional Neural Networks for 3D Geometry in Feature Space 2026-05-14T19:47:43Z

Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: In each linear layer, GCNNs sample a transformation group $G$ densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to $G$. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.

2026-05-14T19:47:43Z 11 pages, 7 figures, 2 tables Daniel Franzen Jean Philip Filling Michael Wand http://arxiv.org/abs/2605.15320v1 FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction 2026-05-14T18:33:49Z

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 dB in PSNR. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.
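
For a sense of scale, a PSNR difference can be translated into a ratio of mean squared errors using the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below does that arithmetic for a 5.5 dB gap. The baseline MSE value is hypothetical, and because benchmark PSNR is usually averaged per image, the resulting ratio is only indicative and not a figure reported by the paper.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(5.5)       # roughly a 3.5x reduction in MSE
baseline_mse = 1e-3                   # hypothetical baseline error
improved_mse = baseline_mse / ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
print(f"MSE ratio: {ratio:.3f}, recovered gain: {gain:.2f} dB")
```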

2026-05-14T18:33:49Z Project Page: https://ffavatar.github.io Thuan Hoang Nguyen Jiahao Luo Yinyu Nie Hao Li Gordon Guocheng Qian Jian Wang http://arxiv.org/abs/2605.15307v1 Sound Sparks Motion: Audio and Text Tuning for Video Editing 2026-05-14T18:20:50Z

Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/

2026-05-14T18:20:50Z Project Page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion AmirHossein Naghi Razlighi Aryan Mikaeili Ali Mahdavi-Amiri Daniel Cohen-Or Yiorgos Chrysanthou http://arxiv.org/abs/2605.15187v1 Articraft: An Agentic System for Scalable Articulated 3D Asset Generation 2026-05-14T17:59:18Z

A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

2026-05-14T17:59:18Z Project page: https://articraft3d.github.io/ Matt Zhou Ruining Li Xiaoyang Lyu Zhaomou Song Zhening Huang Chuanxia Zheng Christian Rupprecht Andrea Vedaldi Shangzhe Wu http://arxiv.org/abs/2601.16981v2 SyncLight: Single-Edit Multi-View Relighting 2026-05-14T17:35:56Z

We present SyncLight, a method to enable consistent, parametric control over light sources across multiple uncalibrated views of a static scene conditioned on a single view. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments -- curated from existing sources and newly designed scenes -- alongside high-fidelity, real-world multi-view captures under calibrated illumination. Though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.

2026-01-23T18:59:57Z Project page: http://sync-light.github.io David Serrano-Lozano Anand Bhattad Luis Herranz Jean-François Lalonde Javier Vazquez-Corral http://arxiv.org/abs/2605.14960v1 Meschers: Geometry Processing of Impossible Objects 2026-05-14T15:28:14Z

Impossible objects, geometric constructions that humans can perceive but that cannot exist in real life, have been a topic of intrigue in visual arts, perception, and graphics, yet no satisfying computer representation of such objects exists. Previous work embeds impossible objects in 3D, cutting them or twisting/bending them in the depth axis. Cutting an impossible object changes its local geometry at the cut, which can hamper downstream graphics applications, such as smoothing, while bending makes it difficult to relight the object. Both of these can invalidate geometry operations, such as distance computation. As an alternative, we introduce Meschers, meshes capable of representing impossible constructions akin to those found in M.C. Escher's woodcuts. Our representation has a theoretical foundation in discrete exterior calculus and supports the use-cases above, as we demonstrate in a number of example applications. Moreover, because we can do discrete geometry processing on our representation, we can inverse-render impossible objects. We also compare our representation to cut and bend representations of impossible objects.

2026-05-14T15:28:14Z ACM Trans. Graph. 44, 4, Article 70 (August 2025) Ana Dodik Isabella Yu Kartik Chandra Jonathan Ragan-Kelley Joshua Tenenbaum Vincent Sitzmann Justin Solomon 10.1145/3731422 http://arxiv.org/abs/2504.01571v2 Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation 2026-05-14T15:23:21Z

We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.

2025-04-02T10:16:19Z 17 pages, 15 figures, Computer Graphics Forum 2026 Journal Paper Aleksander Plocharski Jan Swidzinski Przemyslaw Musialski 10.1111/cgf.70487 http://arxiv.org/abs/2605.14880v1 Denoising-GS: Gaussian Splatting with Spatial-aware Denoising 2026-05-14T14:24:21Z

Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.

2026-05-14T14:24:21Z Qingyuan Zhou Xinyi Liu Weidong Yang Ning Wang Shuquan Ye Ben Fei Ying He Wanli Ouyang http://arxiv.org/abs/2605.14835v1 The Racial Character of Computer Graphics Research 2026-05-14T13:42:20Z

Computer graphics algorithms for generating photorealistic imagery are widely perceived to be universal, and capable of conjuring anything that a filmmaker or game designer can imagine. However, recent works have suggested that 3D algorithms for depicting synthetic humans are far from generic, and instead favor historically hegemonic characteristics. We present the first systematic review of human depiction in the top computer graphics conference and the journal of record (SIGGRAPH and ACM Transactions on Graphics) that confirms previous hypotheses. Algorithms that claim to be generically rendering "human skin" are in fact imagined and formulated for translucent, "high albedo" materials such as white skin. Algorithms claiming to apply generically to "human hair" are formulated for "rods", "wires" and "threads" which are analogous to straight hair. Our analysis reveals conceptual binarization, where algorithms for white skin are treated as computational substrate for "all" skin, imposing a hierarchical assumption that all skin descends from the math and physics of white skin. Hair algorithms follow a similar historical pattern, with the first examples of computer-generated Type 4 hair only appearing after the murder of George Floyd in 2020. We offer a new conceptual label, McDaniels Methods, for characterizing and critiquing computer graphics algorithms that reinforce racial hierarchy under a false cover of diversity. We also offer an inverse label, Durald Methods, for algorithms that were closely co-designed with the people being depicted. Our analysis points the way towards several neglected avenues for future research.

2026-05-14T13:42:20Z Theodore Kim Alexa Schor Julian Posada Alka V. Menon http://arxiv.org/abs/2605.14772v1 BioHuman: Learning Biomechanical Human Representations from Video 2026-05-14T12:36:53Z

Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

2026-05-14T12:36:53Z Yujun Huo He Zhang Chentao Song Honglin Song Zongyu Zuo Tao Yu http://arxiv.org/abs/2605.14731v1 UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars 2026-05-14T11:56:03Z

Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

2026-05-14T11:56:03Z Xiaoyu Zhan Xinyu Fu Chenghao Yang Xiaohong Zhang Dongjie Fu Pengcheng Fang Tengjiao Sun Xiaohao Cai Hansung Kim Yuanqi Li Jie Guo Yanwen Guo