https://arxiv.org/api/lQdQzCIvN3L1wdYhtYX6y9SNobs 2026-06-13T12:41:31Z 9323 90 15 http://arxiv.org/abs/2606.03506v1 AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization 2026-06-02T11:24:31Z

Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: https://larsph.github.io/avatarmix/

2026-06-02T11:24:31Z CVPR 2026 Findings. 16 pages, including supplementary material Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 425-435 Zhaorong Wang Yoshihiro Kanamori Yuki Endo http://arxiv.org/abs/2606.03479v1 PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting 2026-06-02T10:57:15Z

Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness. We propose $\textbf{PersistGS}$, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by +2.46dB PSNR and comes within 0.19dB of a ground-truth trajectory upper bound.

2026-06-02T10:57:15Z Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696 Adrian Ramlal John S. Zelek http://arxiv.org/abs/2605.16813v2 QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning 2026-06-02T02:18:35Z

The generation of production-ready quad-dominant meshes is a cornerstone of modern 3D content creation. Generating anisotropic quad-dominant meshes from point clouds is challenging, as existing methods are typically limited to producing either pure triangular meshes or pure quadrilateral meshes with isotropic densities. In this paper, we present QuadLink, a unified framework consisting of three stages for quad-dominant mesh generation by linking points into structured faces. QuadLink formulates polygonal mesh generation as a hybrid centroid-conditioned vertex linking model: it first predicts a unified set of anchors (vertices and face centroids), then learns centroid-conditioned links that associate vertices with face centroids, and finally assembles polygonal faces with a quad-first strategy guided by robust geometric verification strategies. This link-based formulation enables efficient generation of sparse and anisotropic quad-dominant meshes with coherent edge flow and meanwhile supporting hybrid polygonal topology. To construct training data for this model, we further introduce a Tri-to-Quad Operator that converts artistic triangle meshes into quad-dominant training data via global merge selection. Extensive experiments show that QuadLink produces production-ready quad-dominant meshes from point clouds and achieves improved geometric fidelity and topological quality compared to prior baselines. Our method natively supports hybrid polygonal topology, generalizing to arbitrary n-gon meshes without architectural changes.

2026-05-16T05:04:10Z Yiheng Zhang Zhe Zhu Tingrui Shen Zhuojiang Cai Tianxiao Li Zixing Zhao Qiujie Dong Zhiyang Dou Jiepeng Wang Le Wan Yuwang Wang Wenping Wang Yuan Liu Cheng Lin http://arxiv.org/abs/2606.02226v1 Composable function systems as a general-purpose rendering framework 2026-06-01T13:24:23Z

Function systems exist as a natural language for the meshless creation and manipulation of complex objects while maintaining minimal memory on the Graphics Processing Unit (GPU) or Central Processing Unit (CPU). This paper proposes a new method for general-purpose (non-fractal) visualizations and simulations with function systems and introduces Quibble, a metaprogramming framework for composing such systems on the GPU. We also discuss several core advantages of this method including runtime performance, the creation of topologically non-trivial objects, and interoperability with other graphical algorithms. Beyond general-purpose imagery and animations, this method can also be used to give artists more control over in-between frames in low-framerate animations, controllably deform point clouds, and metaprogram difficult animation workflows.

2026-06-01T13:24:23Z 7 pages; 4 figures James Schloss http://arxiv.org/abs/2606.02153v1 Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances 2026-06-01T12:20:31Z

Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.

2026-06-01T12:20:31Z CVPR 2026 - Computer Vision and Pattern Recognition Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, pp. 7036-7046 Dominik Hollidt Tommaso Bendinelli Christian Holz http://arxiv.org/abs/2606.01910v1 Single-Line Drawing Generation via Semantics-Driven Optimization 2026-06-01T08:46:22Z

Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending. Our code and results are available at https://github.com/tanguymagne/SLDgen.

2026-06-01T08:46:22Z 18 pages, published in Computer Graphics Forum 2026 Tanguy Magne Alexandre Binninger Ruben Wiersma Olga Sorkine-Hornung 10.1111/cgf.70502 http://arxiv.org/abs/2606.01891v1 MidSurfNet: Learnable Face Pairing and Interference Implicit Fields for Generalized Mid-surface Abstraction 2026-06-01T08:36:02Z

Mid-surface abstraction is essential for finite element analysis of thin-walled CAD models. Existing face pairing-based methods rely on handcrafted geometric heuristics, yet real-world industrial models frequently exhibit multi-wall-thickness regions, self-matching face configurations, and demand for non-center offset surfaces--scenarios where rule-based approaches consistently fail. We present MidSurfNet, a learning-augmented framework that addresses these limitations through two novel components: (1) a neural face pairing module that learns to predict face pair confidence from geometric and topological features, handling complex pairing scenarios beyond rule-based methods; and (2) an interference implicit field that represents mid-surfaces as the interference of two signed distance functions, enabling generalized offset control for flexible positioning in downstream CAE/FEA-oriented workflows. We construct a large-scale mid-surface dataset containing over 1,500 manually annotated CAD models. Experiments demonstrate that MidSurfNet achieves 87.32% face pairing accuracy and successfully handles multi-wall-thickness (61.90% completion) and self-matching (52.94% completion) scenarios that confound all existing methods. Furthermore, MidSurfNet provides a learning-based approach to generalized mid-surface abstraction with arbitrary offset control for CAE-oriented applications.

2026-06-01T08:36:02Z 20 pages, 12 figures, 5 tables Li Ye Xinhang Zhou Xingyu Yang Ruofeng Tong Hailong Li Peng Du Min Tang http://arxiv.org/abs/2606.01702v1 KDH-CAD: Knowledge-data hybrid CAD learning under data scarcity 2026-06-01T05:11:54Z

Deep learning in computer-aided design (CAD) remains fundamentally constrained by the data scarcity challenge: authentic CAD data is difficult to collect at scale, while synthetic data may not faithfully reflect real design practice. Rather than pursuing ever-larger CAD datasets, this paper alternatively treats CAD learning as a knowledge completion and calibration problem. It introduces KDH-CAD, a knowledge-data hybrid framework that integrates pretrained knowledge in foundation models, structured domain knowledge from textbooks/tutorials, and a very small amount of labeled CAD data. Domain knowledge is used to elicit and complete CAD-relevant concepts that are weakly expressed or under-represented in pretrained foundation models, while labeled CAD data calibrates these concepts in the latent space to account for task-specific geometric variability, without fine-tuning the foundation model. Experiments on real-world mechanical part classification show that KDH-CAD achieves strong performance in low-data regimes, reaching 92.6\% accuracy with only 250 training samples, 95.8\% with 1,000 samples, and continuing to improve with additional data. This matches or exceeds state-of-the-art performance that typically requires an order of magnitude more data. These results suggest that combining pretrained foundation models with structured domain knowledge can substantially reduce reliance on large-scale CAD datasets, providing a principled and practical direction for data-efficient CAD learning.

2026-06-01T05:11:54Z 18 pages Ziqin Gao Zhijie Yang Qiang Zou http://arxiv.org/abs/2602.04672v4 AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation 2026-06-01T04:06:12Z

Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior arts frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.

2026-02-04T15:42:58Z 16 pages, SIGGRAPH 2026 Jin-Chuan Shi Binhong Ye Tao Liu Junzhe He Yangjinhui Xu Xiaoyang Liu Zeju Li Hao Chen Chunhua Shen http://arxiv.org/abs/2606.01590v1 Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis 2026-06-01T02:37:56Z

Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning, matches methods that rely on 10-100 times denser point clouds. We further show capabilities of synthesizing coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: https://streetnvs.github.io

2026-06-01T02:37:56Z Zhengfei Kuang Adam Sun Liyuan Zhu Tong Wu Shengqu Cai Jonathan Tremblay Iro Armeni Ehsan Adeli Lior Yariv Gordon Wetzstein http://arxiv.org/abs/2606.01518v1 MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes 2026-06-01T00:42:31Z

Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.

2026-06-01T00:42:31Z 18 pages, 7 figures Ye Tao Yuxin Yao Kendong Liu Dapeng Wu Junhui Hou http://arxiv.org/abs/2502.08884v3 ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models 2026-05-31T18:34:53Z

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abstractions. Our system accepts two forms of user-provided design intent: high-level text descriptions of functions to include in the output library and a small seed set of exemplar shapes. We discover a library of abstractions that matches this design intent with a guided LLM workflow that first proposes different ways of applying and implementing functions, and then validates these functions are helpful in representing seed set shapes. To extend beyond the seed set, we develop library-specific recognition networks that map shapes (represented as primitives, voxels, or point clouds) to programs that use these newly discovered abstractions. Across multiple modeling domains (split by shape category), we find that LLMs, when thoughtfully combined with geometric reasoning, can be guided to author libraries of abstraction functions that generalize across shape distributions. Our framework takes a step towards realizing the long-standing shape analysis aspiration of discovering reusable, programmatic shape abstractions while exposing interpretable, semantically aligned interfaces. Our extensive evaluation demonstrates that ShapeLib provides distinct advantages over prior alternative abstraction discovery works in terms of generalization, usability, and maintaining plausibility under manipulation. Finally, we demonstrate that ShapeLib's abstraction functions unlock a number of downstream applications, combining LLM reasoning over shape programs with geometry processing tools to support shape editing and generation workflows.

2025-02-13T01:52:02Z R. Kenny Jones Paul Guerrero Niloy J. Mitra Daniel Ritchie http://arxiv.org/abs/2606.01362v1 AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance 2026-05-31T17:33:14Z

Video generative models have achieved remarkable progress in synthesizing photorealistic video sequences. However, enabling broader and more creative downstream applications requires fine-grained instance-level video editing, including object insertion, object removal, and texture editing, which has emerged as a prominent yet challenging problem. Existing approaches either propose unified generative frameworks with only coarse semantic control, or design task-specific frameworks for individual editing tasks, limiting their flexibility and applicability across diverse real-world scenarios. To address these limitations, we propose AlbedoEdit, a unified generative video editing framework that jointly supports object insertion, object removal, and texture editing. Our key insight is that the intrinsic albedo map, which is invariant to lighting and contains no specularity, shadowing and inter-reflection effects, provides an effective and user-friendly mechanism for specifying fine-grained appearance edits. Built upon video foundation models, AlbedoEdit is fine-tuned to translate source RGB videos into edited RGB videos, conditioned on a user-edited first-frame albedo. Trained on a new paired synthetic dataset covering all three editing tasks, AlbedoEdit implicitly learns to harmonize edited contents and simulate complex real-world visual effects triggered by editing operations, including specular highlights, soft shadows, and mirror reflections. AlbedoEdit demonstrates superior performance over state-of-the-art video editing approaches, both qualitatively and quantitatively. Project webpage is https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/.

2026-05-31T17:33:14Z Xilong Zhou Bao-Huy Nguyen Zheng Zeng Jacob Munkberg Jon Hasselgren Thomas Leimkühler Nima Kalantari Miloš Hašan Christian Theobalt http://arxiv.org/abs/2606.01057v1 3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code 2026-05-31T06:59:49Z

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

2026-05-31T06:59:49Z Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix Yipeng Gao Lei Shu Genzhi Ye Xi Xiong Ameesh Makadia Meiqi Guo Laurent Itti Jindong Chen http://arxiv.org/abs/2606.01031v1 Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation 2026-05-31T05:44:42Z

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.

2026-05-31T05:44:42Z Research report Zhicheng Zhang Lei Wang Yu Zhang Yongsheng Gao