http://arxiv.org/abs/2606.03506v1
AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization
2026-06-02T11:24:31Z
Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: https://larsph.github.io/avatarmix/
2026-06-02T11:24:31Z
CVPR 2026 Findings. 16 pages, including supplementary material
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 425-435
Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

http://arxiv.org/abs/2606.03479v1
PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting
2026-06-02T10:57:15Z
Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness.
We propose PersistGS, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by 2.46 dB PSNR and comes within 0.19 dB of a ground-truth trajectory upper bound.
2026-06-02T10:57:15Z
Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696
Adrian Ramlal, John S. Zelek

http://arxiv.org/abs/2605.16813v2
QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning
2026-06-02T02:18:35Z
The generation of production-ready quad-dominant meshes is a cornerstone of modern 3D content creation. Generating anisotropic quad-dominant meshes from point clouds is challenging, as existing methods are typically limited to producing either pure triangular meshes or pure quadrilateral meshes with isotropic densities.
In this paper, we present QuadLink, a unified framework consisting of three stages for quad-dominant mesh generation by linking points into structured faces. QuadLink formulates polygonal mesh generation as a hybrid centroid-conditioned vertex linking model: it first predicts a unified set of anchors (vertices and face centroids), then learns centroid-conditioned links that associate vertices with face centroids, and finally assembles polygonal faces with a quad-first strategy guided by robust geometric verification strategies. This link-based formulation enables efficient generation of sparse and anisotropic quad-dominant meshes with coherent edge flow while supporting hybrid polygonal topology. To construct training data for this model, we further introduce a Tri-to-Quad Operator that converts artistic triangle meshes into quad-dominant training data via global merge selection. Extensive experiments show that QuadLink produces production-ready quad-dominant meshes from point clouds and achieves improved geometric fidelity and topological quality compared to prior baselines. Our method natively supports hybrid polygonal topology, generalizing to arbitrary n-gon meshes without architectural changes.
2026-05-16T05:04:10Z
Yiheng Zhang, Zhe Zhu, Tingrui Shen, Zhuojiang Cai, Tianxiao Li, Zixing Zhao, Qiujie Dong, Zhiyang Dou, Jiepeng Wang, Le Wan, Yuwang Wang, Wenping Wang, Yuan Liu, Cheng Lin

http://arxiv.org/abs/2606.02226v1
Composable function systems as a general-purpose rendering framework
2026-06-01T13:24:23Z
Function systems exist as a natural language for the meshless creation and manipulation of complex objects while maintaining minimal memory on the Graphics Processing Unit (GPU) or Central Processing Unit (CPU). This paper proposes a new method for general-purpose (non-fractal) visualizations and simulations with function systems and introduces Quibble, a metaprogramming framework for composing such systems on the GPU.
We also discuss several core advantages of this method, including runtime performance, the creation of topologically non-trivial objects, and interoperability with other graphical algorithms. Beyond general-purpose imagery and animations, this method can also be used to give artists more control over in-between frames in low-framerate animations, controllably deform point clouds, and metaprogram difficult animation workflows.
2026-06-01T13:24:23Z
7 pages; 4 figures
James Schloss

http://arxiv.org/abs/2606.02153v1
Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances
2026-06-01T12:20:31Z
Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling.
Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.
2026-06-01T12:20:31Z
CVPR 2026 - Computer Vision and Pattern Recognition
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, pp. 7036-7046
Dominik Hollidt, Tommaso Bendinelli, Christian Holz

http://arxiv.org/abs/2606.01910v1
Single-Line Drawing Generation via Semantics-Driven Optimization
2026-06-01T08:46:22Z
Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending.
Our code and results are available at https://github.com/tanguymagne/SLDgen.
2026-06-01T08:46:22Z
18 pages, published in Computer Graphics Forum 2026
Tanguy Magne, Alexandre Binninger, Ruben Wiersma, Olga Sorkine-Hornung
10.1111/cgf.70502

http://arxiv.org/abs/2606.01891v1
MidSurfNet: Learnable Face Pairing and Interference Implicit Fields for Generalized Mid-surface Abstraction
2026-06-01T08:36:02Z
Mid-surface abstraction is essential for finite element analysis of thin-walled CAD models. Existing face pairing-based methods rely on handcrafted geometric heuristics, yet real-world industrial models frequently exhibit multi-wall-thickness regions, self-matching face configurations, and demand for non-center offset surfaces--scenarios where rule-based approaches consistently fail. We present MidSurfNet, a learning-augmented framework that addresses these limitations through two novel components: (1) a neural face pairing module that learns to predict face pair confidence from geometric and topological features, handling complex pairing scenarios beyond rule-based methods; and (2) an interference implicit field that represents mid-surfaces as the interference of two signed distance functions, enabling generalized offset control for flexible positioning in downstream CAE/FEA-oriented workflows. We construct a large-scale mid-surface dataset containing over 1,500 manually annotated CAD models. Experiments demonstrate that MidSurfNet achieves 87.32% face pairing accuracy and successfully handles multi-wall-thickness (61.90% completion) and self-matching (52.94% completion) scenarios that confound all existing methods.
Furthermore, MidSurfNet provides a learning-based approach to generalized mid-surface abstraction with arbitrary offset control for CAE-oriented applications.
2026-06-01T08:36:02Z
20 pages, 12 figures, 5 tables
Li Ye, Xinhang Zhou, Xingyu Yang, Ruofeng Tong, Hailong Li, Peng Du, Min Tang

http://arxiv.org/abs/2606.01702v1
KDH-CAD: Knowledge-data hybrid CAD learning under data scarcity
2026-06-01T05:11:54Z
Deep learning in computer-aided design (CAD) remains fundamentally constrained by the data scarcity challenge: authentic CAD data is difficult to collect at scale, while synthetic data may not faithfully reflect real design practice. Rather than pursuing ever-larger CAD datasets, this paper instead treats CAD learning as a knowledge completion and calibration problem. It introduces KDH-CAD, a knowledge-data hybrid framework that integrates pretrained knowledge in foundation models, structured domain knowledge from textbooks/tutorials, and a very small amount of labeled CAD data. Domain knowledge is used to elicit and complete CAD-relevant concepts that are weakly expressed or under-represented in pretrained foundation models, while labeled CAD data calibrates these concepts in the latent space to account for task-specific geometric variability, without fine-tuning the foundation model. Experiments on real-world mechanical part classification show that KDH-CAD achieves strong performance in low-data regimes, reaching 92.6% accuracy with only 250 training samples, 95.8% with 1,000 samples, and continuing to improve with additional data. This matches or exceeds state-of-the-art performance that typically requires an order of magnitude more data.
These results suggest that combining pretrained foundation models with structured domain knowledge can substantially reduce reliance on large-scale CAD datasets, providing a principled and practical direction for data-efficient CAD learning.
2026-06-01T05:11:54Z
18 pages
Ziqin Gao, Zhijie Yang, Qiang Zou

http://arxiv.org/abs/2602.04672v4
AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
2026-06-01T04:06:12Z
Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior methods frequently collapse.
By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.
2026-02-04T15:42:58Z
16 pages, SIGGRAPH 2026
Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen

http://arxiv.org/abs/2606.01590v1
Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis
2026-06-01T02:37:56Z
Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation.
Our website: https://streetnvs.github.io
2026-06-01T02:37:56Z
Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai, Jonathan Tremblay, Iro Armeni, Ehsan Adeli, Lior Yariv, Gordon Wetzstein

http://arxiv.org/abs/2606.01518v1
MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes
2026-06-01T00:42:31Z
Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation.
The code will be made publicly available upon acceptance.
2026-06-01T00:42:31Z
18 pages, 7 figures
Ye Tao, Yuxin Yao, Kendong Liu, Dapeng Wu, Junhui Hou

http://arxiv.org/abs/2502.08884v3
ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models
2026-05-31T18:34:53Z
We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abstractions. Our system accepts two forms of user-provided design intent: high-level text descriptions of functions to include in the output library and a small seed set of exemplar shapes. We discover a library of abstractions that matches this design intent with a guided LLM workflow that first proposes different ways of applying and implementing functions, and then validates that these functions are helpful in representing seed set shapes. To extend beyond the seed set, we develop library-specific recognition networks that map shapes (represented as primitives, voxels, or point clouds) to programs that use these newly discovered abstractions. Across multiple modeling domains (split by shape category), we find that LLMs, when thoughtfully combined with geometric reasoning, can be guided to author libraries of abstraction functions that generalize across shape distributions. Our framework takes a step towards realizing the long-standing shape analysis aspiration of discovering reusable, programmatic shape abstractions while exposing interpretable, semantically aligned interfaces. Our extensive evaluation demonstrates that ShapeLib provides distinct advantages over prior alternative abstraction discovery works in terms of generalization, usability, and maintaining plausibility under manipulation. Finally, we demonstrate that ShapeLib's abstraction functions unlock a number of downstream applications, combining LLM reasoning over shape programs with geometry processing tools to support shape editing and generation workflows.
2025-02-13T01:52:02Z
R. Kenny Jones, Paul Guerrero, Niloy J. Mitra, Daniel Ritchie

http://arxiv.org/abs/2606.01362v1
AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance
2026-05-31T17:33:14Z
Video generative models have achieved remarkable progress in synthesizing photorealistic video sequences. However, enabling broader and more creative downstream applications requires fine-grained instance-level video editing, including object insertion, object removal, and texture editing, which has emerged as a prominent yet challenging problem. Existing approaches either propose unified generative frameworks with only coarse semantic control, or design task-specific frameworks for individual editing tasks, limiting their flexibility and applicability across diverse real-world scenarios. To address these limitations, we propose AlbedoEdit, a unified generative video editing framework that jointly supports object insertion, object removal, and texture editing. Our key insight is that the intrinsic albedo map, which is invariant to lighting and contains no specularity, shadowing, or inter-reflection effects, provides an effective and user-friendly mechanism for specifying fine-grained appearance edits. Built upon video foundation models, AlbedoEdit is fine-tuned to translate source RGB videos into edited RGB videos, conditioned on a user-edited first-frame albedo. Trained on a new paired synthetic dataset covering all three editing tasks, AlbedoEdit implicitly learns to harmonize edited contents and simulate complex real-world visual effects triggered by editing operations, including specular highlights, soft shadows, and mirror reflections. AlbedoEdit demonstrates superior performance over state-of-the-art video editing approaches, both qualitatively and quantitatively.
Project webpage is https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/.
2026-05-31T17:33:14Z
Xilong Zhou, Bao-Huy Nguyen, Zheng Zeng, Jacob Munkberg, Jon Hasselgren, Thomas Leimkühler, Nima Kalantari, Miloš Hašan, Christian Theobalt

http://arxiv.org/abs/2606.01057v1
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code
2026-05-31T06:59:49Z
Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement.
We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.
2026-05-31T06:59:49Z
Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia, Meiqi Guo, Laurent Itti, Jindong Chen

http://arxiv.org/abs/2606.01031v1
Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation
2026-05-31T05:44:42Z
Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms.
Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
2026-05-31T05:44:42Z
Research report
Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao
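
The sequence-alignment idea in the last abstract above can be illustrated with a minimal Soft-DTW sketch. It uses the standard Soft-DTW recursion, in which the hard minimum over the three DTW predecessors is replaced by a smoothed minimum controlled by `gamma`. The squared Euclidean frame cost, the one-dimensional toy "feature trajectory", the one-frame delay, and the value of `gamma` are all assumptions chosen for illustration; none of them is taken from the paper's evaluation pipeline.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lo = min(finite)
    # Shifting by the smallest value keeps every exponent <= 0.
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in finite))


def frame_cost(u, v):
    """Squared Euclidean distance between two feature vectors (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW discrepancy between two sequences of feature vectors."""
    n, m = len(x), len(y)
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            predecessors = (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
            r[i][j] = frame_cost(x[i - 1], y[j - 1]) + soft_min(predecessors, gamma)
    return r[n][m]


def framewise(x, y):
    """Rigid alignment: frame t is compared only with frame t."""
    return sum(frame_cost(u, v) for u, v in zip(x, y))


# Toy trajectories (hypothetical data): the same motion, delayed by one frame.
reference = [[math.sin(0.5 * t)] for t in range(20)]
delayed = [reference[0]] + reference[:-1]

print("frame-wise:", framewise(reference, delayed))
print("soft-DTW:  ", soft_dtw(reference, delayed, gamma=0.01))
```

In this toy setting the frame-wise score charges every frame for the delay, while the aligned score only pays for the single frame that has no counterpart, so it comes out lower. Note that Soft-DTW can take negative values, so it is better read as a discrepancy than as a distance.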