https://arxiv.org/api/Wfi9z0p5tFRiz24WVniZ14lTsXw 2026-06-25T17:04:44Z 9383 1290 15 http://arxiv.org/abs/2511.12935v2 PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos 2025-11-18T05:47:59Z

We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from Outfit of the Day(OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48x speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.

2025-11-17T03:40:43Z Accepted by AAAI 2026 Dianbing Xi Guoyuan An Jingsen Zhu Zhijian Liu Yuan Liu Ruiyuan Zhang Jiayuan Lu Yuchi Huo Rui Wang http://arxiv.org/abs/2511.13988v1 B2F: End-to-End Body-to-Face Motion Generation with Style Reference 2025-11-17T23:46:58Z

Human motion naturally integrates body movements and facial expressions, forming a unified perception. If a virtual character's facial expression does not align well with its body movements, it may weaken the perception of the character as a cohesive whole. Motivated by this, we propose B2F, a model that generates facial motions aligned with body movements. B2F takes a facial style reference as input, generating facial animations that reflect the provided style while maintaining consistency with the associated body motion. To achieve this, B2F learns a disentangled representation of content and style, using alignment and consistency-based objectives. We represent style using discrete latent codes learned via the Gumbel-Softmax trick, enabling diverse expression generation with a structured latent representation. B2F outputs facial motion in the FLAME format, making it compatible with SMPL-X characters, and supports ARKit-style avatars through a dedicated conversion module. Our evaluations show that B2F generates expressive and engaging facial animations that synchronize with body movements and style intent, while mitigating perceptual dissonance from mismatched cues, and generalizing across diverse characters and styles.

2025-11-17T23:46:58Z Pacific Graphics 2025 Bokyung Jang Eunho Jung Yoonsang Lee 10.2312/pg.20251256 http://arxiv.org/abs/2506.08334v3 iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos 2025-11-17T18:31:53Z

Articulated objects are prevalent in daily life. Interactable digital twins of such objects have numerous applications in embodied AI and robotics. Unfortunately, current methods to digitize articulated real-world objects require carefully captured data, preventing practical, scalable, and generalizable acquisition. We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones. However, this setting is challenging due to simultaneous object and camera motion and significant occlusions as the person interacts with the object. To tackle these challenges, we introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a dataset of 784 videos containing 284 objects across 11 categories that is 20$\times$ larger than available in prior work. We then compare our approach with existing methods that also take video as input. Our experiments show that iTACO outperforms existing articulated object digital twin methods on both synthetic and real casually captured RGBD videos.

2025-06-10T01:41:46Z 3DV 2026 camera-ready version. Project website can be found at https://3dlg-hcvc.github.io/video2articulation/ Weikun Peng Jun Lv Cewu Lu Manolis Savva http://arxiv.org/abs/2507.15979v2 Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars 2025-11-17T13:55:41Z

We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.

2025-07-21T18:20:09Z Accepted to 3DV 2026 Marcel C. Bühler Ye Yuan Xueting Li Yangyi Huang Koki Nagano Umar Iqbal http://arxiv.org/abs/2509.16336v3 Neural Atlas Graphs for Dynamic Scene Decomposition and Editing 2025-11-17T13:07:04Z

Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes. Project Page: https://princeton-computational-imaging.github.io/nag/

2025-09-19T18:24:41Z Jan Philipp Schneider Pratik Singh Bisht Ilya Chugunov Andreas Kolb Michael Moeller Felix Heide http://arxiv.org/abs/2511.13247v1 Force-Aware 3D Contact Modeling for Stable Grasp Generation 2025-11-17T11:08:32Z

Contact-based grasp generation plays a crucial role in various applications. Recent methods typically focus on the geometric structure of objects, producing grasps with diverse hand poses and plausible contact points. However, these approaches often overlook the physical attributes of the grasp, specifically the contact force, leading to reduced stability of the grasp. In this paper, we focus on stable grasp generation using explicit contact force predictions. First, we define a force-aware contact representation by transforming the normal force value into discrete levels and encoding it using a one-hot vector. Next, we introduce force-aware stability constraints. We define the stability problem as an acceleration minimization task and explicitly relate stability with contact geometry by formulating the underlying physical constraints. Finally, we present a pose optimizer that systematically integrates our contact representation and stability constraints to enable stable grasp generation. We show that these constraints can help identify key contact points for stability which provide effective initialization and guidance for optimization towards a stable grasp. Experiments are carried out on two public benchmarks, showing that our method brings about 20% improvement in stability metrics and adapts well to novel objects.

2025-11-17T11:08:32Z AAAI'26 Camera Ready. Project page at https://chzh9311.github.io/force-aware-grasp-project/ Zhuo Chen Zhongqun Zhang Yihua Cheng Ales Leonardis Hyung Jin Chang http://arxiv.org/abs/2511.13009v1 TR-Gaussians: High-fidelity Real-time Rendering of Planar Transmission and Reflection with 3D Gaussian Splatting 2025-11-17T06:09:21Z

We propose Transmission-Reflection Gaussians (TR-Gaussians), a novel 3D-Gaussian-based representation for high-fidelity rendering of planar transmission and reflection, which are ubiquitous in indoor scenes. Our method combines 3D Gaussians with learnable reflection planes that explicitly model the glass planes with view-dependent reflectance strengths. Real scenes and transmission components are modeled by 3D Gaussians and the reflection components are modeled by the mirrored Gaussians with respect to the reflection plane. The transmission and reflection components are blended according to a Fresnel-based, view-dependent weighting scheme, allowing for faithful synthesis of complex appearance effects under varying viewpoints. To effectively optimize TR-Gaussians, we develop a multi-stage optimization framework incorporating color and geometry constraints and an opacity perturbation mechanism. Experiments on different datasets demonstrate that TR-Gaussians achieve real-time, high-fidelity novel view synthesis in scenes with planar transmission and reflection, and outperform state-of-the-art approaches both quantitatively and qualitatively.

2025-11-17T06:09:21Z 15 pages, 12 figures Yong Liu Keyang Ye Tianjia Shao Kun Zhou http://arxiv.org/abs/2511.12785v1 Lightweight Optimal-Transport Harmonization on Edge Devices 2025-11-16T21:29:46Z

Color harmonization adjusts the colors of an inserted object so that it perceptually matches the surrounding image, resulting in a seamless composite. The harmonization problem naturally arises in augmented reality (AR), yet harmonization algorithms are not currently integrated into AR pipelines because real-time solutions are scarce. In this work, we address color harmonization for AR by proposing a lightweight approach that supports on-device inference. For this, we leverage classical optimal transport theory by training a compact encoder to predict the Monge-Kantorovich transport map. We benchmark our MKL-Harmonizer algorithm against state-of-the-art methods and demonstrate that for real composite AR images our method achieves the best aggregated score. We release our dedicated AR dataset of composite images with pixel-accurate masks and data-gathering toolkit to support further data acquisition by researchers.

2025-11-16T21:29:46Z AAAI 2026, Oral Maria Larchenko Dmitry Guskov Alexander Lobashev Georgy Derevyanko http://arxiv.org/abs/2511.12507v1 Hierarchical Frequency-Decomposition Graph Neural Networks for Road Network Representation Learning 2025-11-16T08:48:02Z

Road networks are critical infrastructures underpinning intelligent transportation systems and their related applications. Effective representation learning of road networks remains challenging due to the complex interplay between spatial structures and frequency characteristics in traffic patterns. Existing graph neural networks for modeling road networks predominantly fall into two paradigms: spatial-based methods that capture local topology but tend to over-smooth representations, and spectral-based methods that analyze global frequency components but often overlook localized variations. This spatial-spectral misalignment limits their modeling capacity for road networks exhibiting both coarse global trends and fine-grained local fluctuations. To bridge this gap, we propose HiFiNet, a novel hierarchical frequency-decomposition graph neural network that unifies spatial and spectral modeling. HiFiNet constructs a multi-level hierarchy of virtual nodes to enable localized frequency analysis, and employs a decomposition-updating-reconstruction framework with a topology-aware graph transformer to separately model and fuse low- and high-frequency signals. Theoretically justified and empirically validated on multiple real-world datasets across four downstream tasks, HiFiNet demonstrates superior performance and generalization ability in capturing effective road network representations.

2025-11-16T08:48:02Z Jingtian Ma Jingyuan Wang Leong Hou U http://arxiv.org/abs/2511.12251v1 Locomotion in CAVE: Enhancing Immersion through Full-Body Motion 2025-11-15T15:16:14Z

Cave Automatic Virtual Environment (CAVE) is one of the virtual reality (VR) immersive devices currently used to present virtual environments. However, the locomotion methods in the CAVE are limited by unnatural interaction methods, severely hindering the user experience and immersion in the CAVE. We proposed a locomotion framework for CAVE environments aimed at enhancing the immersive locomotion experience through optimized human motion recognition technology. Firstly, we construct a four-sided display CAVE system, then through the dynamic method based on Perspective-n-Point to calibrate the camera, using the obtained camera intrinsics and extrinsic parameters, and an action recognition architecture to get the action category. At last, transform the action category to a graphical workstation that renders display effects on the screen. We designed a user study to validate the effectiveness of our method. Compared to the traditional methods, our method has significant improvements in realness and self-presence in the virtual environment, effectively reducing motion sickness.

2025-11-15T15:16:14Z Xiaohui Li Xiaolong Liu Zhongchen Shi Wei Chen Liang Xie Meng Gai Jun Cao Suxia Zhang Erwei Yin 10.2139/ssrn.5626635 http://arxiv.org/abs/2508.03457v3 READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation 2025-11-15T13:28:53Z

The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.

2025-08-05T13:57:03Z Project page: https://readportrait.github.io/READ/ Haotian Wang Yuzhe Weng Jun Du Haoran Xu Xiaoyan Wu Shan He Bing Yin Cong Liu Jianqing Gao Qingfeng Liu http://arxiv.org/abs/2509.19995v2 MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly 2025-11-14T05:52:48Z

Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles--substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.

2025-09-24T11:02:03Z Project is available at: https://xrvitd.github.io/MeshMosaic/index.html Rui Xu Tianyang Xue Qiujie Dong Le Wan Zhe Zhu Peng Li Zhiyang Dou Cheng Lin Shiqing Xin Yuan Liu Wenping Wang Taku Komura http://arxiv.org/abs/2505.23740v3 LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization 2025-11-13T10:26:57Z

Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.

2025-05-29T17:58:03Z Project Page: https://layerpeeler.github.io/ Ronghuan Wu Wanchao Su Jing Liao http://arxiv.org/abs/2511.09558v1 IFG: Internet-Scale Guidance for Functional Grasping Generation 2025-11-12T18:59:49Z

Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasping generation pipeline that understands local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the resulting data is distilled into a diffusion model that operates in real-time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based locally-aware force-closure, \our achieves high-performance semantic grasping without any manually collected training data. For visualizations of this please visit our website at https://ifgrasping.github.io/

2025-11-12T18:59:49Z Website at https://ifgrasping.github.io/ Ray Muxin Liu Mingxuan Li Kenneth Shaw Deepak Pathak http://arxiv.org/abs/2406.17582v2 Crafting Dynamic Virtual Activities with Advanced Multimodal Models 2025-11-12T17:16:09Z

In this paper, we investigate the use of multimodal large language models (MLLMs) for generating virtual activities, leveraging the integration of vision-language modalities to enable the interpretation of virtual environments. Our approach recognizes and abstracts key scene elements including scene layouts, semantic contexts, and object identities with MLLMs' multimodal reasoning capabilities. By correlating these abstractions with massive knowledge about human activities, MLLMs are capable of generating adaptive and contextually relevant virtual activities. We propose a structured framework to articulate abstract activity descriptions, emphasizing detailed multi-character interactions within virtual spaces. Utilizing the derived high-level contexts, our approach accurately positions virtual characters and ensures that their interactions and behaviors are realistically and contextually appropriate through strategic optimization. Experiment results demonstrate the effectiveness of our approach, providing a novel direction for enhancing the realism and context-awareness in simulated virtual environments.

2024-03-15T13:56:29Z C. Li, Q. Yan, M. Kim, Z. Li, Y. Xu and L. -F. Yu, "Crafting Dynamic Virtual Activities with Advanced Multimodal Models," 2025 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 120-130 Changyang Li Qingan Yan Minyoung Kim Zhan Li Yi Xu Lap-Fai Yu 10.1109/ISMAR67309.2025.00025