http://arxiv.org/abs/2605.20185v2 PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars 2026-05-20T11:18:31Z

Existing Gaussian avatar methods typically parameterize geometry on a body-template surface, which entangles the avatar's representation space with the template's deformation space and limits the capture of layered, off-body, and non-rigid clothing geometry. We present PiG-Avatar, which addresses this limitation by using the parametric body model solely for kinematic transport, while representing the avatar as Gaussians anchored in a volumetric canonical space governed by a continuous neural field. This decouples representation from template topology, avoiding the geometric constraints of surface-based parameterizations. Kinematic coherence is maintained through 3D barycentric anchor transport, which guides motion without constraining geometry and allows anchors to deviate freely from the template surface, yielding dense, stable temporal surface correspondences by construction. To make this unconstrained formulation tractable, we introduce dual-level spatially coherent optimization, combining Sobolev-preconditioned neural-field updates with a novel KNN-based preconditioning of canonical anchor geometry. Together, these mechanisms induce an emergent self-organization of anchor density: anchors migrate toward regions of high curvature, appearance variation, and non-coherent motion without explicit heuristics. As a result, complex clothing geometry and layered surfaces emerge as natural, high-fidelity outputs. This single representation further supports hierarchical reconstruction across multiple levels of detail, with coarse-level supervision propagating to finer levels through the shared field and coupled anchor graph. On established benchmarks featuring subjects with complex clothing and challenging non-rigid motion, PiG-Avatar achieves state-of-the-art rendering quality, generalizes robustly to imperfect body model initialization, and renders in real time across all detail levels.
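As a rough illustration of barycentric anchor transport, the sketch below expresses an anchor in the barycentric coordinates of one canonical tetrahedron and re-evaluates those coordinates on the posed tetrahedron. The abstract does not specify the volumetric structure or the function names, so the tetrahedral cell, `barycentric_coords`, and `transport` are assumptions made for illustration. Because the weights may be negative, the anchor can sit outside the cell, which is one way to read "deviate freely from the template surface".

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def sub(p, q):
    """Component-wise difference p - q of two 3D points."""
    return [p[k] - q[k] for k in range(3)]


def as_columns(c1, c2, c3):
    """Build a 3x3 matrix (list of rows) whose columns are c1, c2, c3."""
    return [[c1[k], c2[k], c3[k]] for k in range(3)]


def barycentric_coords(point, tet):
    """Barycentric weights of `point` with respect to a tetrahedron.

    Solves point - v0 = w1*(v1-v0) + w2*(v2-v0) + w3*(v3-v0) with Cramer's
    rule. The four weights sum to 1 and may be negative when the point lies
    outside the tetrahedron.
    """
    v0, v1, v2, v3 = tet
    e1, e2, e3 = sub(v1, v0), sub(v2, v0), sub(v3, v0)
    r = sub(point, v0)
    d = det3(as_columns(e1, e2, e3))
    if abs(d) < 1e-12:
        raise ValueError("degenerate tetrahedron")
    w1 = det3(as_columns(r, e2, e3)) / d
    w2 = det3(as_columns(e1, r, e3)) / d
    w3 = det3(as_columns(e1, e2, r)) / d
    return [1.0 - w1 - w2 - w3, w1, w2, w3]


def transport(weights, posed_tet):
    """Re-evaluate fixed barycentric weights on the posed tetrahedron."""
    return [sum(w * v[k] for w, v in zip(weights, posed_tet)) for k in range(3)]


# Hypothetical canonical cell and an anchor that lies outside of it.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
anchor = (0.5, 0.5, 1.5)
weights = barycentric_coords(anchor, canonical)

# Pose the cell by a rigid translation; the anchor follows the same motion.
posed = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in canonical]
print(transport(weights, posed))
```

The weights are computed once in canonical space and reused for every pose, so the same anchor maps to corresponding points across frames.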

2026-05-19T17:59:54Z Julian Kaltheuner Jan Spindler Sina Kitz Patrick Stotko Reinhard Klein http://arxiv.org/abs/2605.20941v1 PaintCopilot: Modeling Painting as Autonomous Artistic Continuation 2026-05-20T09:27:06Z

We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.

2026-05-20T09:27:06Z Yunge Wen Yuancheng Shen Paul Pu Liang http://arxiv.org/abs/2605.04128v2 JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation 2026-05-20T08:56:54Z

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

2026-05-05T15:49:47Z Code: https://github.com/jd-opensource/JoyAI-Image Lin Song Wenbo Li Guoqing Ma Wei Tang Bo Wang Yuan Zhang Yijun Yang Yicheng Xiao Jianhui Liu Yanbing Zhang Guohui Zhang Wenhu Zhang Hang Xu Nan Jiang Xin Han Haoze Sun Maoquan Zhang Haoyang Huang Nan Duan http://arxiv.org/abs/2603.27309v2 MeshTailor: Cutting Seams via Generative Mesh Traversal 2026-05-20T08:12:01Z

We present MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. We introduce ChainingSeams, a hierarchical serialization of the seam graph that orders chains from global structural cuts down to local details in a coarse-to-fine manner, and a dual-stream encoder that fuses topological and geometric context. Leveraging this hierarchical representation and dual-stream vertex embeddings, our MeshTailor Transformer utilizes an autoregressive pointer layer to trace seams vertex-by-vertex within local neighborhoods. Extensive evaluations show that MeshTailor produces more coherent and structurally regular seam layouts compared to recent optimization-based and learning-based baselines.

2026-03-28T15:30:24Z Xueqi Ma Xingguang Yan Congyue Zhang Hui Huang http://arxiv.org/abs/2605.20872v1 CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation 2026-05-20T08:08:39Z

Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving comparable overall perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.
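The contrast between magnitude-based accumulation and a first-moment estimate can be sketched on scalar gradients. This is a toy under stated assumptions, not CAdam's actual rule: the exponential moving averages, the bias correction, the SNR definition `|m| / sqrt(v - m^2)`, and the name `moment_stats` are all illustrative choices, and the quantile-based context and gating thresholds mentioned in the abstract are omitted.

```python
def moment_stats(grads, beta=0.9, eps=1e-8):
    """Summarize a non-empty stream of scalar gradients.

    Returns the mean absolute gradient (magnitude-based accumulation), the
    bias-corrected first moment (signed EMA), and an SNR estimate defined
    here as |first moment| / standard deviation.
    """
    if not grads:
        raise ValueError("grads must be non-empty")
    m = 0.0      # EMA of gradients (first moment)
    v = 0.0      # EMA of squared gradients (second raw moment)
    mag = 0.0    # running sum of |g|
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        v = beta * v + (1.0 - beta) * g * g
        mag += abs(g)
    correction = 1.0 - beta ** len(grads)
    m_hat = m / correction
    v_hat = v / correction
    variance = max(v_hat - m_hat * m_hat, 0.0)
    return {
        "mean_magnitude": mag / len(grads),
        "first_moment": m_hat,
        "snr": abs(m_hat) / (variance ** 0.5 + eps),
    }


# Sign-flipping "noise" with large magnitude versus a small consistent drift.
noise = [1.0, -1.0] * 100
drift = [0.3, 0.1] * 100

print(moment_stats(noise))
print(moment_stats(drift))
```

In this toy, the noise stream has the larger mean magnitude, so a magnitude threshold would favor it. Its signed first moment nearly cancels, while the drift stream keeps a first moment near 0.2 and a much higher SNR.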

2026-05-20T08:08:39Z Accepted to SIGGRAPH 2026 Conference Papers. 12 pages, 8 figures SeungJeh Chung Geonho Park Misong Kim HyeongYeop Kang 10.1145/3799902.3811215 http://arxiv.org/abs/2605.14382v3 Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation 2026-05-20T08:07:54Z

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

2026-05-14T05:06:57Z Yuheng Wu Xiangbo Gao Tianhao Chen Xinghao Chen Qing Yin Zhengzhong Tu Dongman Lee http://arxiv.org/abs/2205.13524v4 PREF: Phasorial Embedding Fields for Compact Neural Representations 2026-05-20T07:43:28Z

We present an efficient frequency-based neural representation termed PREF: a shallow MLP augmented with a phasor volume that covers significantly broader spectra than previous Fourier feature mapping or Positional Encoding. At the core is our compact 3D phasor volume where frequencies distribute uniformly along a 2D plane and dilate along a 1D axis. To this end, we develop a tailored and efficient Fourier transform that combines the Fast Fourier Transform and local interpolation to accelerate naïve Fourier mapping. We also introduce a Parseval regularizer that stabilizes frequency-based learning. In these ways, our PREF reduces the costly MLP in the frequency-based representation, thereby significantly closing the efficiency gap between it and other hybrid representations, and improving its interpretability. Comprehensive experiments demonstrate that our PREF is able to capture high-frequency details while remaining compact and robust, including 2D image generalization, 3D signed distance function regression and 5D neural radiance field reconstruction.

2022-05-26T17:43:03Z Binbin Huang Xinhao Yan Anpei Chen Shenghua Gao Jingyi Yu http://arxiv.org/abs/2506.07209v2 HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance 2026-05-19T16:26:07Z

We present HOI-PAGE, a new approach that prioritizes part-level affordance reasoning to generate high-fidelity 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion. In contrast to prior works that focus on global, whole body-object motion synthesis, our approach explicitly reasons about the underlying part-level mechanics of interactions using large language models (LLMs). We capture this reasoning in a structured part affordance graph (PAG) representation, serving as a high-level interaction scaffolding to guide a three-stage synthesis: first, decomposing input 3D objects into semantic parts; then, generating reference HOI videos from text prompts to extract part-based motion constraints; and finally, optimizing for 4D HOI motion sequences that mimic the reference dynamics while satisfying part-level contact constraints. Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.

2025-06-08T16:15:39Z ICML 2026. Project page: https://craigleili.github.io/projects/hoipage/ Video: https://www.youtube.com/watch?v=gwXjOffCFyk Lei Li Angela Dai http://arxiv.org/abs/2605.19889v1 GLUT: 3D Gaussian Lookup Table for Continuous Color Transformation 2026-05-19T14:17:44Z

3D Lookup Tables (3D LUTs) are widely used for color mapping, but their grid-based representation requires discretizing the RGB space, leading to a capacity-memory trade-off that becomes prohibitive when storing large numbers of LUTs. Recent approaches adopt implicit neural representations to improve scalability, yet their black-box nature limits interpretability and hinders intuitive, localized editing. In this paper, we propose Gaussian LUT (GLUT), a continuous and explicit color representation that models color transformations using a set of learnable 3D Gaussian primitives. By avoiding fixed-resolution grids, GLUT achieves flexible representational capacity while maintaining a compact memory footprint. Its explicit, spatially localized formulation further enables both accurate modeling and interpretability. Building on this representation, we introduce a compact conditional generator (CGLUT) that predicts GLUT parameters for multiple LUT instances, encoding diverse color styles in a single framework to enable smooth and controllable LUT style blending. Moreover, GLUT supports efficient, user-friendly editing by allowing localized adjustments to specific color regions without global retraining. Experimental results demonstrate that our approach outperforms prior neural LUT representations in both accuracy and efficiency, while offering improved interpretability and interactive control.
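One way to picture a color transform built from Gaussian primitives is as a sum of localized color offsets. The sketch below assumes isotropic Gaussians and an additive residual form; the abstract does not give GLUT's parameterization, so `apply_glut`, the `(center, sigma, offset)` tuple layout, and the clamping to [0, 1] are illustrative assumptions.

```python
import math


def gaussian_weight(rgb, center, sigma):
    """Unnormalized isotropic Gaussian response of `rgb` around `center`."""
    d2 = sum((a - b) ** 2 for a, b in zip(rgb, center))
    return math.exp(-0.5 * d2 / (sigma * sigma))


def apply_glut(rgb, primitives):
    """Map an RGB triple in [0, 1] through a set of Gaussian primitives.

    Each primitive is (center, sigma, offset): the color offset is added in
    proportion to the Gaussian response at the input color, so a primitive
    only affects colors close to its center.
    """
    out = list(rgb)
    for center, sigma, offset in primitives:
        w = gaussian_weight(rgb, center, sigma)
        for k in range(3):
            out[k] += w * offset[k]
    return [min(1.0, max(0.0, c)) for c in out]


# A single hypothetical primitive that pushes saturated reds toward orange.
primitives = [((0.9, 0.1, 0.1), 0.1, (0.0, 0.2, 0.0))]

print(apply_glut((0.9, 0.1, 0.1), primitives))  # red input: green channel raised
print(apply_glut((0.1, 0.1, 0.9), primitives))  # blue input: effectively unchanged
```

Editing or adding one primitive changes the mapping only near its center, which matches the localized-editing property the abstract describes; a fixed grid would instead require touching every affected lattice cell.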

2026-05-19T14:17:44Z Project page: https://color.cvc.uab.cat/glut/ Danna Xue David Serrano-Lozano Shaolin Su Javier Vazquez-Corral http://arxiv.org/abs/2404.07106v2 3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion 2026-05-19T13:59:07Z

Point cloud completion aims to generate a complete and high-fidelity point cloud from an initially incomplete and low-quality input. A prevalent strategy involves leveraging Transformer-based models to encode global features and facilitate the reconstruction process. However, the adoption of pooling operations to obtain global feature representations often results in the loss of local details within the point cloud. Moreover, the attention mechanism inherent in Transformers introduces additional computational complexity, rendering it challenging to handle long sequences effectively. To address these issues, we propose 3DMambaComplete, a point cloud completion network built on the novel Mamba framework. It comprises three modules: HyperPoint Generation encodes point cloud features using Mamba's selection mechanism and predicts a set of HyperPoints. A specific offset is estimated, and the down-sampled points become HyperPoints. The HyperPoint Spread module disperses these HyperPoints across different spatial locations to avoid concentration. Finally, a deformation method transforms the 2D mesh representation of HyperPoints into a fine-grained 3D structure for point cloud reconstruction. Extensive experiments conducted on various established benchmarks demonstrate that 3DMambaComplete surpasses state-of-the-art point cloud completion methods, as confirmed by qualitative and quantitative analyses.

2024-04-10T15:45:03Z 24 pages, 14 figures, 10 tables Yixuan Li Weidong Yang Ben Fei http://arxiv.org/abs/2404.01063v2 Chat Modeling: Interaction-Enhanced Agent Framework for Visualizing Literature-Grounded Biological Structures 2026-05-19T12:38:34Z

Bioscientists frequently seek to visualize the biological systems they have empirically characterized and reported in the literature. Realizing such visualizations requires biological structure modeling, an inherently complex process that demands both biological and geometric understanding. This paper addresses the problem of constructing such 3D models for visualization. We introduce a novel agent framework that mitigates the challenges of operating 3D modeling software by transforming user inputs, including natural language descriptions, research publication content, and textual descriptions of the existing objects and structures in the current scene, into modeling operations in a structured JSON format and final 3D results. The major technical contribution lies in the collaborative agent design that simultaneously supports model planning, execution, and novel user interaction design, such as interactive modeling execution and dynamic widget generation that fuse text and mouse interaction within the chat window. The framework further incorporates a customized modeling memory to enhance user interaction, featuring components such as personalized memory management, feedback collection, and skill library design. This modeling memory is leveraged to enable improved 3D modeling performance over time. The quantitative evaluation on our collected dataset showcases the effectiveness of our framework. We also develop a prototype tool, Chat Modeling, and demonstrate its usage through two modeling case studies. Our user study and expert interviews highlight the potential of our approach for use in scientific workflows.

2024-04-01T11:53:39Z Donggang Jia Yunhai Wang Ivan Viola http://arxiv.org/abs/2605.19737v1 Decentralized Direct Volume Rendering: A Browser-Native GPU Architecture for MRI Digital Twins in Resource-Constrained Settings 2026-05-19T12:09:18Z

Digital Twin (DT) technology holds immense potential for surgical planning and personalized medicine. However, generating interactive, patient-specific anatomical twins currently relies on computationally heavy Server-Side Rendering (SSR) or expensive local workstations, creating significant barriers to deployment, especially in resource-constrained settings (RCS). This paper presents a decentralized, client-side WebGPU architecture that democratizes access to high-fidelity anatomical Digital Twins. By bypassing standard server-side rendering pipelines, the framework executes deterministic single-pass raymarching and morphological gradient calculations directly on low-cost integrated edge GPUs. Eliminating the network latency inherent to cloud-rendered solutions, the system achieves a Time to First Pixel (TTFP) of under 920.0 ms and maintains stable interactivity at >= 82.0 FPS. Continuous Interaction Fidelity is maintained via uniform buffers, enabling zero-latency manipulation of tissue parameters for dynamic clinical decision-making. By proving that complex 3D medical simulations of patient-specific MRI scans can be executed natively in the browser without deep learning or external computational dependencies, this architecture provides a scalable, affordable foundation for the widespread clinical adoption of healthcare Digital Twins.
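The core loop of single-pass raymarching can be sketched on the CPU as front-to-back compositing along one ray. This is a minimal illustration, not the paper's WebGPU shader: the linear transfer function, the parameters `threshold` and `opacity_scale` (standing in for values a renderer might pass through a uniform buffer), and the early-termination cutoff are assumptions.

```python
def raymarch(samples, threshold=0.0, opacity_scale=1.0, max_alpha=0.99):
    """Composite scalar intensities sampled front to back along one ray.

    Each sample's opacity comes from a simple linear transfer function,
    clamp((s - threshold) * opacity_scale, 0, 1). Accumulation stops once
    the ray is nearly opaque (early ray termination).
    Returns (accumulated intensity, accumulated opacity).
    """
    color = 0.0
    alpha = 0.0
    for s in samples:
        a = min(1.0, max(0.0, (s - threshold) * opacity_scale))
        # Standard front-to-back "over" compositing.
        color += (1.0 - alpha) * a * s
        alpha += (1.0 - alpha) * a
        if alpha >= max_alpha:
            break
    return color, alpha


# Intensities along a hypothetical ray through an MRI volume.
print(raymarch([0.0, 0.5, 1.0]))                 # (0.75, 1.0)
print(raymarch([0.0, 0.5, 1.0], threshold=0.6))  # only the last sample contributes
```

Changing `threshold` or `opacity_scale` alters the image without touching the volume data, which is why such parameters are cheap to update interactively.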

2026-05-19T12:09:18Z 10 pages, 4 figures. Live interactive browser demo available at: https://webgpu-mri.vercel.app/ . Source code repository: https://github.com/Bahdmanbabzo/webgpu-mri Oserebameh Augustine Beckley http://arxiv.org/abs/2601.20308v2 Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion 2026-05-19T10:19:52Z

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.

2026-01-28T06:59:55Z 12 pages, 9 figures Shuoyan Wei Feng Li Chen Zhou Runmin Cong Yao Zhao Huihui Bai http://arxiv.org/abs/2605.15497v2 AnyAct: Towards Human Reenactment of Character Motion From Video 2026-05-19T10:09:02Z

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

2026-05-15T00:23:36Z 12 pages Liuhan Chen Lei Zhong Jiewei Wang Qin Shuai Li Yuan Leidong Fan Qing Li Kanglin Liu http://arxiv.org/abs/2412.13111v2 Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations 2026-05-19T09:35:40Z

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation. Our code is publicly available at https://zju3dv.github.io/Motion-2-to-3/.

2024-12-17T17:34:52Z Project page: https://zju3dv.github.io/Motion-2-to-3/ 2025 IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 2025, pp. 14305-14316 Ruoxi Guo Huaijin Pi Zehong Shen Qing Shuai Zechen Hu Zhumei Wang Yajiao Dong Ruizhen Hu Taku Komura Sida Peng Xiaowei Zhou 10.1109/ICCV51701.2025.01327