http://arxiv.org/abs/2604.07959v2 Seeing enough: non-reference perceptual resolution selection for power-efficient client-side rendering 2026-05-09T14:57:09Z

Many client-side applications, especially games, render video at high resolution and frame rate on power-constrained devices, even when users perceive little or no benefit from all those extra pixels. Existing perceptual video quality metrics can indicate when a lower resolution is "good enough", but they are full-reference and computationally expensive, making them impractical for real-world applications and deployment on-device. In this work, we leverage the spatio-temporal limits of the human visual system and propose a non-reference method that predicts, from the rendered video alone, the lowest resolution that remains perceptually indistinguishable from the best available option, enabling power-efficient client-side rendering. Our approach is codec-agnostic and requires only minimal modifications to existing infrastructure. The network is trained on a large dataset of rendered content labeled with a full-reference perceptual video quality metric. The prediction significantly enhances perceptual quality while substantially reducing computational costs, suggesting a practical path toward perception-guided, power-efficient client-side rendering.
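
The selection rule described above ("the lowest resolution that remains perceptually indistinguishable from the best available option") can be illustrated with a minimal sketch. It assumes a quality score per candidate resolution is already available, for example from a full-reference metric used for labeling. The function name, the score values, and the tolerance are hypothetical and are not taken from the paper.

```python
from typing import Dict


def lowest_indistinguishable_resolution(
    quality_by_height: Dict[int, float], tolerance: float
) -> int:
    """Return the smallest resolution whose score is within `tolerance`
    of the score of the highest available resolution."""
    if not quality_by_height:
        raise ValueError("need at least one candidate resolution")
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    # Assumption: the "best available option" is the highest resolution.
    best_height = max(quality_by_height)
    reference_score = quality_by_height[best_height]
    for height in sorted(quality_by_height):
        if reference_score - quality_by_height[height] <= tolerance:
            return height
    return best_height  # unreachable for tolerance >= 0, kept for safety


# Hypothetical per-resolution quality scores (higher is better).
scores = {540: 8.1, 720: 9.2, 1080: 9.7, 1440: 9.9}
chosen = lowest_indistinguishable_resolution(scores, tolerance=0.25)
```

With these made-up scores, 1080 is the lowest resolution within 0.25 of the 1440 score. In the no-reference setting the abstract describes, a network would predict this choice directly from the rendered video rather than from the score table.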

2026-04-09T08:22:17Z Withdrawn to complete standard internal institutional regulatory clearance processes prior to publication. Yaru Liu Dayllon Vinícius Xavier Lemos Ali Bozorgian Chengxi Zeng Alexander Hepburn Arnau Raventos http://arxiv.org/abs/2605.10995v1 Streaming of rendered content with adaptive frame rate and resolution 2026-05-09T14:48:10Z

Streaming rendered content is an attractive way to bring high-quality graphics to billions of mobile devices that do not have sufficient rendering power. Existing solutions render content on a server at a fixed frame rate, typically 30 or 60 frames per second, and reduce resolution when bandwidth is restricted. However, this strategy leads to suboptimal rendering quality under the bandwidth constraints. In this work, we exploit the spatio-temporal limits of the human visual system to improve perceived quality while reducing rendering costs by adaptively adjusting both frame rate and resolution based on scene content and motion. Our approach is codec-agnostic and requires only minimal modifications to existing rendering infrastructure. We propose a system in which a lightweight neural network predicts the optimal combination of frame rate and resolution for a given transmission bandwidth, content, and motion velocity. This prediction significantly enhances perceptual quality while minimizing computational cost under bandwidth constraints. The network is trained on a large dataset of rendered content labeled with a perceptual video quality metric. The dataset and further information can be found at the project web page: https://www.cl.cam.ac.uk/research/rainbow/projects/adaptive_streaming/.
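
As a rough sketch of the decision step described above, the code below picks the frame rate and resolution pair with the highest predicted quality for a given bandwidth and motion velocity. The `toy_predictor` function is only a stand-in for the lightweight neural network. Its formula, the constants, and all names are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Config = Tuple[int, int]  # (frames per second, vertical resolution in pixels)
Predictor = Callable[[int, int, float, float], float]


def pick_config(
    frame_rates: Sequence[int],
    heights: Sequence[int],
    predict_quality: Predictor,
    bandwidth_mbps: float,
    velocity_deg_s: float,
) -> Config:
    """Return the (fps, height) pair with the highest predicted quality."""
    candidates = list(product(frame_rates, heights))
    if not candidates:
        raise ValueError("need at least one frame rate and one resolution")
    return max(
        candidates,
        key=lambda c: predict_quality(c[0], c[1], bandwidth_mbps, velocity_deg_s),
    )


def toy_predictor(
    fps: int, height: int, bandwidth_mbps: float, velocity_deg_s: float
) -> float:
    """Hypothetical stand-in for the trained network (not the paper's model)."""
    pixels_per_second = fps * height * height * 16 / 9  # assumes 16:9 frames
    bits_per_pixel = bandwidth_mbps * 1e6 / pixels_per_second
    coding_quality = min(1.0, bits_per_pixel / 0.05)  # fewer bits, more artifacts
    motion_quality = min(1.0, fps / (30.0 + velocity_deg_s))  # fast motion wants fps
    detail = height / 1080.0
    return coding_quality * (detail + motion_quality)


choice = pick_config([30, 60, 120], [540, 720, 1080], toy_predictor, 5.0, 20.0)
```

The exhaustive `max` over candidates is practical here because the set of frame rates and resolutions is small. Any predictor with the same signature can be swapped in.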

2026-05-09T14:48:10Z Yaru Liu Joseph G. March Rafal K. Mantiuk http://arxiv.org/abs/2604.23629v2 From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation 2026-05-09T12:26:33Z

Three-dimensional content generation has progressed from producing isolated, visually plausible shapes to constructing structured assets that can be deployed in real-time interactive environments. This trajectory is driven by converging demands from game development, embodied AI, world simulation, digital twins, and spatial computing, all of which require 3D content that goes beyond surface appearance to satisfy engine-level constraints on topology, UV parameterization, physically based materials, skeletal rigging, and physics-aware scene layout. Despite rapid advances in generative modeling, a persistent gap separates the outputs of current methods from the production-ready standard expected by interactive applications. This survey addresses that gap by organizing the literature around the asset production pipeline rather than algorithmic families. Along the horizontal axis we distinguish three asset tiers, namely general objects, characters, and scenes, while the vertical axis traces each tier through the full production lifecycle from data foundations and geometry synthesis through topology optimization, UV unwrapping, PBR appearance, rigging, and scene assembly. Through this two-dimensional taxonomy we assess not only what current methods can generate but whether their outputs are directly usable in downstream engines and simulation platforms. We further consolidate evaluation metrics and protocols that span geometric fidelity, appearance quality, asset usability, and scene-level physical plausibility. The survey concludes by identifying open challenges in data quality, generation controllability, end-to-end assetization, and physically grounded generation, and by situating production-ready 3D content as foundational infrastructure for emerging interactive world models and embodied intelligent systems.

2026-04-26T09:44:06Z Preprint. Jiafeng Wu and Zhuofan Lou contributed equally. Project page: https://christinebobby.github.io/production-ready-3d-survey/ Jiafeng Wu Zhuofan Lou Jian Liu Dazhao Du Chunchao Guo Song Guo http://arxiv.org/abs/2605.08824v1 HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis 2026-05-09T09:19:36Z

Hair is a rich medium of visual and cultural expression, yet its digital modeling remains challenging due to the duality of fluidity and structure. Many existing generative approaches rely primarily on continuous diffusion fields, which entangle global topology with local texture and obscure the semantic and structural organization of hairstyles. To address this, we propose HairGPT, a strand-centric framework that treats strands as generative primitives and formulates realistic 3D hairstyle synthesis as a dual-decoupled autoregressive sequence modeling problem. Our method applies spatial decoupling across semantic scalp regions and structural decoupling along a hierarchical strand representation, progressing from global layout to fine-grained style. We further introduce a geometric tokenizer and region-aware semantic annotations to guide strand-level generation, enabling compositional editing, synthesis of rare and complex hairstyles, and adaptation to stylized domains. By aligning generative modeling with the workflow of digital grooming, HairGPT turns hair generation from opaque texture synthesis into a structured and semantically controllable authoring process, supporting robust semantic conditioning and high-fidelity results across realistic and stylized domains. Project Page: https://haiminluo.github.io/hairgpt/

2026-05-09T09:19:36Z Accepted to SIGGRAPH 2026 (Journal Track) Haimin Luo Min Ouyang Lan Xu Jingyi Yu 10.1145/3811350 http://arxiv.org/abs/2605.08744v1 MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation 2026-05-09T07:11:26Z

Autoregressive (AR) models can generate high-quality low-poly meshes from point clouds, but they still operate in an all-or-nothing manner: when a local region is unsatisfactory, the entire mesh must be regenerated, wasting computation and destroying satisfactory mesh structure elsewhere. We introduce MeshFIM, a Fill-in-the-Middle (FIM) framework that regenerates a target region of a low-poly mesh conditioned on the surrounding context. MeshFIM addresses three mesh-specific challenges: enforcing exact attachment along the exposed boundary, preserving topological order in the context, and suppressing overflow beyond the intended region. It does so with five complementary design choices: boundary vertex markers, context positional embeddings, expanded context width, context augmentation, and a low-poly geometry encoder whose gated subtraction mechanism focuses generation on the missing region by leveraging the difference between the reference surface and the existing mesh. Detailed ablation studies show the effectiveness of each component. Based on MeshFIM, we demonstrate two applications: interactive brush-based editing and automatic defect repair on low-poly meshes (see Figure 1). Finally, experiments show that MeshFIM outperforms a range of baselines in mesh refinement, mesh repair, and whole-mesh generation with a stitch-back scheme.
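
The general fill-in-the-middle idea can be sketched on a flat token sequence, using the prefix, suffix, middle layout common in code language models. The abstract does not specify MeshFIM's token layout, so the sentinel tokens and the function below are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical sentinel tokens marking the three parts of the sequence.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim_sequence(
    tokens: Sequence[str], start: int, end: int
) -> Tuple[List[str], List[str]]:
    """Split `tokens` into a context prompt and the target to regenerate.

    tokens[start:end] is the region to fill in. Everything before and after
    it is kept as context, so the model only has to produce the middle part.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("start and end must delimit a valid slice")
    prefix, middle, suffix = tokens[:start], tokens[start:end], tokens[end:]
    prompt = [PRE, *prefix, SUF, *suffix, MID]
    return prompt, list(middle)


# Toy "mesh" of five face tokens; faces f1 and f2 are the region to redo.
prompt, target = to_fim_sequence(["f0", "f1", "f2", "f3", "f4"], 1, 3)
# prompt == ["<PRE>", "f0", "<SUF>", "f3", "f4", "<MID>"]
# target == ["f1", "f2"]
```

Because the suffix appears before the generated part, an autoregressive model can condition on context from both sides of the gap.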

2026-05-09T07:11:26Z Dingdong Yang Jian Liu Biwen Lei Haohan Weng Zhuo Chen Song Guo Hao Richard Zhang Ali Mahdavi Amiri Chunchao Guo http://arxiv.org/abs/2605.08729v1 Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation 2026-05-09T06:32:54Z

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

2026-05-09T06:32:54Z Shihao Cheng Jiaxu Zhang Quanyue Song Shansong Liu Zhizhi Guo Xiaolei Zhang Chi Zhang Xuelong Li Zhigang Tu http://arxiv.org/abs/2605.16355v1 Generative 3D Gaussians with Learned Density Control 2026-05-08T17:54:25Z

We present Density-Sampled Gaussians (DeG), a novel 3D representation designed to bridge the gap between adaptive rendering primitives and scalable generative modeling. Unlike existing approaches that constrain 3D Gaussians to fixed voxel grids or arrays, DeG models Gaussian centers as samples from a learnable probability density function defined over an octree. This formulation provides a rigorous mathematical framework for adaptive density control: by jointly optimizing the spatial density and Gaussian attributes under rendering supervision, our model naturally concentrates primitives in regions of high geometric complexity. We achieve this via a new render loss contribution gradient that serves as a fully differentiable analogue to the discrete densification and pruning heuristics used in standard Gaussian Splatting. The resulting representation is highly flexible, supporting variable-resolution decoding from a single latent code by simply adjusting the sampling budget. To enable generative synthesis, we train a latent diffusion model on DeG. We identify a critical challenge in applying diffusion to unordered set-structured latents, which can significantly slow convergence, and propose VecSeq, a canonical re-indexing mechanism that anchors latent tokens to a deterministic 3D Sobol sequence. This transforms the ambiguous set-generation problem into a robust sequence modeling task. Extensive experiments demonstrate that our pipeline achieves state-of-the-art quality in single-image-to-3D generation, combining the structural adaptivity of unstructured primitives with the training stability of grid-based methods.

2026-05-08T17:54:25Z 19 pages, 16 figures, SIGGRAPH Conference Papers '26 Runjie Yan Yan-Pei Cao Peng Wang Ding Liang Yuan-Chen Guo http://arxiv.org/abs/2604.19568v2 SpUDD: Superpower Contouring of Unsigned Distance Data 2026-05-08T14:50:46Z

Unsigned distance functions offer a powerful and flexible implicit surface representation that, unlike their signed counterparts, allows for surfaces that are open, non-orientable, or non-manifold. We consider the problem of reconstructing arbitrary surfaces from a finite set of samples of unsigned distance data. Existing methods for mesh reconstruction from distance data rely on sign information, accurate gradients, a corresponding continuous distance function, or extensive data-dependent training. However, they fail when applied to input that is both discrete and unsigned. Inspired by this challenge, we study the power diagram generated by the distance samples and propose a novel theoretical concept, the superpower contour, which we prove converges to the true surface in the limit of sampling density. We use this superpower contour as an initial surface proxy and design an algorithm that leverages it to produce a polygonal mesh approximating the unknown true geometry. Our method vastly outperforms other conceivable strategies for the discrete unsigned distance reconstruction task, and sets the stage for future work on this mathematically rich problem.
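
For readers unfamiliar with power diagrams, the sketch below treats each unsigned distance sample as a sphere (sample position as center, measured distance as radius) and uses the standard power distance |x - p|^2 - d^2. It does not implement the superpower contour itself, and the function names and sample values are illustrative assumptions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Sample = Tuple[Point, float]  # (sample position, unsigned distance to surface)


def power_distance(x: Point, sample: Sample) -> float:
    """Power distance of x to the sphere defined by a distance sample."""
    center, radius = sample
    return sum((a - b) ** 2 for a, b in zip(x, center)) - radius ** 2


def power_cell_index(x: Point, samples: Sequence[Sample]) -> int:
    """Index of the sample whose power cell contains x (smallest power distance)."""
    if not samples:
        raise ValueError("need at least one sample")
    return min(range(len(samples)), key=lambda i: power_distance(x, samples[i]))


# Toy data: the true surface is the plane z = 0, so each distance is |z|.
samples = [((0.0, 0.0, 1.0), 1.0), ((0.0, 0.0, 3.0), 3.0), ((2.0, 0.0, 2.0), 2.0)]
on_surface = (2.0, 0.0, 0.0)
cell = power_cell_index(on_surface, samples)
```

Since no surface point lies strictly inside any sample sphere, the power distance from a surface point to every sample is non-negative, and it is zero where a sphere touches the surface. In the toy data the query point is the touching point of the third sphere, so it falls in that sample's cell.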

2026-04-21T15:20:21Z Ningna Wang Xiana Carrera Christopher Batty Oded Stein Silvia Sellán http://arxiv.org/abs/2605.07450v1 LoBoFit: Flexible Garment Refitting via Local Bone Mapping Blending 2026-05-08T08:56:06Z

Garment refitting, the task of adapting a garment from a source to a target avatar, must preserve the original design features and fine-scale wrinkles, a challenge exacerbated by significant shape variations and varying poses without registration to a shared canonical pose. Existing methods struggle to balance robustness, efficiency, and fidelity of detail: physics-based simulation is costly, data-driven approaches lack generalizability, and geometry optimization in the full vertex space is often ill-conditioned and prone to local minima with unsatisfactory quality. We identify that a fundamental limitation lies in the representation: deforming garments directly in global coordinates couples vertices non-locally, creating a complex and poorly-structured optimization landscape. Therefore, we introduce LoBoFit, a robust refitting method built upon a novel Local Bone Mapping Blending (LoBoMap Blending) representation. Instead of manipulating global vertex positions, LoBoMap Blending expresses garment geometry as a linear blend of its mappings into local bone coordinate frames. This representation is highly expressive and flexible: local bone mappings yield a pose-robust initialization and a well-conditioned parameterization, while blending weights smooth the optimization landscape and broaden the space of plausible solutions for stable convergence with fine-scale detail preservation. The subsequent refinement efficiently resolves collisions and preserves details by optimizing localized residuals, effectively decomposing the complex global deformation into manageable subproblems. Our experiments demonstrate that LoBoFit reliably refits high-resolution, single- and multi-layer garments across avatars with large shape and topological differences, while faithfully preserving intricate wrinkles and the intended fit style, outperforming state-of-the-art methods in robustness and output quality.
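
One plausible reading of "a linear blend of its mappings into local bone coordinate frames" is sketched below in 2D: a garment vertex is stored as coordinates in each bone's local frame, and its world position is the weighted sum of those local coordinates mapped back through each bone. The abstract does not give the formulation, so the formula x = sum_b w_b (R_b l_b + t_b) and every name here are assumptions.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Bone = Tuple[float, Vec2]  # (rotation angle in radians, origin in world space)


def to_world(bone: Bone, local: Vec2) -> Vec2:
    """Map a point from the bone's local frame to world space."""
    angle, (ox, oy) = bone
    c, s = math.cos(angle), math.sin(angle)
    return (ox + c * local[0] - s * local[1], oy + s * local[0] + c * local[1])


def to_local(bone: Bone, world: Vec2) -> Vec2:
    """Map a world-space point into the bone's local frame (inverse of to_world)."""
    angle, (ox, oy) = bone
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = world[0] - ox, world[1] - oy
    return (c * dx + s * dy, -s * dx + c * dy)


def blend(bones: Sequence[Bone], locals_: Sequence[Vec2], weights: Sequence[float]) -> Vec2:
    """World position as a weighted sum of per-bone local mappings."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    x = y = 0.0
    for bone, local, w in zip(bones, locals_, weights):
        wx, wy = to_world(bone, local)
        x, y = x + w * wx, y + w * wy
    return (x, y)


# Encode a vertex against two source bones, then decode on shifted target bones.
source = [(0.0, (0.0, 0.0)), (math.pi / 2, (1.0, 0.0))]
target = [(0.0, (1.0, 0.0)), (math.pi / 2, (2.0, 0.0))]  # same pose, moved by (1, 0)
vertex = (1.0, 2.0)
local_coords = [to_local(b, vertex) for b in source]
refit = blend(target, local_coords, [0.3, 0.7])
```

Because the local coordinates are expressed relative to each bone, moving the skeleton carries the vertex with it. Here the refit vertex lands at approximately (2, 2), the original position shifted by the same offset as the bones.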

2026-05-08T08:56:06Z 14 pages including references Meng Zhang Yu Xin Feiya Guo Kaizhang Kang Mengyu Chu Ruizhen Hu http://arxiv.org/abs/2605.07385v1 Velocity-Space 3D Asset Editing 2026-05-08T07:42:12Z

Editing a 3D asset locally, modifying a target region while preserving the rest, is a fundamental requirement of native 3D editing. Existing methods enforce locality through mechanisms external to the generator, such as manual 3D masks, post-hoc voxel merging, or 2D multi-view lifting. None of them intervene where the corruption actually originates: inside the ODE sampler. For a rectified-flow generator to achieve faithful local editing, its velocity field should be strong over the target editing region while vanishing on preserved content. Yet a single velocity field can hardly satisfy both requirements simultaneously, leading to three problems: (i) identity leakage that keeps the edit signal non-zero on preserved regions; (ii) no dedicated edit-amplification channel, so strengthening the edit inevitably perturbs identity; and (iii) an identity drag at the geometry and material stages, where a global condition pulls every token toward the target. We propose VS3D (Velocity-Space 3D Asset editing), an inversion-free, training-free, and mask-free framework that addresses each problem with a targeted intervention inside the sampler. VS3D integrates three complementary modules, each corresponding to a specific stage of the editing pipeline. Reconstruction-Anchored Source Injection (RASI) absorbs identity leakage by turning the unconditional embedding into a per-step, asset-specific anchor calibrated through source reconstruction. Partial-Mean Guidance (PMG) amplifies the edit signal by contrasting high- and low-quality subsample estimates of the velocity difference, active only where a consistent edit exists. Twin-Agreement Residual injection (TAR) lets the sampler decide token by token what to preserve at the geometry and material stages.

2026-05-08T07:42:12Z Hao Liu Yuxuan Lin Jingfeng Guo Ruihang Chu Junjie Wang Ruotong Li Yujiu Yang http://arxiv.org/abs/2605.07254v1 High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images 2026-05-08T05:23:28Z

Multi-view mesh reconstruction remains a core challenge in computer graphics and vision, especially for recovering high-frequency geometry from sparse observations. Recent methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) rely on post-processing for mesh extraction, thereby limiting joint optimization of geometry and appearance. Implicit Moving Least Squares (IMLS) instead enables direct conversion of point clouds into signed distance and texture fields, supporting end-to-end reconstruction and rendering. However, existing IMLS formulations use exponential kernels that struggle with high-frequency detail. We introduce a compact polynomial kernel with local support and greater flexibility, allowing better control over frequency content and improved geometric fidelity. To further enhance fine details, we incorporate stochastic regularization with Laplacian filtering. Together, these improve the preservation of high-frequency structure while maintaining stable optimization. Experiments show state-of-the-art performance in both surface reconstruction and rendering, yielding more accurate geometry and sharper visuals from multi-view data.
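
The sketch below shows the classical IMLS signed distance estimate, f(x) = sum_i w_i(x) n_i.(x - p_i) / sum_i w_i(x), evaluated with a compactly supported polynomial weight. The abstract does not give the paper's kernel, so the form (1 - (r/h)^2)^k used here is an assumed stand-in, and all names and values are illustrative.

```python
import math
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def compact_poly_kernel(r: float, support: float, power: int = 3) -> float:
    """Polynomial weight that is exactly zero beyond `support` (assumed form)."""
    if r >= support:
        return 0.0
    return (1.0 - (r / support) ** 2) ** power


def imls_signed_distance(
    x: Vec3, points: Sequence[Vec3], normals: Sequence[Vec3], support: float
) -> Optional[float]:
    """Weighted average of the point-to-plane distances n_i . (x - p_i)."""
    numerator = denominator = 0.0
    for p, n in zip(points, normals):
        diff = tuple(a - b for a, b in zip(x, p))
        weight = compact_poly_kernel(math.sqrt(sum(d * d for d in diff)), support)
        numerator += weight * sum(ni * di for ni, di in zip(n, diff))
        denominator += weight
    if denominator == 0.0:
        return None  # x lies outside the support of every point
    return numerator / denominator


# Oriented points sampled from the plane z = 0.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 4
value = imls_signed_distance((0.5, 0.5, 0.25), points, normals, support=2.0)
```

For this planar example the estimate is approximately 0.25, the height of the query point above the plane. Unlike an exponential kernel, the compact weight drops to exactly zero, so distant points contribute nothing and queries far from all samples return `None`.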

2026-05-08T05:23:28Z 19 pages, 9 figures Nandhana Sunil Abhirami R Iyer Avirup Mandal http://arxiv.org/abs/2605.07252v1 PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation 2026-05-08T05:20:03Z

Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE based co-speech gesture generation methods improve generation quality but fail to encode semantic structure into the motion representation or explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework addressing both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure, where a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and perceptual user studies, with strong style consistency to the reference prompt. Our project page with demo videos is available at https://danny-nus.github.io/PersonaGest/
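
Residual vector quantization, the structure underlying an RVQ-VAE, can be sketched in a few lines: each level quantizes whatever the previous levels left unexplained. The codebooks below are made-up toy values. How PersonaGest assigns content and style to particular levels is not shown here.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, v: Vector) -> int:
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(v: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize v level by level, passing the residual to the next level."""
    residual = list(v)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Toy two-level codebooks: a coarse level and a fine residual level.
codebooks = [
    [(0.0, 0.0), (4.0, 4.0), (-4.0, 4.0)],
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, -0.5)],
]
codes = rvq_encode((4.9, 4.1), codebooks)
approx = rvq_decode(codes, codebooks)
```

In this toy case the first level selects (4, 4) and the second selects (1, 0), giving the reconstruction (5, 4). The split into a coarse token plus residual tokens is what allows different levels to carry different kinds of information.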

2026-05-08T05:20:03Z 26 pages, 10 figures, 12 tables Junchuan Zhao Qifan Liang Ye Wang http://arxiv.org/abs/2605.06593v1 ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting 2026-05-07T17:20:15Z

Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We propose a bilevel optimization framework that jointly adapts reference motions to a robot's morphology while training a tracking policy using reinforcement learning. To make the optimization tractable, we derive an approximate gradient for the upper-level loss. Our framework requires only a sparse set of semantic rigid-body correspondences and eliminates the need for manual tuning by identifying optimal values for a parameterization expressive enough to preserve characteristic motion across different embodiments. Moreover, by integrating retargeting directly with physics simulation, we produce physically plausible motions that facilitate robust imitation learning. We validate our method in simulation and on hardware, demonstrating challenging motions for morphologies that differ significantly from a human, including retargeting onto a quadruped.

2026-05-07T17:20:15Z SIGGRAPH 2026 David Müller Agon Serifi Sammy Christen Ruben Grandia Espen Knoop Moritz Bächer 10.1145/3811378 http://arxiv.org/abs/2604.26799v2 MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching 2026-05-07T06:27:58Z

3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0-1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34× compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20× compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus
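
The bit-width allocation problem can be read as a 0-1 program: pick exactly one bit-width per attribute group so that the estimated size stays within the budget and total distortion is minimal. The sketch below swaps in exhaustive search for an ILP solver, which is only workable for tiny instances. The linear size model (elements times bits) and the distortion table are hypothetical, not values from the paper.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple


def allocate_bit_widths(
    group_sizes: Sequence[int],
    distortion: Sequence[Dict[int, float]],
    bit_choices: Sequence[int],
    budget_bits: int,
) -> Optional[Tuple[int, ...]]:
    """Choose one bit-width per group, minimizing distortion under a size budget.

    Exhaustive search over all assignments (a stand-in for a 0-1 ILP solver).
    Returns None when no assignment fits the budget.
    """
    best_cost = None
    best_assignment = None
    for assignment in product(bit_choices, repeat=len(group_sizes)):
        # Assumed linear size estimate: number of elements times bits per element.
        size = sum(n * b for n, b in zip(group_sizes, assignment))
        if size > budget_bits:
            continue
        cost = sum(distortion[g][b] for g, b in enumerate(assignment))
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment


# Two hypothetical attribute groups with made-up distortion per bit-width.
group_sizes = [100, 50]
distortion = [{4: 3.0, 8: 1.0}, {4: 2.0, 8: 0.5}]
plan = allocate_bit_widths(group_sizes, distortion, (4, 8), budget_bits=1000)
```

With a 1000-bit budget, the search gives 8 bits to the larger group and 4 bits to the smaller one, because that feasible assignment has the lowest total distortion (3.0). A size estimate that is linear in the decision variables is what makes the problem expressible as an integer linear program.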

2026-04-29T15:30:06Z https://github.com/mmlab-sigs/mesongs_plus Shuzhao Xie Junchen Ge Weixiang Zhang Jiahang Liu Chen Tang Yunpeng Bai Shijia Ge Jingyan Jiang Yuzhi Huang Fengnian Yang Cong Zhang Xiaoyi Fan Zhi Wang http://arxiv.org/abs/2605.05711v1 Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling 2026-05-07T05:55:50Z

Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.

2026-05-07T05:55:50Z Anh H. Vo Sungyo Lee Phil-Joong Kim Soo-Mi Choi Yong-Guk Kim