https://arxiv.org/api/QLMuyl1/9VFB71PaZRXqZAqKnzM 2026-06-28T06:46:39Z 9390 1785 15 http://arxiv.org/abs/2508.04966v1 Laplacian Analysis Meets Dynamics Modelling: Gaussian Splatting for 4D Reconstruction 2025-08-07T01:39:29Z

While 3D Gaussian Splatting (3DGS) excels in static scene modeling, its extension to dynamic scenes introduces significant challenges. Existing dynamic 3DGS methods suffer from either over-smoothing due to low-rank decomposition or feature collision from high-dimensional grid sampling. This is because of the inherent spectral conflicts between preserving motion details and maintaining deformation consistency at different frequency. To address these challenges, we propose a novel dynamic 3DGS framework with hybrid explicit-implicit functions. Our approach contains three key innovations: a spectral-aware Laplacian encoding architecture which merges Hash encoding and Laplacian-based module for flexible frequency motion control, an enhanced Gaussian dynamics attribute that compensates for photometric distortions caused by geometric deformation, and an adaptive Gaussian split strategy guided by KDTree-based primitive control to efficiently query and optimize dynamic areas. Through extensive experiments, our method demonstrates state-of-the-art performance in reconstructing complex dynamic scenes, achieving better reconstruction fidelity.

2025-08-07T01:39:29Z Yifan Zhou Beizhen Zhao Pengcheng Wu Hao Wang http://arxiv.org/abs/2508.04965v1 Perceive-Sample-Compress: Towards Real-Time 3D Gaussian Splatting 2025-08-07T01:34:38Z

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated remarkable capabilities in real-time and photorealistic novel view synthesis. However, traditional 3DGS representations often struggle with large-scale scene management and efficient storage, particularly when dealing with complex environments or limited computational resources. To address these limitations, we introduce a novel perceive-sample-compress framework for 3D Gaussian Splatting. Specifically, we propose a scene perception compensation algorithm that intelligently refines Gaussian parameters at each level. This algorithm intelligently prioritizes visual importance for higher fidelity rendering in critical areas, while optimizing resource usage and improving overall visible quality. Furthermore, we propose a pyramid sampling representation to manage Gaussian primitives across hierarchical levels. Finally, to facilitate efficient storage of proposed hierarchical pyramid representations, we develop a Generalized Gaussian Mixed model compression algorithm to achieve significant compression ratios without sacrificing visual fidelity. The extensive experiments demonstrate that our method significantly improves memory efficiency and high visual quality while maintaining real-time rendering speed.

2025-08-07T01:34:38Z Zijian Wang Beizhen Zhao Hao Wang http://arxiv.org/abs/2508.04962v1 Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework 2025-08-07T01:20:41Z

Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.

2025-08-07T01:20:41Z To be published in IEEE Transactions on Circuits and Systems for Video Technology Peng Zhang Songru Yang Jinsheng Sun Weiqing Li Zhiyong Su 10.1109/TCSVT.2025.3596238 http://arxiv.org/abs/2508.04687v1 MienCap: Realtime Performance-Based Facial Animation with Live Mood Dynamics 2025-08-06T17:50:01Z

Our purpose is to improve performance-based animation which can drive believable 3D stylized characters that are truly perceptual. By combining traditional blendshape animation techniques with multiple machine learning models, we present both non-real time and real time solutions which drive character expressions in a geometrically consistent and perceptually valid way. For the non-real time system, we propose a 3D emotion transfer network makes use of a 2D human image to generate a stylized 3D rig parameters. For the real time system, we propose a blendshape adaption network which generates the character rig parameter motions with geometric consistency and temporally stability. We demonstrate the effectiveness of our system by comparing to a commercial product Faceware. Results reveal that ratings of the recognition, intensity, and attractiveness of expressions depicted for animated characters via our systems are statistically higher than Faceware. Our results may be implemented into the animation pipeline, and provide animators with a system for creating the expressions they wish to use more quickly and accurately.

2025-08-06T17:50:01Z IEEE VR extended authors version of the article published in 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). This work was supported by the European Union's Horizon 2020 research and innovation programme under Grant 101017779 Ye Pan Ruisi Zhang Jingying Wang Nengfu Chen Yilin Qiu Yu Ding Kenny Mitchell 10.1109/VRW55335.2022.00178 http://arxiv.org/abs/2508.04508v1 Surf3R: Rapid Surface Reconstruction from Sparse RGB Views in Seconds 2025-08-06T14:53:42Z

Current multi-view 3D reconstruction methods rely on accurate camera calibration and pose estimation, requiring complex and time-intensive pre-processing that hinders their practical deployment. To address this challenge, we introduce Surf3R, an end-to-end feedforward approach that reconstructs 3D surfaces from sparse views without estimating camera poses and completes an entire scene in under 10 seconds. Our method employs a multi-branch and multi-view decoding architecture in which multiple reference views jointly guide the reconstruction process. Through the proposed branch-wise processing, cross-view attention, and inter-branch fusion, the model effectively captures complementary geometric cues without requiring camera calibration. Moreover, we introduce a D-Normal regularizer based on an explicit 3D Gaussian representation for surface reconstruction. It couples surface normals with other geometric parameters to jointly optimize the 3D geometry, significantly improving 3D consistency and surface detail accuracy. Experimental results demonstrate that Surf3R achieves state-of-the-art performance on multiple surface reconstruction metrics on ScanNet++ and Replica datasets, exhibiting excellent generalization and efficiency.

2025-08-06T14:53:42Z Haodong Zhu Changbai Li Yangyang Ren Zichao Feng Xuhui Liu Hanlin Chen Xiantong Zhen Baochang Zhang http://arxiv.org/abs/2508.03445v2 Neighborhood-Preserving Voronoi Treemaps 2025-08-06T11:28:05Z

Voronoi treemaps are used to depict nodes and their hierarchical relationships simultaneously. However, in addition to the hierarchical structure, data attributes, such as co-occurring features or similarities, frequently exist. Examples include geographical attributes like shared borders between countries or contextualized semantic information such as embedding vectors derived from large language models. In this work, we introduce a Voronoi treemap algorithm that leverages data similarity to generate neighborhood-preserving treemaps. First, we extend the treemap layout pipeline to consider similarity during data preprocessing. We then use a Kuhn-Munkres matching of similarities to centroidal Voronoi tessellation (CVT) cells to create initial Voronoi diagrams with equal cell sizes for each level. Greedy swapping is used to improve the neighborhoods of cells to match the data's similarity further. During optimization, cell areas are iteratively adjusted to their respective sizes while preserving the existing neighborhoods. We demonstrate the practicality of our approach through multiple real-world examples drawn from infographics and linguistics. To quantitatively assess the resulting treemaps, we employ treemap metrics and measure neighborhood preservation.

2025-08-05T13:41:55Z Patrick Paetzold Rebecca Kehlbeck Yumeng Xue Bin Chen Yunhai Wang Oliver Deussen http://arxiv.org/abs/2507.23777v2 XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding 2025-08-06T09:51:03Z

Current auto-regressive models can generate high-quality, topologically precise meshes; however, they necessitate thousands-or even tens of thousands-of next-token predictions during inference, resulting in substantial latency. We introduce XSpecMesh, a quality-preserving acceleration method for auto-regressive mesh generation models. XSpecMesh employs a lightweight, multi-head speculative decoding scheme to predict multiple tokens in parallel within a single forward pass, thereby accelerating inference. We further propose a verification and resampling strategy: the backbone model verifies each predicted token and resamples any tokens that do not meet the quality criteria. In addition, we propose a distillation strategy that trains the lightweight decoding heads by distilling from the backbone model, encouraging their prediction distributions to align and improving the success rate of speculative predictions. Extensive experiments demonstrate that our method achieves a 1.7x speedup without sacrificing generation quality. Our code will be released.

2025-07-31T17:58:30Z Dian Chen Yansong Qu Xinyang Li Ming Li Shengchuan Zhang http://arxiv.org/abs/2412.03489v3 Stochastic Gradient Estimation for Higher-order Differentiable Rendering 2025-08-06T06:04:26Z

We derive methods to compute higher order differentials (Hessians and Hessian-vector products) of the rendering operator. Our approach is based on importance sampling of a convolution that represents the differentials of rendering parameters and shows to be applicable to both rasterization and path tracing. We further suggest an aggregate sampling strategy to importance-sample multiple dimensions of one convolution kernel simultaneously. We demonstrate that this information improves convergence when used in higher-order optimizers such as Newton or Conjugate Gradient relative to a gradient descent baseline in several inverse rendering tasks.

2024-12-04T17:26:21Z Zican Wang Michael Fischer Tobias Ritschel http://arxiv.org/abs/2508.04078v1 RLGS: Reinforcement Learning-Based Adaptive Hyperparameter Tuning for Gaussian Splatting 2025-08-06T04:37:39Z

Hyperparameter tuning in 3D Gaussian Splatting (3DGS) is a labor-intensive and expert-driven process, often resulting in inconsistent reconstructions and suboptimal results. We propose RLGS, a plug-and-play reinforcement learning framework for adaptive hyperparameter tuning in 3DGS through lightweight policy modules, dynamically adjusting critical hyperparameters such as learning rates and densification thresholds. The framework is model-agnostic and seamlessly integrates into existing 3DGS pipelines without architectural modifications. We demonstrate its generalization ability across multiple state-of-the-art 3DGS variants, including Taming-3DGS and 3DGS-MCMC, and validate its robustness across diverse datasets. RLGS consistently enhances rendering quality. For example, it improves Taming-3DGS by 0.7dB PSNR on the Tanks and Temple (TNT) dataset, under a fixed Gaussian budget, and continues to yield gains even when baseline performance saturates. Our results suggest that RLGS provides an effective and general solution for automating hyperparameter tuning in 3DGS training, bridging a gap in applying reinforcement learning to 3DGS.

2025-08-06T04:37:39Z 14 pages, 9 figures Zhan Li Huangying Zhan Changyang Li Qingan Yan Yi Xu http://arxiv.org/abs/2508.04732v1 LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation 2025-08-05T20:53:43Z

Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing & Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module, which acts as a "visual critic" to iteratively correct and optimize generated images. Evaluated on the challenging LongBench-T2I Benchmark, LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines. Notably, our framework demonstrates significant improvements in critical dimensions such as text rendering and pose expression, validating the effectiveness of LVLM integration for more controllable and higher-quality image generation.

2025-08-05T20:53:43Z Xiaoqi Dong Xiangyu Zhou Nicholas Evans Yujia Lin http://arxiv.org/abs/2505.10576v2 Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View 2025-08-05T19:39:13Z

High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ a single-view mesh-rendered image prior to enhancing gesture generation quality. However, the spatial complexity of hand gestures and the inherent limitations of single-view rendering make it difficult to capture complete gesture information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion. This multi-view prior with a dedicated dual stream encoder significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module, which can fuse the gesture localization features and multi-modal features to enhance the location-awareness of the MUFEN features to the gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations. The source code is available at https://github.com/fuqifan/MUFEN.

2025-05-14T00:15:29Z This nine pages paper has been accepted for publication in Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM 2025). This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI https://doi.org/10.1145/3746027.3755828 Qifan Fu Xu Chen Muhammad Asad Shanxin Yuan Changjae Oh Gregory Slabaugh 10.1145/3746027.3755828 http://arxiv.org/abs/2411.17044v2 4D Scaffold Gaussian Splatting with Dynamic-Aware Anchor Growing for Efficient and High-Fidelity Dynamic Scene Reconstruction 2025-08-05T13:18:37Z

Modeling dynamic scenes through 4D Gaussians offers high visual fidelity and fast rendering speeds, but comes with significant storage overhead. Recent approaches mitigate this cost by aggressively reducing the number of Gaussians. However, this inevitably removes Gaussians essential for high-quality rendering, leading to severe degradation in dynamic regions. In this paper, we introduce a novel 4D anchor-based framework that tackles the storage cost in different perspective. Rather than reducing the number of Gaussians, our method retains a sufficient quantity to accurately model dynamic contents, while compressing them into compact, grid-aligned 4D anchor features. Each anchor is processed by an MLP to spawn a set of neural 4D Gaussians, which represent a local spatiotemporal region. We design these neural 4D Gaussians to capture temporal changes with minimal parameters, making them well-suited for the MLP-based spawning. Moreover, we introduce a dynamic-aware anchor growing strategy to effectively assign additional anchors to under-reconstructed dynamic regions. Our method adjusts the accumulated gradients with Gaussians' temporal coverage, significantly improving reconstruction quality in dynamic regions. Experimental results highlight that our method achieves state-of-the-art visual quality in dynamic regions, outperforming all baselines by a large margin with practical storage costs.

2024-11-26T02:22:07Z Woong Oh Cho In Cho Seoha Kim Jeongmin Bae Youngjung Uh Seon Joo Kim http://arxiv.org/abs/2508.03415v1 Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN 2025-08-05T12:59:37Z

This paper presents Fd-CycleGAN, an image-to-image (I2I) translation framework that enhances latent representation learning to approximate real data distributions. Building upon the foundation of CycleGAN, our approach integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. We employ distribution-based loss metrics, including KL/JS divergence and log-based similarity measures, to explicitly quantify the alignment between real and generated image distributions in both spatial and frequency domains. To validate the efficacy of Fd-CycleGAN, we conduct experiments on diverse datasets -- Horse2Zebra, Monet2Photo, and a synthetically augmented Strike-off dataset. Compared to baseline CycleGAN and other state-of-the-art methods, our approach demonstrates superior perceptual quality, faster convergence, and improved mode diversity, particularly in low-data regimes. By effectively capturing local and global distribution characteristics, Fd-CycleGAN achieves more visually coherent and semantically consistent translations. Our results suggest that frequency-guided latent learning significantly improves generalization in image translation tasks, with promising applications in document restoration, artistic style transfer, and medical image synthesis. We also provide comparative insights with diffusion-based generative models, highlighting the advantages of our lightweight adversarial approach in terms of training efficiency and qualitative output.

2025-08-05T12:59:37Z This paper is currently under review for publication in an IEEE Transactions. If accepted, the copyright will be transferred to IEEE Shivangi Nigam Adarsh Prasad Behera Shekhar Verma P. Nagabhushan http://arxiv.org/abs/2508.03179v1 Advancing Precision in Multi-Point Cloud Fusion Environments 2025-08-05T07:43:52Z

This research focuses on visual industrial inspection by evaluating point clouds and multi-point cloud matching methods. We also introduce a synthetic dataset for quantitative evaluation of registration method and various distance metrics for point cloud comparison. Additionally, we present a novel CloudCompare plugin for merging multiple point clouds and visualizing surface defects, enhancing the accuracy and efficiency of automated inspection systems.

2025-08-05T07:43:52Z Accpeted for publication in Communications in Computer and Information Science, Springer Ulugbek Alibekov Vanessa Staderini Philipp Schneider Doris Antensteiner http://arxiv.org/abs/2508.01242v2 MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh 2025-08-05T05:55:00Z

We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses key limitations in existing methods, including the limited dataset scale when catering to LLMs' token length and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with 1500k+ samples, almost 50 times larger than previous methods, which aligns better with the LLM scaling law principles. Furthermore, we propose inferring face connectivity from vertices and local mesh assembly training strategies, significantly enhancing the LLMs' ability to capture mesh topology and spatial structures. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its great potential in processing text-serialized 3D meshes.

2025-08-02T07:37:37Z Accepted by ICCV. Project Website: https://sk-fun.fun/MeshLLM Shuangkang Fang I-Chao Shen Yufeng Wang Yi-Hsuan Tsai Yi Yang Shuchang Zhou Wenrui Ding Takeo Igarashi Ming-Hsuan Yang