http://arxiv.org/abs/2412.08484v4 MeshCone: Second-Order Cone Programming for Geometrically-Constrained Mesh Enhancement 2025-11-26T15:05:52Z

Modern mesh generation pipelines, whether learning-based or classical, often produce outputs that require post-processing to achieve production-quality geometry. This work introduces MeshCone, a convex optimization framework for guided mesh refinement that leverages reference geometry to correct deformed or degraded meshes. We formulate the problem as a second-order cone program where vertex positions are optimized to align with target geometry while enforcing smoothness through convex edge-length regularization. MeshCone performs geometry-aware optimization that preserves fine details while correcting structural defects. We demonstrate robust performance across 56 diverse object categories from ShapeNet and ThreeDScans, achieving superior refinement quality compared to Laplacian smoothing and unoptimized baselines while maintaining sub-second inference times. MeshCone is particularly suited for applications where reference geometry is available, such as mesh-from-template workflows, scan-to-CAD alignment, and quality assurance in asset production pipelines.
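
As a rough illustration of how vertex alignment plus edge-length regularization fits the second-order cone form, one generic formulation is sketched below. This is not taken from the paper: the symbols (refined vertices v_i, reference positions p_i, edge set E, weight lambda, and slack variables s_i and r_ij) are all assumptions introduced for this sketch.

```latex
\begin{aligned}
\min_{\{v_i\},\,\{s_i\},\,\{r_{ij}\}} \quad
  & \sum_{i \in \mathcal{V}} s_i \;+\; \lambda \sum_{(i,j) \in \mathcal{E}} r_{ij} \\
\text{subject to} \quad
  & \lVert v_i - p_i \rVert_2 \le s_i, \qquad i \in \mathcal{V}, \\
  & \lVert v_i - v_j \rVert_2 \le r_{ij}, \qquad (i,j) \in \mathcal{E}.
\end{aligned}
```

The objective is linear and every constraint bounds a Euclidean norm by a scalar variable, which is the standard second-order cone form; the paper's actual objective and constraints may differ.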

2024-12-11T15:48:25Z Alexander Valverde http://arxiv.org/abs/2511.21459v1 Resolution Where It Counts: Hash-based GPU-Accelerated 3D Reconstruction via Variance-Adaptive Voxel Grids 2025-11-26T14:50:24Z

Efficient and scalable 3D surface reconstruction from range data remains a core challenge in computer graphics and vision, particularly in real-time and resource-constrained scenarios. Traditional volumetric methods based on fixed-resolution voxel grids or hierarchical structures like octrees often suffer from memory inefficiency, computational overhead, and a lack of GPU support. We propose a novel variance-adaptive, multi-resolution voxel grid that dynamically adjusts voxel size based on the local variance of signed distance field (SDF) observations. Unlike prior multi-resolution approaches that rely on recursive octree structures, our method leverages a flat spatial hash table to store all voxel blocks, supporting constant-time access and full GPU parallelism. This design enables high memory efficiency and real-time scalability. We further demonstrate how our representation supports GPU-accelerated rendering through a parallel quad-tree structure for Gaussian Splatting, enabling effective control over splat density. Our open-source CUDA/C++ implementation achieves up to 13x speedup and 4x lower memory usage compared to fixed-resolution baselines while maintaining comparable reconstruction accuracy, offering a practical and extensible solution for high-performance 3D reconstruction.
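
The idea of a flat hash table keyed by voxel coordinates, with resolution chosen from SDF variance, can be sketched on the CPU in a few lines. This is only an illustration under stated assumptions, not the authors' CUDA implementation: the base voxel size, the variance thresholds, the "higher variance means finer voxels" rule, and all function names are hypothetical.

```python
import math
from statistics import pvariance

# Hypothetical parameters chosen for illustration only.
BASE_VOXEL = 0.08                    # coarsest voxel edge length
VARIANCE_THRESHOLDS = (1e-4, 1e-3)   # cut-offs between refinement levels

def choose_level(sdf_samples):
    """Pick a refinement level (0 = coarse, 2 = fine) from SDF variance.

    Assumption: higher local variance suggests more surface detail,
    so it maps to a finer level.
    """
    if len(sdf_samples) < 2:
        return 0
    var = pvariance(sdf_samples)
    return sum(var > t for t in VARIANCE_THRESHOLDS)

def voxel_key(point, level):
    """Integer key (level, ix, iy, iz) for the voxel containing `point`."""
    size = BASE_VOXEL / (2 ** level)
    return (level,) + tuple(math.floor(c / size) for c in point)

# A Python dict stands in for the flat spatial hash table: one lookup
# per voxel, with no tree traversal.
grid = {}

def integrate(point, sdf_samples):
    """Store the mean SDF observation in the voxel selected by variance."""
    level = choose_level(sdf_samples)
    key = voxel_key(point, level)
    grid[key] = sum(sdf_samples) / len(sdf_samples)
    return key
```

Including the level in the key lets voxels of different sizes share one table, which is one simple way to avoid a recursive octree in this toy setting.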

2025-11-26T14:50:24Z Accepted for publication in ACM Transactions on Graphics. Project site: https://rvp-group.github.io/mrhash/ Lorenzo De Rebotti Emanuele Giacomini Giorgio Grisetti Luca Di Giammarino 10.1145/3777909 http://arxiv.org/abs/2505.14306v2 A Remeshing Method via Adaptive Multiple Original-Facet-Clipping and Centroidal Voronoi Tessellation 2025-11-26T08:03:47Z

CVT (Centroidal Voronoi Tessellation)-based remeshing improves mesh quality by using the Voronoi-Delaunay framework to optimize vertex distribution, producing uniformly distributed vertices and regular triangles. Current CVT-based approaches fall into two categories: (1) exact methods (e.g., Geodesic CVT, Restricted Voronoi Diagrams) that ensure high quality but require significant computation; and (2) approximate methods that reduce computational complexity but yield only fair quality. To address this trade-off, we propose a CVT-based surface remeshing approach that balances quality and efficiency by clipping 3D centroidal Voronoi cells multiple times against curvature-adaptive original surface facets. The core idea is to adapt the number of clipping passes to local curvature, using the angle between the normal vectors of neighboring facets as a measure of local curvature. Experimental results demonstrate the effectiveness of our method.
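
The curvature measure described above, the angle between normals of neighboring facets, is simple to compute. The sketch below shows that computation together with a purely hypothetical linear rule that turns the angle into a number of clipping passes; the mapping, the cap of four passes, and the function names are assumptions, not the authors' scheme.

```python
import math

def facet_normal(a, b, c):
    """Unit normal of triangle (a, b, c); assumes a non-degenerate facet."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    length = math.sqrt(sum(x * x for x in n))
    return tuple(x / length for x in n)

def normal_angle(n1, n2):
    """Angle in radians between two unit normals, clamped for safety."""
    dot = sum(p * q for p, q in zip(n1, n2))
    return math.acos(max(-1.0, min(1.0, dot)))

def clipping_passes(angle, max_passes=4):
    """Hypothetical rule: flat regions get 1 pass, right angles get max_passes."""
    ratio = min(angle, math.pi / 2) / (math.pi / 2)
    return 1 + round((max_passes - 1) * ratio)

# Two facets meeting at a right angle (high curvature proxy).
n_floor = facet_normal((0, 0, 0), (1, 0, 0), (0, 1, 0))
n_wall = facet_normal((0, 0, 0), (0, 1, 0), (0, 0, 1))
passes = clipping_passes(normal_angle(n_floor, n_wall))
```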

2025-05-20T12:55:18Z Yue Fei Jingjing Liu Yuyou Yao Yusheng Peng Liping Zheng http://arxiv.org/abs/2511.21129v1 CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion 2025-11-26T07:27:11Z

We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

2025-11-26T07:27:11Z 27 pages, 18 figures, 9 tables. Project page: https://tele-ai.github.io/CtrlVDiff/ Dianbing Xi Jiepeng Wang Yuanzhi Liang Xi Qiu Jialun Liu Hao Pan Yuchi Huo Rui Wang Haibin Huang Chi Zhang Xuelong Li http://arxiv.org/abs/2511.21098v1 Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction 2025-11-26T06:34:58Z

Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

2025-11-26T06:34:58Z Gayoung Lee Junho Kim Jin-Hwa Kim Junmo Kim http://arxiv.org/abs/2511.20640v1 MotionV2V: Editing Motion in a Video 2025-11-25T18:57:25Z

While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V

2025-11-25T18:57:25Z Ryan Burgert Charles Herrmann Forrester Cole Michael S Ryoo Neal Wadhwa Andrey Voynov Nataniel Ruiz http://arxiv.org/abs/2511.20422v1 VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning 2025-11-25T15:48:49Z

Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.

2025-11-25T15:48:49Z Bo Pang Chenxi Xu Jierui Ren Guoping Wang Sheng Li http://arxiv.org/abs/2511.16249v2 Controllable Layer Decomposition for Reversible Multi-Layer Image Generation 2025-11-25T13:01:28Z

This work presents Controllable Layer Decomposition (CLD), a method for achieving fine-grained and controllable multi-layer separation of raster images. In practical workflows, designers typically generate and edit each RGBA layer independently before compositing them into a final raster image. However, this process is irreversible: once composited, layer-level editing is no longer possible. Existing methods commonly rely on image matting and inpainting, but remain limited in controllability and segmentation precision. To address these challenges, we propose two key modules: LayerDecompose-DiT (LD-DiT), which decouples image elements into distinct layers and enables fine-grained control; and Multi-Layer Conditional Adapter (MLCA), which injects target image information into multi-layer tokens to achieve precise conditional generation. To enable a comprehensive evaluation, we build a new benchmark and introduce tailored evaluation metrics. Experimental results show that CLD consistently outperforms existing methods in both decomposition quality and controllability. Furthermore, the separated layers produced by CLD can be directly manipulated in commonly used design tools such as PowerPoint, highlighting its practical value and applicability in real-world creative workflows. Our project is available at https://monkek123king.github.io/CLD_page/.
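
The irreversibility of compositing mentioned above can be seen with the standard Porter-Duff "over" operator. The sketch below is a generic illustration of that operator on single straight-alpha RGBA pixels, not part of CLD; it shows two different layer stacks that flatten to the same pixel, so the layers cannot be recovered from the composite alone.

```python
def over(fg, bg):
    """Composite a straight-alpha RGBA pixel `fg` over `bg` (Porter-Duff 'over').

    Channels are floats in [0, 1]; each pixel is an (r, g, b, a) tuple.
    """
    fa, ba = fg[3], bg[3]
    out_a = fa + ba * (1.0 - fa)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)
    rgb = tuple(
        (fc * fa + bc * ba * (1.0 - fa)) / out_a
        for fc, bc in zip(fg[:3], bg[:3])
    )
    return rgb + (out_a,)

RED = (1.0, 0.0, 0.0, 1.0)
GREEN = (0.0, 1.0, 0.0, 1.0)
BLUE = (0.0, 0.0, 1.0, 1.0)

# An opaque foreground hides the background entirely, so both stacks
# flatten to the same pixel and the hidden layer is lost.
flat_a = over(RED, BLUE)
flat_b = over(RED, GREEN)
```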

2025-11-20T11:27:21Z 19 pages, 14 figures Zihao Liu Zunnan Xu Shi Shu Jun Zhou Ruicheng Zhang Zhenchao Tang Xiu Li http://arxiv.org/abs/2505.19976v2 MAMM: Motion Control via Metric-Aligning Motion Matching 2025-11-25T11:51:24Z

We introduce a novel method for controlling a motion sequence with an arbitrary temporal control sequence via temporal alignment. Temporal alignment of motion has gained significant attention owing to its applications in motion control and retargeting. Traditional methods rely on either learned or hand-crafted cross-domain mappings between frames in the original and control domains, which often require large, paired, or annotated datasets and time-consuming training. Our approach, named Metric-Aligning Motion Matching, achieves alignment by solely considering within-domain distances. It computes distances among patches in each domain and seeks a matching that optimally aligns the two within-domain distances. This framework allows for the alignment of a motion sequence to various types of control sequences, including sketches, labels, audio, and another motion sequence, all without the need for manually defined mappings or training with annotated data. We demonstrate the effectiveness of our approach through applications in efficient motion control, showcasing its potential in practical scenarios.

2025-05-26T13:36:27Z 12 pages, SIGGRAPH 2025 (Conference Track) Project Page: https://ataga101.github.io/mamm-project-page/ Naoki Agata Takeo Igarashi 10.1145/3721238.3730665 http://arxiv.org/abs/2508.17811v2 MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting 2025-11-25T08:48:19Z

Surface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvement, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks. Project Page: https://hanzhichang.github.io/meshsplat_web

2025-08-25T09:04:20Z Accepted by AAAI 2026 Hanzhi Chang Ruijie Zhu Wenjie Chang Mulin Yu Yanzhe Liang Jiahao Lu Zhuoyuan Li Tianzhu Zhang http://arxiv.org/abs/2412.05700v2 Temporally Compressed 3D Gaussian Splatting for Dynamic Scenes 2025-11-25T08:18:56Z

Recent advancements in high-fidelity dynamic scene reconstruction have leveraged dynamic 3D Gaussians and 4D Gaussian Splatting for realistic scene representation. However, to make these methods viable for real-time applications such as AR/VR, gaming, and rendering on low-power devices, substantial reductions in memory usage and improvements in rendering efficiency are required. While many state-of-the-art methods prioritize lightweight implementations, they struggle to handle scenes with complex motions or long sequences. In this work, we introduce Temporally Compressed 3D Gaussian Splatting (TC3DGS), a novel technique designed specifically to compress dynamic 3D Gaussian representations effectively. TC3DGS selectively prunes Gaussians based on their temporal relevance and employs gradient-aware mixed-precision quantization to dynamically compress Gaussian parameters. In addition, TC3DGS exploits an adapted version of the Ramer-Douglas-Peucker algorithm to further reduce storage by interpolating Gaussian trajectories across frames. Our experiments on multiple datasets demonstrate that TC3DGS achieves up to 67x compression with minimal or no degradation in visual quality. More results and videos are provided in the supplementary material. Project Page: https://ahmad-jarrar.github.io/tc-3dgs/
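
For readers unfamiliar with Ramer-Douglas-Peucker, the classic polyline form of the algorithm is sketched below. It is the textbook version, using distance to the chord segment, and not the adapted variant used in TC3DGS, whose details are not given in the abstract. The idea is to keep only the points that deviate from a straight chord by more than a tolerance, so the dropped points can later be approximated by interpolation.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    t = 0.0 if denom == 0 else sum(x * y for x, y in zip(ap, ab)) / denom
    t = max(0.0, min(1.0, t))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(p, closest)

def rdp(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point that deviates most from the chord.
    index, dmax = max(
        ((i, point_segment_distance(points[i], first, last))
         for i in range(1, len(points) - 1)),
        key=lambda item: item[1],
    )
    if dmax <= epsilon:
        return [first, last]
    # Recurse on both halves; drop the duplicated split point when joining.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right

trajectory = [(0, 0), (1, 0.01), (2, 0), (3, 2), (4, 0)]
keyframes = rdp(trajectory, epsilon=0.1)
```

Here the nearly collinear point (1, 0.01) is dropped while the sharp excursion at (3, 2) is kept.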

2024-12-07T17:03:09Z Accepted at British Machine Vision Conference (BMVC) 2025 Saqib Javed Ahmad Jarrar Khan Corentin Dumery Chen Zhao Mathieu Salzmann http://arxiv.org/abs/2412.05718v3 RLZero: Direct Policy Inference from Language Without In-Domain Supervision 2025-11-25T07:32:45Z

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.

2024-12-07T18:31:16Z NeurIPS 2025, 26 pages Harshit Sikchi Siddhant Agarwal Pranaya Jajoo Samyak Parajuli Caleb Chuck Max Rudolph Peter Stone Amy Zhang Scott Niekum http://arxiv.org/abs/2505.23738v2 How Animals Dance (When You're Not Looking) 2025-11-25T06:13:27Z

We present a framework for generating music-synchronized, choreography-aware animal dance videos. Our framework introduces choreography patterns -- structured sequences of motion beats that define the long-range structure of a dance -- as a novel high-level control signal for dance video generation. These patterns can be automatically estimated from human dance videos. Starting from a few keyframes representing distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem that seeks the optimal keyframe structure to satisfy a specified choreography pattern of beats. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce dance videos of up to 30 seconds across a wide range of animals and music tracks.

2025-05-29T17:58:02Z Project page: https://how-animals-dance.github.io/ Xiaojuan Wang Aleksander Holynski Brian Curless Ira Kemelmacher Steve Seitz http://arxiv.org/abs/2511.19850v1 DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction 2025-11-25T02:28:53Z

Automatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road geometry is inherently curve-based and introduce the Bézier Graph, a differentiable parametric curve-based representation. The primary obstacle to adopting this representation is obtaining vector ground truth (GT), which is difficult to construct. We sidestep this bottleneck by reframing the task as a global optimization problem over the Bézier Graph. Our framework, DOGE, operationalizes this paradigm by learning a parametric Bézier Graph directly from segmentation masks, eliminating the need for curve GT. DOGE holistically optimizes the graph by alternating between two complementary modules: DiffAlign continuously optimizes geometry via differentiable rendering, while TopoAdapt uses discrete operators to refine its topology. Our method sets a new state-of-the-art on the large-scale SpaceNet and CityScale benchmarks, presenting a new paradigm for generating high-fidelity vector maps of road networks. We will release our code and related data.
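
As background on why Bézier curves suit gradient-based optimization, the sketch below evaluates a curve with de Casteljau's algorithm. It is a generic illustration, not DOGE's code: every output point is built from repeated linear interpolation of the control points, so it is a smooth (polynomial) function of them. The control points shown are hypothetical.

```python
def lerp(p, q, t):
    """Linear interpolation between points p and q at parameter t."""
    return tuple((1 - t) * a + t * b for a, b in zip(p, q))

def bezier_point(control_points, t):
    """Evaluate a Bézier curve of any degree at t in [0, 1] (de Casteljau)."""
    pts = list(control_points)
    while len(pts) > 1:
        # Each round replaces n points with n - 1 interpolated points.
        pts = [lerp(p, q, t) for p, q in zip(pts, pts[1:])]
    return pts[0]

# Hypothetical cubic segment standing in for one curved road edge.
segment = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
midpoint = bezier_point(segment, 0.5)
samples = [bezier_point(segment, i / 10) for i in range(11)]
```

Sampling the curve densely, as in `samples`, gives a polyline that could be compared against a segmentation mask in this toy setting.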

2025-11-25T02:28:53Z 11 pages, 6 figures Jiahui Sun Junran Lu Jinhui Yin Yishuo Xu Yuanqi Li Yanwen Guo http://arxiv.org/abs/2511.15586v3 MHR: Momentum Human Rig 2025-11-24T19:02:10Z

We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.

2025-11-19T16:18:02Z Aaron Ferguson Ahmed A. A. Osman Berta Bescos Carsten Stoll Chris Twigg Christoph Lassner David Otte Eric Vignola Fabian Prada Federica Bogo Igor Santesteban Javier Romero Jenna Zarate Jeongseok Lee Jinhyung Park Jinlong Yang John Doublestein Kishore Venkateshan Kris Kitani Ladislav Kavan Marco Dal Farra Matthew Hu Matthew Cioffi Michael Fabris Michael Ranieri Mohammad Modarres Petr Kadlecek Rawal Khirodkar Rinat Abdrashitov Romain Prévost Roman Rajbhandari Ronald Mallet Russell Pearsall Sandy Kao Sanjeev Kumar Scott Parrish Shoou-I Yu Shunsuke Saito Takaaki Shiratori Te-Li Wang Tony Tung Yichen Xu Yuan Dong Yuhua Chen Yuanlu Xu Yuting Ye Zhongshi Jiang