https://arxiv.org/api/AUpaWX78az7dr7SJM1d5vaY8HAg2026-06-25T13:24:00Z9383124515http://arxiv.org/abs/2412.08484v4MeshCone: Second-Order Cone Programming for Geometrically-Constrained Mesh Enhancement2025-11-26T15:05:52ZModern mesh generation pipelines whether learning-based or classical often produce outputs requiring post-processing to achieve production-quality geometry. This work introduces MeshCone, a convex optimization framework for guided mesh refinement that leverages reference geometry to correct deformed or degraded meshes. We formulate the problem as a second-order cone program where vertex positions are optimized to align with target geometry while enforcing smoothness through convex edge-length regularization. MeshCone performs geometry-aware optimization that preserves fine details while correcting structural defects. We demonstrate robust performance across 56 diverse object categories from ShapeNet and ThreeDScans, achieving superior refinement quality compared to Laplacian smoothing and unoptimized baselines while maintaining sub-second inference times. MeshCone is particularly suited for applications where reference geometry is available, such as mesh-from-template workflows, scan-to-CAD alignment, and quality assurance in asset production pipelines.2024-12-11T15:48:25ZAlexander Valverdehttp://arxiv.org/abs/2511.21459v1Resolution Where It Counts: Hash-based GPU-Accelerated 3D Reconstruction via Variance-Adaptive Voxel Grids2025-11-26T14:50:24ZEfficient and scalable 3D surface reconstruction from range data remains a core challenge in computer graphics and vision, particularly in real-time and resource-constrained scenarios. Traditional volumetric methods based on fixed-resolution voxel grids or hierarchical structures like octrees often suffer from memory inefficiency, computational overhead, and a lack of GPU support. We propose a novel variance-adaptive, multi-resolution voxel grid that dynamically adjusts voxel size based on the local variance of signed distance field (SDF) observations. Unlike prior multi-resolution approaches that rely on recursive octree structures, our method leverages a flat spatial hash table to store all voxel blocks, supporting constant-time access and full GPU parallelism. This design enables high memory efficiency and real-time scalability. We further demonstrate how our representation supports GPU-accelerated rendering through a parallel quad-tree structure for Gaussian Splatting, enabling effective control over splat density. Our open-source CUDA/C++ implementation achieves up to 13x speedup and 4x lower memory usage compared to fixed-resolution baselines, while maintaining on par results in terms of reconstruction accuracy, offering a practical and extensible solution for high-performance 3D reconstruction.2025-11-26T14:50:24ZAccepted for publication in ACM Transaction on Graphics. Project site: https://rvp-group.github.io/mrhash/Lorenzo De RebottiEmanuele GiacominiGiorgio GrisettiLuca Di Giammarino10.1145/3777909http://arxiv.org/abs/2505.14306v2A Remeshing Method via Adaptive Multiple Original-Facet-Clipping and Centroidal Voronoi Tessellation2025-11-26T08:03:47ZCVT (Centroidal Voronoi Tessellation)-based remeshing optimizes mesh quality by leveraging the Voronoi-Delaunay framework to optimize vertex distribution and produce uniformly distributed vertices with regular triangles. Current CVT-based approaches can be classified into two categories: (1) exact methods (e.g., Geodesic CVT, Restricted Voronoi Diagrams) that ensure high quality but require significant computation; and (2) approximate methods that try to reduce computational complexity yet result in fair quality. To address this trade-off, we propose a CVT-based surface remeshing approach that achieves balanced optimization between quality and efficiency through multiple clipping times of 3D Centroidal Voronoi cells with curvature-adaptive original surface facets. The core idea of the method is that we adaptively adjust the number of clipping times according to local curvature, and use the angular relationship between the normal vectors of neighboring facets to represent the magnitude of local curvature. Experimental results demonstrate the effectiveness of our method.2025-05-20T12:55:18ZYue FeiJingjing LiuYuyou YaoYusheng PengLiping Zhenghttp://arxiv.org/abs/2511.21129v1CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion2025-11-26T07:27:11ZWe tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation.
However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations.
We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.2025-11-26T07:27:11Z27 pages, 18 figures, 9 tables. Project page: https://tele-ai.github.io/CtrlVDiff/Dianbing XiJiepeng WangYuanzhi LiangXi QiuJialun LiuHao PanYuchi HuoRui WangHaibin HuangChi ZhangXuelong Lihttp://arxiv.org/abs/2511.21098v1Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction2025-11-26T06:34:58ZUnderstanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.2025-11-26T06:34:58ZGayoung LeeJunho KimJin-Hwa KimJunmo Kimhttp://arxiv.org/abs/2511.20640v1MotionV2V: Editing Motion in a Video2025-11-25T18:57:25ZWhile generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V2025-11-25T18:57:25ZRyan BurgertCharles HerrmannForrester ColeMichael S RyooNeal WadhwaAndrey VoynovNataniel Ruizhttp://arxiv.org/abs/2511.20422v1VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning2025-11-25T15:48:49ZUnderstanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.2025-11-25T15:48:49ZBo PangChenxi XuJierui RenGuoping WangSheng Lihttp://arxiv.org/abs/2511.16249v2Controllable Layer Decomposition for Reversible Multi-Layer Image Generation2025-11-25T13:01:28ZThis work presents Controllable Layer Decomposition (CLD), a method for achieving fine-grained and controllable multi-layer separation of raster images. In practical workflows, designers typically generate and edit each RGBA layer independently before compositing them into a final raster image. However, this process is irreversible: once composited, layer-level editing is no longer possible. Existing methods commonly rely on image matting and inpainting, but remain limited in controllability and segmentation precision. To address these challenges, we propose two key modules: LayerDecompose-DiT (LD-DiT), which decouples image elements into distinct layers and enables fine-grained control; and Multi-Layer Conditional Adapter (MLCA), which injects target image information into multi-layer tokens to achieve precise conditional generation. To enable a comprehensive evaluation, we build a new benchmark and introduce tailored evaluation metrics. Experimental results show that CLD consistently outperforms existing methods in both decomposition quality and controllability. Furthermore, the separated layers produced by CLD can be directly manipulated in commonly used design tools such as PowerPoint, highlighting its practical value and applicability in real-world creative workflows. Our project is available at https://monkek123king.github.io/CLD_page/.2025-11-20T11:27:21Z19 pages, 14 figuresZihao LiuZunnan XuShi ShuJun ZhouRuicheng ZhangZhenchao TangXiu Lihttp://arxiv.org/abs/2505.19976v2MAMM: Motion Control via Metric-Aligning Motion Matching2025-11-25T11:51:24ZWe introduce a novel method for controlling a motion sequence using an arbitrary temporal control sequence using temporal alignment. Temporal alignment of motion has gained significant attention owing to its applications in motion control and retargeting. Traditional methods rely on either learned or hand-craft cross-domain mappings between frames in the original and control domains, which often require large, paired, or annotated datasets and time-consuming training. Our approach, named Metric-Aligning Motion Matching, achieves alignment by solely considering within-domain distances. It computes distances among patches in each domain and seeks a matching that optimally aligns the two within-domain distances. This framework allows for the alignment of a motion sequence to various types of control sequences, including sketches, labels, audio, and another motion sequence, all without the need for manually defined mappings or training with annotated data. We demonstrate the effectiveness of our approach through applications in efficient motion control, showcasing its potential in practical scenarios.2025-05-26T13:36:27Z12 pages, SIGGRAPH 2025 (Conference Track) Project Page: https://ataga101.github.io/mamm-project-page/Naoki AgataTakeo Igarashi10.1145/3721238.3730665http://arxiv.org/abs/2508.17811v2MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting2025-11-25T08:48:19ZSurface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvement, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks. Project Page: https://hanzhichang.github.io/meshsplat_web2025-08-25T09:04:20ZAccepted by AAAI 2026Hanzhi ChangRuijie ZhuWenjie ChangMulin YuYanzhe LiangJiahao LuZhuoyuan LiTianzhu Zhanghttp://arxiv.org/abs/2412.05700v2Temporally Compressed 3D Gaussian Splatting for Dynamic Scenes2025-11-25T08:18:56ZRecent advancements in high-fidelity dynamic scene reconstruction have leveraged dynamic 3D Gaussians and 4D Gaussian Splatting for realistic scene representation. However, to make these methods viable for real-time applications such as AR/VR, gaming, and rendering on low-power devices, substantial reductions in memory usage and improvements in rendering efficiency are required. While many state-of-the-art methods prioritize lightweight implementations, they struggle in handling {scenes with complex motions or long sequences}. In this work, we introduce Temporally Compressed 3D Gaussian Splatting (TC3DGS), a novel technique designed specifically to effectively compress dynamic 3D Gaussian representations. TC3DGS selectively prunes Gaussians based on their temporal relevance and employs gradient-aware mixed-precision quantization to dynamically compress Gaussian parameters. In addition, TC3DGS exploits an adapted version of the Ramer-Douglas-Peucker algorithm to further reduce storage by interpolating Gaussian trajectories across frames. Our experiments on multiple datasets demonstrate that TC3DGS achieves up to 67$\times$ compression with minimal or no degradation in visual quality. More results and videos are provided in the supplementary. Project Page: https://ahmad-jarrar.github.io/tc-3dgs/2024-12-07T17:03:09ZAccepted at British Machine Vision Conference (BMVC) 2025Saqib JavedAhmad Jarrar KhanCorentin DumeryChen ZhaoMathieu Salzmannhttp://arxiv.org/abs/2412.05718v3RLZero: Direct Policy Inference from Language Without In-Domain Supervision2025-11-25T07:32:45ZThe reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.2024-12-07T18:31:16ZNeurIPS 2025, 26 pagesHarshit SikchiSiddhant AgarwalPranaya JajooSamyak ParajuliCaleb ChuckMax RudolphPeter StoneAmy ZhangScott Niekumhttp://arxiv.org/abs/2505.23738v2How Animals Dance (When You're Not Looking)2025-11-25T06:13:27ZWe present a framework for generating music-synchronized, choreography aware animal dance videos. Our framework introduces choreography patterns -- structured sequences of motion beats that define the long-range structure of a dance -- as a novel high-level control signal for dance video generation. These patterns can be automatically estimated from human dance videos. Starting from a few keyframes representing distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem that seeks the optimal keyframe structure to satisfy a specified choreography pattern of beats. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using an video diffusion model. With as few as six input keyframes, our method can produce up to 30 seconds dance videos across a wide range of animals and music tracks.2025-05-29T17:58:02ZProject page: https://how-animals-dance.github.io/Xiaojuan WangAleksander HolynskiBrian CurlessIra KemelmacherSteve Seitzhttp://arxiv.org/abs/2511.19850v1DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction2025-11-25T02:28:53ZAutomatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road geometry is inherently curve-based and introduce the Bézier Graph, a differentiable parametric curve-based representation. The primary obstacle to this representation is to obtain the difficult-to-construct vector ground-truth (GT). We sidestep this bottleneck by reframing the task as a global optimization problem over the Bézier Graph. Our framework, DOGE, operationalizes this paradigm by learning a parametric Bézier Graph directly from segmentation masks, eliminating the need for curve GT. DOGE holistically optimizes the graph by alternating between two complementary modules: DiffAlign continuously optimizes geometry via differentiable rendering, while TopoAdapt uses discrete operators to refine its topology. Our method sets a new state-of-the-art on the large-scale SpaceNet and CityScale benchmarks, presenting a new paradigm for generating high-fidelity vector maps of road networks. We will release our code and related data.2025-11-25T02:28:53Z11 pages, 6 figuresJiahui SunJunran LuJinhui YinYishuo XuYuanqi LiYanwen Guohttp://arxiv.org/abs/2511.15586v3MHR: Momentum Human Rig2025-11-24T19:02:10ZWe present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.2025-11-19T16:18:02ZAaron FergusonAhmed A. A. OsmanBerta BescosCarsten StollChris TwiggChristoph LassnerDavid OtteEric VignolaFabian PradaFederica BogoIgor SantestebanJavier RomeroJenna ZarateJeongseok LeeJinhyung ParkJinlong YangJohn DoublesteinKishore VenkateshanKris KitaniLadislav KavanMarco Dal FarraMatthew HuMatthew CioffiMichael FabrisMichael RanieriMohammad ModarresPetr KadlecekRawal KhirodkarRinat AbdrashitovRomain PrévostRoman RajbhandariRonald MalletRussell PearsallSandy KaoSanjeev KumarScott ParrishShoou-I YuShunsuke SaitoTakaaki ShiratoriTe-Li WangTony TungYichen XuYuan DongYuhua ChenYuanlu XuYuting YeZhongshi Jiang