https://arxiv.org/api//4Z4gcbPBtZrYts3CPlkLjBKuZY 2026-07-01T12:27:23Z 9421 2145 15 http://arxiv.org/abs/2408.02211v2 SceneMotifCoder: Example-driven Visual Program Learning for Generating 3D Object Arrangements 2025-06-03T10:28:20Z Despite advances in text-to-3D generation methods, generation of multi-object arrangements remains challenging. Current methods exhibit failures in generating physically plausible arrangements that respect the provided text description. We present SceneMotifCoder (SMC), an example-driven framework for generating 3D object arrangements through visual program learning. SMC leverages large language models (LLMs) and program synthesis to overcome these challenges by learning visual programs from example arrangements. These programs are generalized into compact, editable meta-programs. When combined with 3D object retrieval and geometry-aware optimization, they can be used to create object arrangements varying in arrangement structure and contained objects. Our experiments show that SMC generates high-quality arrangements using meta-programs learned from few examples. Evaluation results demonstrate that object arrangements generated by SMC better conform to user-specified text descriptions and are more physically plausible when compared with state-of-the-art text-to-3D generation and layout methods. 2024-08-05T03:24:45Z Accepted at 3DV 2025 (Oral). Project page: https://3dlg-hcvc.github.io/smc/. Minor revisions for camera-ready version Hou In Ivan Tam Hou In Derek Pun Austin T. Wang Angel X. Chang Manolis Savva http://arxiv.org/abs/2506.02661v1 MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation 2025-06-03T09:12:48Z Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. 
Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose MotionRAG-Diff, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without paired data; (2) An optimized motion graph system for efficient retrieval and seamless concatenation of motion segments, ensuring realism and temporal coherence across long sequences; (3) A multi-condition diffusion model that jointly conditions on raw music signals and contrastive features to enhance motion quality and global synchronization. Extensive experiments demonstrate that MotionRAG-Diff achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy. This work establishes a new paradigm for music-driven dance generation by synergizing retrieval-based template fidelity with diffusion-based creative enhancement. 2025-06-03T09:12:48Z 12 pages, 5 figures Mingyang Huang Peng Zhang Bang Zhang http://arxiv.org/abs/2506.02620v1 FlexPainter: Flexible and Multi-View Consistent Texture Generation 2025-06-03T08:36:03Z Texture map production is an important part of 3D modeling and determines the rendering quality. Recently, diffusion-based methods have opened a new way for texture generation. However, restricted control flexibility and limited prompt modalities may prevent creators from producing desired results. 
Furthermore, inconsistencies between generated multi-view images often lead to poor texture generation quality. To address these issues, we introduce FlexPainter, a novel texture generation pipeline that enables flexible multi-modal conditional guidance and achieves highly consistent texture generation. A shared conditional embedding space is constructed to perform flexible aggregation between different input modalities. Utilizing this embedding space, we present an image-based CFG method to decompose structural and style information, achieving reference image-based stylization. Leveraging the 3D knowledge within the image diffusion prior, we first generate multi-view images simultaneously using a grid representation to enhance global understanding. Meanwhile, we propose a view synchronization and adaptive weighting module during diffusion sampling to further ensure local consistency. Finally, a 3D-aware texture completion model combined with a texture enhancement model is used to generate seamless, high-resolution texture maps. Comprehensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods in both flexibility and generation quality. 2025-06-03T08:36:03Z 11 pages, 10 figures in main paper, 10 pages, 12 figures in supplementary Dongyu Yan Leyi Wu Jiantao Lin Luozhou Wang Tianshuo Xu Zhifei Chen Zhen Yang Lie Xu Shunsi Zhang Yingcong Chen http://arxiv.org/abs/2506.02380v1 EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR 2025-06-03T02:32:35Z 3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. 
Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites, using Meta Quest Pro headsets, recording the head pose and eye gaze data for each rendered frame during free world-standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: https://symmru.github.io/EyeNavGS/. 2025-06-03T02:32:35Z Zihao Ding Cheng-Tse Lee Mufeng Zhu Tao Guan Yuan-Chun Sun Cheng-Hsin Hsu Yao Liu http://arxiv.org/abs/2502.03498v3 Controllable Satellite-to-Street-View Synthesis with Precise Pose Alignment and Zero-Shot Environmental Control 2025-06-03T02:10:18Z Generating street-view images from satellite imagery is a challenging task, particularly in maintaining accurate pose alignment and incorporating diverse environmental conditions. While diffusion models have shown promise in generative tasks, their ability to maintain strict pose alignment throughout the diffusion process is limited. In this paper, we propose a novel Iterative Homography Adjustment (IHA) scheme applied during the denoising process, which effectively addresses pose misalignment and ensures spatial consistency in the generated street-view images. 
Additionally, currently available datasets for satellite-to-street-view generation are limited in their diversity of illumination and weather conditions, thereby restricting the generalizability of the generated outputs. To mitigate this, we introduce a text-guided illumination and weather-controlled sampling strategy that enables fine-grained control over the environmental factors. Extensive quantitative and qualitative evaluations demonstrate that our approach significantly improves pose accuracy and enhances the diversity and realism of generated street-view images, setting a new benchmark for satellite-to-street-view generation tasks. 2025-02-05T09:06:39Z Xianghui Ze Zhenbo Song Qiwei Wang Jianfeng Lu Yujiao Shi http://arxiv.org/abs/2503.21555v2 SyncSDE: A Probabilistic Framework for Diffusion Synchronization 2025-06-03T00:23:52Z There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often produce suboptimal results when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused: modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification. 2025-03-27T14:40:53Z Accepted to CVPR2025. 
Project Page: https://hjl1013.github.io/SyncSDE/ Hyunjun Lee Hyunsoo Lee Sookwan Han http://arxiv.org/abs/2506.02219v1 Stochastic Barnes-Hut Approximation for Fast Summation on the GPU 2025-06-02T20:02:25Z We present a novel stochastic version of the Barnes-Hut approximation. Regarding the level-of-detail (LOD) family of approximations as control variates, we construct an unbiased estimator of the kernel sum being approximated. Through several examples in graphics applications such as winding number computation and smooth distance evaluation, we demonstrate that our method is well-suited for GPU computation, capable of outperforming a GPU-optimized implementation of the deterministic Barnes-Hut approximation by achieving equal median error in up to 9.4x less time. 2025-06-02T20:02:25Z 11 pages, 9 figures. To appear in ACM SIGGRAPH 2025 Abhishek Madan Nicholas Sharp Francis Williams Ken Museth David I. W. Levin 10.1145/3721238.3730725 http://arxiv.org/abs/2506.01591v1 Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation 2025-06-02T12:26:46Z Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. 
To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply anti-purification loss in LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques. Code: https://github.com/yuangan/Silencer. 2025-06-02T12:26:46Z Accepted to CVPR 2025 Yuan Gan Jiaxu Miao Yunze Wang Yi Yang http://arxiv.org/abs/2503.19753v3 A Survey on Event-driven 3D Reconstruction: Development under Different Categories 2025-06-02T05:58:10Z Event cameras have gained increasing attention for 3D reconstruction due to their high temporal resolution, low latency, and high dynamic range. They capture per-pixel brightness changes asynchronously, allowing accurate reconstruction under fast motion and challenging lighting conditions. In this survey, we provide a comprehensive review of event-driven 3D reconstruction methods, including stereo, monocular, and multimodal systems. We further categorize recent developments based on geometric, learning-based, and hybrid approaches. Emerging trends, such as neural radiance fields and 3D Gaussian splatting with event data, are also covered. The related works are structured chronologically to illustrate the innovations and progression within the field. To support future research, we also highlight key research gaps and future research directions in dataset, experiment, evaluation, event representation, etc. 2025-03-25T15:16:53Z We have decided not to submit this article and plan to withdraw it from public display. 
The content of this article will be presented in a more comprehensive form in another work Chuanzhi Xu Haoxian Zhou Haodong Chen Vera Chung Qiang Qu http://arxiv.org/abs/2506.01288v1 WishGI: Lightweight Static Global Illumination Baking via Spherical Harmonics Fitting 2025-06-02T03:50:45Z Global illumination combines direct and indirect lighting to create realistic lighting effects, bringing virtual scenes closer to reality. Static global illumination is a crucial component of virtual scene rendering, leveraging precomputation and baking techniques to significantly reduce runtime computational costs. Unfortunately, many existing works prioritize visual quality by relying on extensive texture storage and massive pixel-level texture sampling, leading to large performance overhead. In this paper, we introduce an illumination reconstruction method that effectively reduces sampling in fragment shader and avoids additional render passes, making it well-suited for low-end platforms. To achieve high-quality global illumination with reduced memory usage, we adopt a spherical harmonics fitting approach for baking effective illumination information and propose an inverse probe distribution method that generates unique probe associations for each mesh. This association, which can be generated offline in the local space, ensures consistent lighting quality across all instances of the same mesh. As a consequence, our method delivers highly competitive lighting effects while using only approximately 5% of the memory required by mainstream industry techniques. 2025-06-02T03:50:45Z Junke Zhu Zehan Wu Qixing Zhang Cheng Liao Zhangjin Huang http://arxiv.org/abs/2410.00890v3 Flex3D: Feed-Forward 3D Generation with Flexible Reconstruction Model and Input View Curation 2025-06-02T03:28:40Z Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. 
Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. Subsequently, a view selection pipeline filters these views based on quality and consistency, ensuring that only the high-quality and reliable views are used for reconstruction. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. FlexRM directly outputs 3D Gaussian points leveraging a tri-plane representation, enabling efficient and detailed 3D generation. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models. 2024-10-01T17:29:43Z ICML 25. 
Project page: https://junlinhan.github.io/projects/flex3d/ Junlin Han Jianyuan Wang Andrea Vedaldi Philip Torr Filippos Kokkinos http://arxiv.org/abs/2312.05984v2 Accurate Differential Operators for Hybrid Neural Fields 2025-06-01T22:32:20Z Neural fields have become widely used in various fields, from shape representation to neural rendering, and for solving partial differential equations (PDEs). With the advent of hybrid neural field representations like Instant NGP that leverage small MLPs and explicit representations, these models train quickly and can fit large scenes. Yet in many applications like rendering and simulation, hybrid neural fields can cause noticeable and unreasonable artifacts. This is because they do not yield accurate spatial derivatives needed for these downstream applications. In this work, we propose two ways to circumvent these challenges. Our first approach is a post hoc operator that uses local polynomial fitting to obtain more accurate derivatives from pre-trained hybrid neural fields. Additionally, we also propose a self-supervised fine-tuning approach that refines the hybrid neural field to yield accurate derivatives directly while preserving the initial signal. We show applications of our method to rendering, collision simulation, and solving PDEs. We observe that using our approach yields more accurate derivatives, reducing artifacts and leading to more accurate simulations in downstream applications. 2023-12-10T20:14:58Z Accepted in CVPR 2025. Project page is available at https://justachetan.github.io/hnf-derivatives/ Aditya Chetan Guandao Yang Zichen Wang Steve Marschner Bharath Hariharan http://arxiv.org/abs/2506.01077v1 TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans 2025-06-01T16:27:24Z Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. 
However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching 2025-06-01T16:27:24Z 24 pages, 12 figures Yueqian Guo Tianzhao Li Xin Lyu Jiehaolin Chen Zhaohan Wang Sirui Xiao Yurun Chen Yezi He Helin Li Fan Zhang http://arxiv.org/abs/2506.00988v1 LensCraft: Your Professional Virtual Cinematographer 2025-06-01T12:43:55Z Digital creators, from indie filmmakers to animation studios, face a persistent bottleneck: translating their creative vision into precise camera movements. Despite significant progress in computer vision and artificial intelligence, current automated filming systems struggle with a fundamental trade-off between mechanical execution and creative intent. 
Crucially, almost all previous works simplify the subject to a single point, ignoring its orientation and true volume, which severely limits spatial awareness during filming. LensCraft solves this problem by mimicking the expertise of a professional cinematographer, using a data-driven approach that combines cinematographic principles with the flexibility to adapt to dynamic scenes in real time. Our solution combines a specialized simulation framework for generating high-fidelity training data with an advanced neural model that is faithful to the script while being aware of the volume and dynamic behavior of the subject. Additionally, our approach allows for flexible control via various input modalities, including text prompts, subject trajectory and volume, key points, or a full camera trajectory, offering creators a versatile tool to guide camera movements in line with their vision. Leveraging a lightweight real-time architecture, LensCraft achieves markedly lower computational complexity and faster inference while maintaining high output quality. Extensive evaluation across static and dynamic scenarios reveals unprecedented accuracy and coherence, setting a new benchmark for intelligent camera systems compared to state-of-the-art models. Extended results, the complete dataset, simulation environment, trained model weights, and source code are publicly accessible on the LensCraft webpage. 2025-06-01T12:43:55Z Zahra Dehghanian Morteza Abolghasemi Hossein Azizinaghsh Amir Vahedi Hamid Beigy Hamid R. Rabiee http://arxiv.org/abs/2406.14567v3 DragPoser: Motion Reconstruction from Variable Sparse Tracking Signals via Latent Space Optimization 2025-06-01T12:02:53Z High-quality motion reconstruction that follows the user's movements can be achieved by high-end mocap systems with many sensors. However, obtaining such animation quality with fewer input devices is gaining popularity as it brings mocap closer to the general public. 
The main challenges include the loss of end-effector accuracy in learning-based approaches, or the lack of naturalness and smoothness in IK-based solutions. In addition, such systems are often finely tuned to a specific number of trackers and are highly sensitive to missing data, e.g., in scenarios where a sensor is occluded or malfunctions. In response to these challenges, we introduce DragPoser, a novel deep-learning-based motion reconstruction system that accurately represents hard and dynamic on-the-fly constraints, attaining high end-effector position accuracy in real time. This is achieved through a pose optimization process within a structured latent space. Our system requires only one-time training on a large human motion dataset, and then constraints can be dynamically defined as losses, while the pose is iteratively refined by computing the gradients of these losses within the latent space. To further enhance our approach, we incorporate a Temporal Predictor network, which employs a Transformer architecture to directly encode temporality within the latent space. This network ensures the pose optimization is confined to the manifold of valid poses and also leverages past pose data to predict temporally coherent poses. Results demonstrate that DragPoser surpasses both IK-based and the latest data-driven methods in achieving precise end-effector positioning, while it produces natural poses and temporally coherent motion. In addition, our system showcases robustness against on-the-fly constraint modifications, and exhibits exceptional adaptability to various input configurations and changes. 2024-04-29T15:00:50Z Published on Eurographics 2025. Project page: https://upc-virvig.github.io/DragPoser/ Jose Luis Ponton Eduard Pujol Andreas Aristidou Carlos Andujar Nuria Pelechano 10.1111/cgf.70026
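The DragPoser abstract describes defining constraints as losses and refining the pose by following the gradients of those losses within a latent space. The toy sketch below illustrates only that general idea and is not the authors' implementation: it assumes a hypothetical 2D linear decoder in place of the learned network, a single end-effector target as the only constraint, and plain gradient descent. All names (`decode`, `constraint_loss`, `refine_latent`) and all numeric values are invented for illustration.

```python
# Toy latent-space optimization: NOT the DragPoser model.
# Assumption: a fixed linear map stands in for a learned pose decoder.

def decode(z, W, b):
    """Hypothetical linear decoder: latent vector -> end-effector position."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]


def constraint_loss(z, W, b, target):
    """Squared distance between the decoded end effector and its target."""
    p = decode(z, W, b)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, target))


def refine_latent(z, W, b, target, lr=0.2, steps=200):
    """Iteratively refine z by gradient descent on the constraint loss."""
    z = list(z)
    for _ in range(steps):
        p = decode(z, W, b)
        residual = [pi - ti for pi, ti in zip(p, target)]
        # Gradient of ||W z + b - target||^2 with respect to z is 2 * W^T * residual.
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


# Illustrative values only.
W = [[1.0, 0.5], [0.0, 1.0]]
b = [0.0, 0.0]
target = [0.3, -0.2]

z_opt = refine_latent([0.0, 0.0], W, b, target)
final_loss = constraint_loss(z_opt, W, b, target)
```

Because the constraint is just a loss term, a constraint could be added or swapped by changing `constraint_loss` and its gradient without retraining the decoder, which is the property the abstract highlights. A learned nonlinear decoder would need automatic differentiation rather than the closed-form gradient used here.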