http://arxiv.org/abs/2406.04008v3 Image-Space Collage and Packing with Differentiable Rendering 2025-05-26T10:01:18Z

Collage and packing techniques are widely used to organize geometric shapes into cohesive visual representations, facilitating the representation of visual features holistically, as seen in image collages and word clouds. Traditional methods often rely on object-space optimization, requiring intricate geometric descriptors and energy functions to handle complex shapes. In this paper, we introduce a versatile image-space collage technique. Leveraging a differentiable renderer, our method effectively optimizes the object layout with image-space losses, bringing the benefit of fixed complexity and easy accommodation of various shapes. Applying a hierarchical resolution strategy in image space, our method optimizes the collage efficiently and converges quickly, taking large coarse steps first and small precise steps afterwards. The diverse visual expressiveness of our approach is demonstrated through various examples. Experimental results show that our method achieves an order-of-magnitude speedup compared to state-of-the-art techniques. The project page is https://szuviz.github.io/pixel-space-collage-technique/.
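
The coarse-to-fine idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a 1D canvas, two boxes of a hypothetical width `WIDTH`, a sigmoid-blurred rasterization, and finite differences standing in for a differentiable renderer. The loss is measured per pixel, and the `SCHEDULE` (an assumption) moves from low resolution with large steps to high resolution with small steps.

```python
import math

WIDTH = 0.2  # hypothetical box width; the canvas is the interval [0, 1]


def coverage(pixel, center, softness):
    """Soft coverage of a 1D box at one pixel centre (sigmoid-blurred edge)."""
    z = (WIDTH / 2 - abs(pixel - center)) / softness
    return 1.0 / (1.0 + math.exp(-z))


def overlap_loss(a, b, resolution):
    """Image-space loss: mean per-pixel product of the two coverages."""
    softness = 1.0 / resolution  # edges blur by about one pixel
    total = 0.0
    for i in range(resolution):
        p = (i + 0.5) / resolution
        total += coverage(p, a, softness) * coverage(p, b, softness)
    return total / resolution


def grad(f, x, eps=1e-4):
    """Central finite difference, standing in for a differentiable renderer."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def clamp(x):
    """Keep a box centre far enough from the border to stay on the canvas."""
    return min(max(x, WIDTH / 2), 1 - WIDTH / 2)


def optimize(a, b, schedule):
    """Gradient descent on the overlap loss, one resolution level at a time."""
    for resolution, step_size, iterations in schedule:
        for _ in range(iterations):
            ga = grad(lambda x: overlap_loss(x, b, resolution), a)
            gb = grad(lambda x: overlap_loss(a, x, resolution), b)
            a, b = clamp(a - step_size * ga), clamp(b - step_size * gb)
    return a, b


# (resolution, step size, iterations): coarse and large first, fine and small last
SCHEDULE = [(8, 0.5, 30), (32, 0.1, 30), (128, 0.02, 30)]
a, b = optimize(0.45, 0.55, SCHEDULE)
print(f"centres: {a:.3f}, {b:.3f}; overlap: {overlap_loss(a, b, 128):.5f}")
```

At low resolution the blurred edges give the two overlapping boxes a wide-reaching gradient, so they separate in a few large steps; the finer levels only need small corrections. The cost per iteration depends on the pixel count rather than on shape complexity, which is the "fixed complexity" property the abstract mentions.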

2024-06-06T12:33:23Z Zhenyu Wang Min Lu http://arxiv.org/abs/2505.19672v1 A Fluorescent Material Model for Non-Spectral Editing & Rendering 2025-05-26T08:28:24Z

Fluorescent materials are characterized by a spectral reradiation toward longer wavelengths. Recent work [Fichet et al. 2024] has shown that the rendering of fluorescence in a non-spectral engine is possible through the use of appropriate reduced reradiation matrices. But the approach has limited expressivity, as it requires the storage of one reduced matrix per fluorescent material, and only works with measured fluorescent assets. In this work, we introduce an analytical approach to the editing and rendering of fluorescence in a non-spectral engine. It is based on a decomposition of the reduced reradiation matrix, and an analytically-integrable Gaussian-based model of the fluorescent component. The model reproduces the appearance of fluorescent materials accurately, especially with the addition of a UV basis. Most importantly, it grants variations of fluorescent material parameters in real-time, either for the editing of fluorescent materials, or for the dynamic spatial variation of fluorescence properties across object surfaces. A simplified one-Gaussian fluorescence model even allows for the artist-friendly creation of plausible fluorescent materials from scratch, requiring only a few reflectance colors as input.
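
As a worked illustration of why a Gaussian-based component is "analytically integrable", consider a single Gaussian lobe over wavelength. The symbols below (amplitude $A$, centre $\mu$, width $\sigma$, integration bounds $\lambda_1, \lambda_2$) are assumptions for this sketch and are not taken from the paper, whose parameterization may differ.

```latex
\int_{\lambda_1}^{\lambda_2} A \exp\!\left(-\frac{(\lambda-\mu)^2}{2\sigma^2}\right) \mathrm{d}\lambda
  = A\,\sigma\sqrt{\frac{\pi}{2}}
    \left[\operatorname{erf}\!\left(\frac{\lambda_2-\mu}{\sqrt{2}\,\sigma}\right)
        - \operatorname{erf}\!\left(\frac{\lambda_1-\mu}{\sqrt{2}\,\sigma}\right)\right]
```

Because the integral has a closed form in terms of the error function, changing $A$, $\mu$, or $\sigma$ only requires re-evaluating this expression rather than re-integrating a measured spectrum, which is consistent with the real-time parameter variation described above.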

2025-05-26T08:28:24Z SIGGRAPH Conference Papers 2025, August 10-14, Vancouver, BC, Canada Belcour Laurent Fichet Alban Barla Pascal 10.1145/3721238.3730721 http://arxiv.org/abs/2502.03502v2 DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior 2025-05-26T07:44:33Z

Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with their tile-based approach, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach to produce spatially and temporally consistent VSR results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.

2025-02-05T10:15:00Z Equal contributions from first two authors In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers (SIGGRAPH Conference Papers 2025) Janghyeok Han Gyujin Sim Geonung Kim Hyun-seung Lee Kyuha Choi Youngseok Han Sunghyun Cho 10.1145/3721238.3730719 http://arxiv.org/abs/2505.19306v1 From Single Images to Motion Policies via Video-Generation Environment Representations 2025-05-25T20:30:25Z

Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.

2025-05-25T20:30:25Z Weiming Zhi Ziyong Ma Tianyi Zhang Matthew Johnson-Roberson http://arxiv.org/abs/2502.06814v2 Diffusion Instruction Tuning 2025-05-25T15:41:21Z

We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.

2025-02-04T22:20:20Z Project page at https://astrazeneca.github.io/vlm/ Chen Jin Ryutaro Tanno Amrutha Saseendran Tom Diethe Philip Teare http://arxiv.org/abs/2505.19151v1 SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation 2025-05-25T13:58:52Z

Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over 3$\times$ speedup for Wan with nearly no quality loss on VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
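
The division of labour between the two models can be sketched as a denoising loop that hands over from one callable to another. This is a minimal sketch under assumptions: `cooperative_denoise`, the fixed `switch_index`, and the toy "models" are hypothetical, and the abstract does not say how the actual switch point is chosen.

```python
def cooperative_denoise(latent, timesteps, large_model, small_model, switch_index):
    """Run the large model on the first (high-noise) steps, the small one after."""
    for i, t in enumerate(timesteps):
        denoise = large_model if i < switch_index else small_model
        latent = denoise(latent, t)
    return latent


# Toy stand-ins: each "model" just records which timesteps it was asked to handle.
calls = {"large": [], "small": []}


def toy_large(latent, t):
    calls["large"].append(t)
    return latent * 0.5  # placeholder update, not a real denoiser


def toy_small(latent, t):
    calls["small"].append(t)
    return latent * 0.9  # placeholder update, not a real denoiser


timesteps = list(range(9, -1, -1))  # 9 (noisiest) down to 0 (cleanest)
result = cooperative_denoise(1.0, timesteps, toy_large, toy_small, switch_index=4)
print(calls)
```

In this toy run the large model handles the four noisiest steps and the small model handles the remaining six, so most of the per-step cost is paid at the small model's rate.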

2025-05-25T13:58:52Z 9 pages, 6 figures Shenggan Cheng Yuanxin Wei Lansong Diao Yong Liu Bujiao Chen Lianghua Huang Yu Liu Wenyuan Yu Jiangsu Du Wei Lin Yang You http://arxiv.org/abs/2502.15989v2 Mean-Shift Distillation for Diffusion Mode Seeking 2025-05-25T13:54:37Z

We present mean-shift distillation, a novel diffusion distillation technique that provides a provably good proxy for the gradient of the diffusion output distribution. This is derived directly from mean-shift mode seeking on the distribution, and we show that its extrema are aligned with the modes. We further derive an efficient product distribution sampling procedure to evaluate the gradient. Our method is formulated as a drop-in replacement for score distillation sampling (SDS), requiring neither model retraining nor extensive modification of the sampling procedure. We show that it exhibits superior mode alignment as well as improved convergence in both synthetic and practical setups, yielding higher-fidelity results when applied to both text-to-image and text-to-3D applications with Stable Diffusion.
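
For readers unfamiliar with mean-shift mode seeking, the classic procedure on a finite sample set is sketched below. This is the textbook algorithm with a Gaussian kernel on 1D samples, not the paper's distillation method, which applies the idea to a diffusion model's output distribution; the sample values and bandwidth are made up for illustration.

```python
import math


def mean_shift_step(x, samples, bandwidth):
    """Move x to the Gaussian-kernel-weighted mean of the samples."""
    weights = [math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2)) for s in samples]
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)


def seek_mode(x, samples, bandwidth, iterations=50):
    """Iterate the mean-shift update from a starting point x."""
    for _ in range(iterations):
        x = mean_shift_step(x, samples, bandwidth)
    return x


# Two illustrative clusters, centred at 0 and 5.
samples = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
left_mode = seek_mode(0.6, samples, bandwidth=0.5)
right_mode = seek_mode(4.4, samples, bandwidth=0.5)
print(left_mode, right_mode)
```

Each update moves the query point toward the nearby density peak, so the two starting points settle on the centres of their respective clusters without any explicit gradient computation.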

2025-02-21T22:58:56Z 15 pages, 9 figures Vikas Thamizharasan Nikitas Chatzis Iliyan Georgiev Matthew Fisher Evangelos Kalogerakis Difan Liu Nanxuan Zhao Michal Lukac http://arxiv.org/abs/2501.18672v6 Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting 2025-05-25T09:14:32Z

Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically utilize generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail when addressing geometric changes, such as editing a character's head to turn around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. It enables users to conveniently specify the desired editing region and the desired dragging direction through the input of 3D masks and pairs of control points, thereby enabling precise control over the extent of editing. DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at https://quyans.github.io/Drag-Your-Gaussian.

2025-01-30T18:51:54Z Visit our project page at https://quyans.github.io/Drag-Your-Gaussian Yansong Qu Dian Chen Xinyang Li Xiaofan Li Shengchuan Zhang Liujuan Cao Rongrong Ji http://arxiv.org/abs/2503.00807v2 GenAnalysis: Joint Shape Analysis by Learning Man-Made Shape Generators with Deformation Regularizations 2025-05-25T06:19:58Z

We present GenAnalysis, an implicit shape generation framework that allows joint analysis of man-made shapes, including shape matching and joint shape segmentation. The key idea is to enforce an as-affine-as-possible (AAAP) deformation between synthetic shapes of the implicit generator that are close to each other in the latent space, which we achieve by designing a regularization loss. It allows us to understand the shape variation of each shape in the context of neighboring shapes and also offers structure-preserving interpolations between the input shapes. We show how to extract these shape variations by recovering piecewise affine vector fields in the tangent space of each shape. These vector fields provide single-shape segmentation cues. We then derive shape correspondences by iteratively propagating AAAP deformations across a sequence of intermediate shapes. These correspondences are then used to aggregate single-shape segmentation cues into consistent segmentations. We conduct experiments on the ShapeNet dataset to show superior performance in shape matching and joint shape segmentation over previous methods.

2025-03-02T09:17:08Z 19 pages, 24 figures Yuezhi Yang Haitao Yang Kiyohiro Nakayama Xiangru Huang Leonidas Guibas Qixing Huang http://arxiv.org/abs/2410.16865v4 Polycubes via Dual Loops 2025-05-24T19:55:32Z

In this paper we study polycubes: orthogonal polyhedra with axis-aligned quadrilateral faces. We present a complete characterization of polycubes of any genus based on their dual structure: a collection of oriented loops which run in each of the axis directions and capture polycubes via their intersection patterns. A polycube loop structure uniquely corresponds to a polycube. We also describe all combinatorially different ways to add a loop to a loop structure while maintaining its validity. Similarly, we show how to identify loops that can be removed from a polycube loop structure without invalidating it. Our characterization gives rise to an iterative algorithm to construct provably valid polycube maps for a given input surface.

2024-10-22T10:07:42Z Proceedings of the 2025 SIAM International Meshing Roundtable (IMR) Maxim Snoep Bettina Speckmann Kevin Verbeek 10.1137/1.9781611978575.7 http://arxiv.org/abs/2505.18772v1 CageNet: A Meta-Framework for Learning on Wild Meshes 2025-05-24T16:22:58Z

Learning on triangle meshes has recently proven to be instrumental to a myriad of tasks, from shape classification, to segmentation, to deformation and animation, to mention just a few. While some of these applications are tackled through neural network architectures which are tailored to the application at hand, many others use generic frameworks for triangle meshes where the only customization required is the modification of the input features and the loss function. Our goal in this paper is to broaden the applicability of these generic frameworks to "wild" meshes, i.e. meshes in-the-wild, which often have multiple components, non-manifold elements, disrupted connectivity, or a combination of these. We propose a configurable meta-framework based on the concept of caged geometry: Given a mesh, a cage is a single-component manifold triangle mesh that envelops it closely. Generalized barycentric coordinates map between functions on the cage and functions on the mesh, allowing us to learn and test on a variety of data, in different applications. We demonstrate this concept by learning segmentation and skinning weights on difficult data, achieving better performance than state-of-the-art techniques on wild meshes.
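
The mapping from cage functions to mesh functions can be illustrated in the simplest possible setting. The sketch below assumes a 2D "cage" made of a single triangle and uses ordinary triangle barycentric coordinates, which are a special case of generalized barycentric coordinates; the paper's cages are closed 3D triangle meshes, and the function values here are invented.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def cage_to_point(weights, cage_values):
    """Transfer a per-cage-vertex function to an enclosed point."""
    return sum(w * f for w, f in zip(weights, cage_values))


cage = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy cage vertices
cage_values = [10.0, 20.0, 40.0]              # a function defined on the cage
mesh_point = (0.25, 0.25)                     # a point of the enclosed "wild" mesh

weights = barycentric_weights(mesh_point, *cage)
value = cage_to_point(weights, cage_values)
print(weights, value)
```

Because the weights depend only on where the point sits relative to the cage, the enclosed geometry can have several components or broken connectivity without affecting the transfer, while learning itself operates on the clean cage.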

2025-05-24T16:22:58Z 11 pages, 13 figures (excluding supplementary material) SIGGRAPH Conference Papers 2025 Michal Edelstein Hsueh-Ti Derek Liu Mirela Ben-Chen 10.1145/3721238.3730654 http://arxiv.org/abs/2504.19718v3 Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation 2025-05-24T10:20:30Z

Face registration deforms a template mesh to closely fit a 3D face scan, the quality of which commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface. Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. However, existing image-based (2D) and scan-based (3D) segmentation methods perform poorly. Image-based segmentation outputs multi-view inconsistent masks and cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution compared to images. In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. For this, our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. These lifted 2D features are then fused with 3D geometric features extracted from the scan mesh to predict a segmentation mask directly on the scan mesh. We show that our segmentations improve the registration accuracy over pure 2D or 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.

2025-04-28T12:13:12Z 4 pages, 4 figures, published in Eurographics 2025 as a short paper Victoria Yue Chen Daoye Wang Stephan Garbin Jan Bednarik Sebastian Winberg Timo Bolkart Thabo Beeler http://arxiv.org/abs/2408.00458v2 Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion 2025-05-23T19:07:14Z

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context. Project website: https://mkansy.github.io/reenact-anything/

2024-08-01T10:55:20Z Added more evaluation and analyses since first version. Accepted to SIGGRAPH 2025 (Conference Track). Project page: https://mkansy.github.io/reenact-anything/ Manuel Kansy Jacek Naruniec Christopher Schroers Markus Gross Romann M. Weber 10.1145/3721238.3730668 http://arxiv.org/abs/2505.18075v1 Beyond flat-panel displays, applications of stereographic and holographic devices in 3D microscopy data analysis 2025-05-23T16:27:24Z

Laser scanning microscopy enables the acquisition of 3D data in biomedical research. A fundamental challenge in visualizing 3D data is that common flat-panel displays, being 2D in nature, cannot faithfully reproduce light fields. Recent years have witnessed the development of various 3D display technologies. These technologies generally fall into two categories, stereography and holography, depending on the number of perspectives they can simultaneously present. We have integrated support for many commercially available 3D-capable displays into FluoRender, a visualization and analysis system for fluorescence microscopy data. This study investigates the opportunities and challenges of applying various 3D display devices in biological research, focusing on their practical use and potential for broad adoption. We found that 3D display devices, including the HoloLens and the Looking Glass, each have their merits and shortcomings. We predict that the convergence of stereographic and holographic technologies will create powerful tools for visualization and analysis in biological applications.

2025-05-23T16:27:24Z Yong Wan Holly A. Holman Charles Hansen http://arxiv.org/abs/2505.17402v1 From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation 2025-05-23T02:35:46Z

High-fidelity 3D reconstruction is critical for aerial inspection tasks such as infrastructure monitoring, structural assessment, and environmental surveying. While traditional photogrammetry techniques enable geometric modeling, they lack semantic interpretability, limiting their effectiveness for automated inspection workflows. Recent advances in neural rendering and 3D Gaussian Splatting (3DGS) offer efficient, photorealistic reconstructions but similarly lack scene-level understanding. In this work, we present a UAV-based pipeline that extends Feature-3DGS for language-guided 3D segmentation. We leverage LSeg-based feature fields with CLIP embeddings to generate heatmaps in response to language prompts. These are thresholded to produce rough segmentations, and the highest-scoring point is then used as a prompt to SAM or SAM2 for refined 2D segmentation on novel view renderings. Our results highlight the strengths and limitations of various feature field backbones (CLIP-LSeg, SAM, SAM2) in capturing meaningful structure in large-scale outdoor environments. We demonstrate that this hybrid approach enables flexible, language-driven interaction with photorealistic 3D reconstructions, opening new possibilities for semantic aerial inspection and scene understanding.
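
The heatmap, threshold, and point-prompt steps described above can be sketched on a tiny feature map. This is a minimal illustration under assumptions: the 2x2 feature map, the 2D "embeddings", cosine similarity as the scoring function, and the threshold value are all made up, and the final hand-off of the point to SAM or SAM2 is deliberately left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_heatmap(feature_map, text_embedding):
    """Score every pixel feature against the text embedding."""
    return [[cosine(f, text_embedding) for f in row] for row in feature_map]


def rough_mask_and_prompt(heatmap, threshold):
    """Threshold the heatmap and return the highest-scoring pixel as (row, col)."""
    mask = [[v >= threshold for v in row] for row in heatmap]
    best = max((v, r, c) for r, row in enumerate(heatmap) for c, v in enumerate(row))
    return mask, (best[1], best[2])


# Toy rendered feature map (2x2 pixels, 2D features) and a toy text embedding.
feature_map = [[(1.0, 0.0), (0.0, 1.0)],
               [(1.0, 1.0), (-1.0, 0.0)]]
text_embedding = (1.0, 0.0)

heatmap = language_heatmap(feature_map, text_embedding)
mask, point_prompt = rough_mask_and_prompt(heatmap, threshold=0.5)
print(mask, point_prompt)
```

The thresholded mask is only a rough segmentation; in the pipeline described above, the returned point would then serve as the prompt for a dedicated segmentation model that refines the result on the rendered view.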

2025-05-23T02:35:46Z Mahmoud Chick Zaouali Todd Charter Homayoun Najjaran