Feed: https://arxiv.org/api/2FmHmFioTfavZzShJg4rTeaAiXQ
Feed updated: 2026-06-28T00:52:11Z

http://arxiv.org/abs/2406.00443v2
Title: Generating 3D Terrain with 2D Cellular Automata
Updated: 2025-08-20T19:01:54Z
Published: 2024-06-01T13:43:28Z
Abstract: This paper explores the use of 2D cellular automata (CA) to generate 3D terrains through a simple additive approach. Experimenting with multiple CA transition rules produced aesthetically interesting, navigable landscapes, suggesting applicability for terrain generation in games.
Comment: The peer-reviewed version of this paper is published in IEEE Xplore at https://doi.org/10.1109/CoG64752.2025.11114361. This version is typeset by the author and differs only in pagination and typographical detail
Journal reference: 2025 IEEE Conference on Games (CoG), 1-4, IEEE, 2025
Authors: Nuno Fachada, António R. Rodrigues, Diogo de Andrade, Phil Lopes
DOI: 10.1109/CoG64752.2025.11114361

http://arxiv.org/abs/2506.04562v2
Title: Handle-based Mesh Deformation Guided By Vision Language Model
Updated: 2025-08-20T18:28:36Z
Published: 2025-06-05T02:29:42Z
Abstract: Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interpret and manipulate a handle-based interface through prompt engineering. We begin by applying cone singularity detection to identify a sparse set of potential handles. The VLM is then prompted to select both the deformable sub-parts of the mesh and the handles that best align with user instructions. Subsequently, we query the desired deformed positions of the selected handles in screen space. To reduce uncertainty inherent in VLM predictions, we aggregate the results from multiple camera views using a novel multi-view voting scheme. Across a suite of benchmarks, our method produces deformations that align more closely with user intent, as measured by CLIP and GPTEval3D scores, while introducing low distortion -- quantified via membrane energy. In summary, our approach is training-free, highly automated, and consistently delivers high-quality mesh deformations.
Comment: 19 pages
Authors: Xingpeng Sun, Shiyang Jia, Zherong Pan, Kui Wu, Aniket Bera

http://arxiv.org/abs/2508.14892v1
Title: Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
Updated: 2025-08-20T17:59:11Z
Published: 2025-08-20T17:59:11Z
Abstract: Reconstructing 3D human bodies from sparse views has been an appealing topic, which is crucial to broadening related applications. In this paper, we propose a quite challenging but valuable task to reconstruct the human body from only two images, i.e., the front and back view, which can largely lower the barrier for users to create their own 3D digital humans. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. We redesign a geometry reconstruction model based on foundation reconstruction models to predict consistent point clouds even when input images have scarce overlaps with extensive human data training. Furthermore, an enhancement algorithm is applied to supplement the missing color information, and then the complete human point clouds with colors can be obtained, which are directly transformed into 3D Gaussians for better rendering quality. Experiments show that our method can reconstruct the entire human in 190 ms on a single NVIDIA RTX 4090, with two images at a resolution of 1024x1024, demonstrating state-of-the-art performance on the THuman2.0 and cross-domain datasets. Additionally, our method can complete human reconstruction even with images captured by low-cost mobile devices, reducing the requirements for data collection. Demos and code are available at https://hustvl.github.io/Snap-Snap/.
Comment: Project page: https://hustvl.github.io/Snap-Snap/
Authors: Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, Xinggang Wang

http://arxiv.org/abs/2411.13536v3
Title: Identity Preserving 3D Head Stylization with Multiview Score Distillation
Updated: 2025-08-20T14:41:03Z
Published: 2024-11-20T18:37:58Z
Abstract: 3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit https://three-bee.github.io/head_stylization for more visuals.
Comment: https://three-bee.github.io/head_stylization
Authors: Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Furkan Guzelant, Aysegul Dundar

http://arxiv.org/abs/2502.11618v2
Title: Real-time Neural Rendering of LiDAR Point Clouds
Updated: 2025-08-20T08:45:38Z
Published: 2025-02-17T10:01:13Z
Abstract: Static LiDAR scanners produce accurate, dense, colored point clouds, but these often contain obtrusive artifacts which make them ill-suited for direct display. We propose an efficient method to render photorealistic images of such scans without any expensive preprocessing or training of a scene-specific model. A naive projection of the point cloud to the output view using 1x1 pixels is fast and retains the available detail, but also results in unintelligible renderings as background points leak in between the foreground pixels. The key insight is that these projections can be transformed into a realistic result using a deep convolutional model in the form of a U-Net, and a depth-based heuristic that prefilters the data. The U-Net also handles LiDAR-specific problems such as missing parts due to occlusion, color inconsistencies and varying point densities. We also describe a method to generate synthetic training data to deal with imperfectly-aligned ground truth images. Our method achieves real-time rendering rates using an off-the-shelf GPU and outperforms the state-of-the-art in both speed and quality.
Comment: Accepted at Eurographics 2025
Authors: Joni Vanherck, Brent Zoomers, Tom Mertens, Lode Jorissen, Nick Michiels
DOI: 10.2312/egs.20251041

http://arxiv.org/abs/2508.14411v1
Title: A Real-world Display Inverse Rendering Dataset
Updated: 2025-08-20T04:15:19Z
Published: 2025-08-20T04:15:19Z
Abstract: Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a diverse set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple, yet effective baseline for display inverse rendering, outperforming state-of-the-art inverse rendering methods. Code and dataset are available on our project page at https://michaelcsj.github.io/DIR/
Authors: Seokjun Choi, Hoon-Gyu Chung, Yujin Jeon, Giljoo Nam, Seung-Hwan Baek

http://arxiv.org/abs/2510.01187v1
Title: Manim for STEM Education: Visualizing Complex Problems Through Animation
Updated: 2025-08-20T03:12:55Z
Published: 2025-08-20T03:12:55Z
Abstract: Many STEM concepts pose significant learning challenges to students due to their inherent complexity and abstract nature. Visualizing complex problems through animations can significantly enhance learning outcomes. However, the creation of animations can be time-consuming and inconvenient. Hence, many educators illustrate complex concepts by hand on a board or a digital device. Although static graphics are helpful for understanding, they are less effective than animations. The free and open-source Python package Manim enables educators to create visually compelling animations easily. Python's straightforward syntax, combined with Manim's comprehensive set of built-in classes and methods, greatly simplifies implementation. This article presents a series of examples that demonstrate how Manim can be used to create animated video lessons for a variety of topics in computer science and mathematics. In addition, it analyzes viewer feedback collected across multiple social media platforms to evaluate the effectiveness and accessibility of these visualizations. The article further explores broader potentials of the Manim Python library by showcasing demonstrations that extend its applications to subject areas beyond computer science and mathematics.
Authors: Christina Zhang

http://arxiv.org/abs/2508.14933v1
Title: Inference Time Debiasing Concepts in Diffusion Models
Updated: 2025-08-19T20:21:02Z
Published: 2025-08-19T20:21:02Z
Abstract: We propose DeCoDi, a debiasing procedure for text-to-image diffusion-based models that changes the inference procedure, does not significantly change image quality, has negligible compute overhead, and can be applied in any diffusion-based image generation model. DeCoDi changes the diffusion process to avoid latent dimension regions of biased concepts. While most deep learning debiasing methods require complex or compute-intensive interventions, our method is designed to change only the inference procedure. Therefore, it is more accessible to a wide range of practitioners. We show the effectiveness of the method by debiasing for gender, ethnicity, and age for the concepts of nurse, firefighter, and CEO. Two distinct human evaluators manually inspect 1,200 generated images. Their evaluation results provide evidence that our method is effective in mitigating biases based on gender, ethnicity, and age. We also show that an automatic bias evaluation performed by the GPT4o is not significantly statistically distinct from a human evaluation. Our evaluation shows promising results, with reliable levels of agreement between evaluators and more coverage of protected attributes. Our method has the potential to significantly improve the diversity of images generated by diffusion-based text-to-image generative models.
Authors: Lucas S. Kupssinskü, Marco N. Bochernitsan, Jordan Kopper, Otávio Parraga, Rodrigo C. Barros

http://arxiv.org/abs/2410.03844v2
Title: Projected Walk on Spheres: A Monte Carlo Closest Point Method for Surface PDEs
Updated: 2025-08-19T19:43:01Z
Published: 2024-10-04T18:22:57Z
Abstract: We present projected walk on spheres (PWoS), a novel pointwise and discretization-free Monte Carlo solver for surface PDEs with Dirichlet boundaries, as a generalization of the walk on spheres method (WoS) [Muller 1956; Sawhney and Crane 2020]. We adapt the recursive relationship of WoS designed for PDEs in volumetric domains to a volumetric neighborhood around the surface, and at the end of each recursion step, we project the sample point on the sphere back to the surface. We motivate this simple modification to WoS with the theory of the closest point extension used in the closest point method. To define the valid volumetric neighborhood domain for PWoS, we develop strategies to estimate the local feature size of the surface and to compute the distance to the Dirichlet boundaries on the surface extended in their normal directions. We also design a mean value filtering method for PWoS to improve the method's efficiency when the surface is represented as a polygonal mesh or a point cloud. Finally, we study the convergence of PWoS and demonstrate its application to graphics tasks, including diffusion curves, geodesic distance computation, and wave propagation animation. We show that our method works with various types of surfaces, including a surface of mixed codimension.
Comment: Accepted to SIGGRAPH Asia 2024 (Conference Papers). See https://rsugimoto.net/ProjectedWalkOnSpheres/ for updates
Authors: Ryusuke Sugimoto, Nathan King, Toshiya Hachisuka, Christopher Batty
DOI: 10.1145/3680528.3687599

http://arxiv.org/abs/2507.06484v2
Title: 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
Updated: 2025-08-19T19:36:27Z
Published: 2025-07-09T02:00:17Z
Abstract: Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.
Comment: project website: https://ai.stanford.edu/~sunfanyun/3d-generalist/
Authors: Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, Nick Haber

http://arxiv.org/abs/2508.14931v1
Title: Pixels Under Pressure: Exploring Fine-Tuning Paradigms for Foundation Models in High-Resolution Medical Imaging
Updated: 2025-08-19T19:01:19Z
Published: 2025-08-19T19:01:19Z
Abstract: Advancements in diffusion-based foundation models have improved text-to-image generation, yet most efforts have been limited to low-resolution settings. As high-resolution image synthesis becomes increasingly essential for various applications, particularly in medical imaging domains, fine-tuning emerges as a crucial mechanism for adapting these powerful pre-trained models to task-specific requirements and data distributions. In this work, we present a systematic study, examining the impact of various fine-tuning techniques on image generation quality when scaling to high resolution 512x512 pixels. We benchmark a diverse set of fine-tuning methods, including full fine-tuning strategies and parameter-efficient fine-tuning (PEFT). We dissect how different fine-tuning methods influence key quality metrics, including Fréchet Inception Distance (FID), Vendi score, and prompt-image alignment. We also evaluate the utility of generated images in a downstream classification task under data-scarce conditions, demonstrating that specific fine-tuning strategies improve both generation fidelity and downstream performance when synthetic images are used for classifier training and evaluation on real images. Our code is accessible through the project website - https://tehraninasab.github.io/PixelUPressure/.
Authors: Zahra TehraniNasab, Amar Kumar, Tal Arbel

http://arxiv.org/abs/2508.14930v1
Title: Hybrelighter: Combining Deep Anisotropic Diffusion and Scene Reconstruction for On-device Real-time Relighting in Mixed Reality
Updated: 2025-08-19T18:52:30Z
Published: 2025-08-19T18:52:30Z
Abstract: Mixed Reality scene relighting, where virtual changes to lighting conditions realistically interact with physical objects, producing authentic illumination and shadows, can be used in a variety of applications. One such application in real estate could be visualizing a room at different times of day and placing virtual light fixtures. Existing deep learning-based relighting techniques typically exceed the real-time performance capabilities of current MR devices. On the other hand, scene understanding methods, such as on-device scene reconstruction, often yield inaccurate results due to scanning limitations, in turn affecting relighting quality. Finally, simpler 2D image filter-based approaches cannot represent complex geometry and shadows. We introduce a novel method to integrate image segmentation, with lighting propagation via anisotropic diffusion on top of basic scene understanding, and the computational simplicity of filter-based techniques. Our approach corrects on-device scanning inaccuracies, delivering visually appealing and accurate relighting effects in real-time on edge devices, achieving speeds as high as 100 fps. We show a direct comparison between our method and the industry standard, and present a practical demonstration of our method in the aforementioned real estate example.
Authors: Hanwen Zhao, John Akers, Baback Elmieh, Ira Kemelmacher-Shlizerman

http://arxiv.org/abs/2508.14187v1
Title: Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
Updated: 2025-08-19T18:21:59Z
Published: 2025-08-19T18:21:59Z
Abstract: Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.
Authors: Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh

http://arxiv.org/abs/2508.13808v1
Title: Is-NeRF: In-scattering Neural Radiance Field for Blurred Images
Updated: 2025-08-19T13:13:02Z
Published: 2025-08-19T13:13:02Z
Abstract: Neural Radiance Fields (NeRF) has gained significant attention for its prominent implicit 3D representation and realistic novel view synthesis capabilities. Available works unexceptionally employ straight-line volume rendering, which struggles to handle sophisticated lightpath scenarios and introduces geometric ambiguities during training, particularly evident when processing motion-blurred images. To address these challenges, this work proposes a novel deblur neural radiance field, Is-NeRF, featuring explicit lightpath modeling in real-world environments. By unifying six common light propagation phenomena through an in-scattering representation, we establish a new scattering-aware volume rendering pipeline adaptable to complex lightpaths. Additionally, we introduce an adaptive learning strategy that enables autonomous determining of scattering directions and sampling intervals to capture finer object details. The proposed network jointly optimizes NeRF parameters, scattering parameters, and camera motions to recover fine-grained scene representations from blurry images. Comprehensive evaluations demonstrate that it effectively handles complex real-world scenarios, outperforming state-of-the-art approaches in generating high-fidelity images with accurate geometric details.
Authors: Nan Luo, Chenglin Ye, Jiaxu Li, Gang Liu, Bo Wan, Di Wang, Lupeng Liu, Jun Xiao

http://arxiv.org/abs/2508.13797v1
Title: Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing
Updated: 2025-08-19T12:57:31Z
Published: 2025-08-19T12:57:31Z
Abstract: Recent video editing methods achieve attractive results in style transfer or appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly when dealing with significant viewpoint changes, such as large camera rotations or zooms. Key challenges include generating novel view content that remains consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based 3D-aware video editing method to enable detailed local manipulation of videos with significant viewpoint changes. To solve the challenge posed by sparse inputs, we employ image editing methods to generate edited results for the first frame, which are then propagated to the remaining frames of the video. We utilize sketching as an interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we perform a detailed analysis and manipulation of the 3D information in the video. Specifically, we utilize a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components, aligning them effectively with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing. Homepage and code: http://geometrylearning.com/Sketch3DVE/
Comment: SIGGRAPH 2025
Authors: Feng-Lin Liu, Shi-Yang Li, Yan-Pei Cao, Hongbo Fu, Lin Gao