https://arxiv.org/api/LhYzlBDFdMgv1zPIdBFsSXIqnlo 2026-07-01T10:02:32Z 9421 2115 15 http://arxiv.org/abs/2506.06440v1 Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation 2025-06-06T18:00:46Z Faithfully reconstructing textured shapes and physical properties from videos presents an intriguing yet challenging problem. Significant efforts have been dedicated to advancing such a system identification problem in this area. Previous methods often rely on heavy optimization pipelines with a differentiable simulator and renderer to estimate physical parameters. However, these approaches frequently necessitate extensive hyperparameter tuning for each scene and involve a costly optimization process, which limits both their practicality and generalizability. In this work, we propose a novel framework, Vid2Sim, a generalizable video-based approach for recovering geometry and physical properties through a mesh-free reduced simulation based on Linear Blend Skinning (LBS), offering high computational efficiency and versatile representation capability. Specifically, Vid2Sim first reconstructs the observed configuration of the physical system from video using a feed-forward neural network trained to capture physical world knowledge. A lightweight optimization pipeline then refines the estimated appearance, geometry, and physical properties to closely align with video observations within just a few minutes. Additionally, after the reconstruction, Vid2Sim enables high-quality, mesh-free simulation with high efficiency. Extensive experiments demonstrate that our method achieves superior accuracy and efficiency in reconstructing geometry and physical properties from video data. 
2025-06-06T18:00:46Z Accepted by CVPR 2025 Chuhao Chen Zhiyang Dou Chen Wang Yiming Huang Anjun Chen Qiao Feng Jiatao Gu Lingjie Liu http://arxiv.org/abs/2506.06190v1 NAT: Neural Acoustic Transfer for Interactive Scenes in Real Time 2025-06-06T15:52:06Z Previous acoustic transfer methods rely on extensive precomputation and storage of data to enable real-time interaction and auditory feedback. However, these methods struggle with complex scenes, especially when dynamic changes in object position, material, and size significantly alter sound effects. These continuous variations lead to fluctuating acoustic transfer distributions, making it challenging to represent with basic data structures and render efficiently in real time. To address this challenge, we present Neural Acoustic Transfer, a novel approach that utilizes an implicit neural representation to encode precomputed acoustic transfer and its variations, allowing for real-time prediction of sound fields under varying conditions. To efficiently generate the training data required for the neural acoustic field, we developed a fast Monte-Carlo-based boundary element method (BEM) approximation for general scenarios with smooth Neumann conditions. Additionally, we implemented a GPU-accelerated version of standard BEM for scenarios requiring higher precision. These methods provide the necessary training data, enabling our neural network to accurately model the sound radiation space. We demonstrate our method's numerical accuracy and runtime efficiency (within several milliseconds for 30s audio) through comprehensive validation and comparisons in diverse acoustic transfer scenarios. Our approach allows for efficient and accurate modeling of sound behavior in dynamically changing environments, which can benefit a wide range of interactive applications such as virtual reality, augmented reality, and advanced audio production. 
2025-06-06T15:52:06Z Xutong Jin Bo Pang Chenxi Xu Xinyun Hou Guoping Wang Sheng Li http://arxiv.org/abs/2502.06805v3 Efficient Diffusion Models: A Survey 2025-06-06T14:57:09Z Diffusion models have emerged as powerful generative models capable of producing high-quality content such as images, videos, and audio, demonstrating their potential to revolutionize digital content creation. However, these capabilities come at the cost of significant computational resources and lengthy generation times, underscoring the critical need to develop efficient techniques for practical deployment. In this survey, we provide a systematic and comprehensive review of research on efficient diffusion models. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient diffusion model topics from the algorithm-level, system-level, and framework perspectives, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-Diffusion-Model-Survey. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient diffusion model research and inspire them to contribute to this important and exciting field. 2025-02-03T10:15:08Z Published in Transactions on Machine Learning Research (TMLR-2025) Hui Shen Jingxuan Zhang Boning Xiong Rui Hu Shoufa Chen Zhongwei Wan Xin Wang Yu Zhang Zixuan Gong Guangyin Bao Chaofan Tao Yongfeng Huang Ye Yuan Mi Zhang http://arxiv.org/abs/2506.06040v1 Hardware Accelerated Neural Block Texture Compression with Cooperative Vectors 2025-06-06T12:44:33Z In this work, we present an extension to the neural texture compression method of Weinreich and colleagues [2024].
Like them, we leverage existing block compression methods, which permit the use of hardware texture filtering, to store a neural representation of physically-based rendering (PBR) texture sets (including albedo, normal maps, roughness, etc.). However, we show that low dynamic range block compression formats still make the solution viable. Thanks to this, we show that we can achieve a higher compression ratio or higher quality at a fixed compression ratio. We improve performance at runtime using a tile-based rendering architecture that leverages the hardware matrix multiplication engine. Thanks to all this, we render 4K texture sets (9 channels per asset) with anisotropic filtering at 1080p using only 28MB of VRAM per texture set at 0.55ms on an Intel B580. 2025-06-06T12:44:33Z Proceedings of High-Performance Graphics (HPG) 2025 Belcour Laurent Benyoub Anis http://arxiv.org/abs/2503.23441v2 Spatially-Embedded Lens Visualization: A Design Space 2025-06-06T12:30:19Z Lens visualization has been a prominent research area in the visualization community, fueled by the continuous need to mitigate visual clutter and occlusion resulting from the increase in data volume. Interactive lenses for spatial data, particularly, challenge designers to conceive design strategies to support the analysis of high-density, multifaceted data with spatial referents. Despite their relevance, there is a lack of systematic understanding regarding the various design elements that compose spatially-embedded lens visualizations. To fill in this gap, we unify these components under a common hood in the form of a design space, which we propose in this paper. Building our knowledge on top of the initial insights gained from Tominski et al.'s survey [57], we construct a design space spanning 7 dimensions through our analysis of 45 papers published in the visualization community over the past 15 years.
We describe each design dimension through representative examples and examine the range of design choices available within each, discussing their benefits and pitfalls that affect lens performance and usability. In doing so, we offer a cohesive catalog of considerations for designers, both when examining existing lenses and when conceptualizing novel spatially-embedded lens visualizations. We conclude by shedding light on regions of the design space that remain largely understudied, revealing open opportunities for future research. 2025-03-30T13:41:47Z Roberta Mota Ehud Sharlin Usman Alim http://arxiv.org/abs/2305.16800v2 Joint Optimization of Triangle Mesh, Material, and Light from Neural Fields with Neural Radiance Cache 2025-06-06T06:35:52Z Traditional inverse rendering techniques are based on textured meshes, which naturally adapt to modern graphics pipelines, but costly differentiable multi-bounce Monte Carlo (MC) ray tracing poses challenges for modeling global illumination. Recently, neural fields have demonstrated impressive reconstruction quality but fall short in modeling indirect illumination. In this paper, we introduce a simple yet efficient inverse rendering framework that combines the strengths of both methods. Specifically, given a pre-trained neural field representing the scene, we can obtain an initial estimate of the signed distance field (SDF) and create a Neural Radiance Cache (NRC), an enhancement over the traditional radiance cache used in real-time rendering. By using the former to initialize differentiable marching tetrahedrons (DMTet) and the latter to model indirect illumination, we can compute the global illumination via single-bounce differentiable MC ray tracing and jointly optimize the geometry, material, and light through backpropagation.
Experiments demonstrate that, compared to previous methods, our approach effectively prevents indirect illumination effects from being baked into materials, thus obtaining high-quality reconstructions of the triangle mesh, Physically-Based Rendering (PBR) materials, and High Dynamic Range (HDR) light probe. 2023-05-26T10:29:25Z Jiakai Sun Weijing Zhang Zhanjie Zhang Tianyi Chu Guangyuan Li Lei Zhao Wei Xing http://arxiv.org/abs/2501.00625v3 Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting 2025-06-05T23:59:49Z Recently released open-source pre-trained foundational image segmentation and object detection models (SAM2+GroundingDINO) allow for geometrically consistent segmentation of objects of interest in multi-view 2D images. Users can use text-based or click-based prompts to segment objects of interest without requiring labeled training datasets. Gaussian Splatting allows for the learning of the 3D representation of a scene's geometry and radiance based on 2D images. Combining Google Earth Studio, SAM2+GroundingDINO, 2D Gaussian Splatting, and our improvements in mask refinement based on morphological operations and contour simplification, we created a pipeline to extract the 3D mesh of any building based on its name, address, or geographic coordinates. 2024-12-31T19:53:27Z Remote Sensing Applications: Society and Environment 40 2025 101807 Kyle Gao Liangzhi Li Hongjie He Dening Lu Linlin Xu Jonathan Li 10.1016/j.rsase.2025.101807 http://arxiv.org/abs/2412.05278v2 Birth and Death of a Rose 2025-06-05T20:22:36Z We study the problem of generating temporal object intrinsics -- temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose -- from pre-trained 2D foundation models.
Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pre-trained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan. Project website: https://chen-geng.com/rose4d 2024-12-06T18:59:52Z CVPR 2025 Oral. Project website: https://chen-geng.com/rose4d Chen Geng Yunzhi Zhang Shangzhe Wu Jiajun Wu http://arxiv.org/abs/2503.19136v2 Stochastic Poisson Surface Reconstruction with One Solve using Geometric Gaussian Processes 2025-06-05T16:54:25Z Poisson Surface Reconstruction is a widely-used algorithm for reconstructing a surface from an oriented point cloud. To facilitate applications where only partial surface information is available, or scanning is performed sequentially, a recent line of work proposes to incorporate uncertainty into the reconstructed surface via Gaussian process models. The resulting algorithms first perform Gaussian process interpolation, then solve a set of volumetric partial differential equations globally in space, resulting in a computationally expensive two-stage procedure. In this work, we apply recently-developed techniques from geometric Gaussian processes to combine interpolation and surface reconstruction into a single stage, requiring only one linear solve per sample. The resulting reconstructed surface samples can be queried locally in space, without the use of problem-dependent volumetric meshes or grids. 
These capabilities enable one to (a) perform probabilistic collision detection locally around the region of interest, (b) perform ray casting without evaluating points not on the ray's trajectory, and (c) perform next-view planning on a per-ray basis. They also do not require one to approximate kernel matrix inverses with diagonal matrices as part of intermediate computations, unlike prior methods. Results show that our approach provides a cleaner, more-principled, and more-flexible stochastic surface reconstruction pipeline. 2025-03-24T20:47:51Z International Conference on Machine Learning, 2025 Sidhanth Holalkere David S. Bindel Silvia Sellán Alexander Terenin http://arxiv.org/abs/2506.05449v1 AI-powered Contextual 3D Environment Generation: A Systematic Review 2025-06-05T15:56:28Z The generation of high-quality 3D environments is crucial for industries such as gaming, virtual reality, and cinema, yet remains resource-intensive due to the reliance on manual processes. This study performs a systematic review of existing generative AI techniques for 3D scene generation, analyzing their characteristics, strengths, limitations, and potential for improvement. By examining state-of-the-art approaches, it presents key challenges such as scene authenticity and the influence of textual inputs. Special attention is given to how AI can blend different stylistic domains while maintaining coherence, the impact of training data on output quality, and the limitations of current models. In addition, this review surveys existing evaluation metrics for assessing realism and explores how industry professionals incorporate AI into their workflows. The findings of this study aim to provide a comprehensive understanding of the current landscape and serve as a foundation for future research on AI-driven 3D content generation.
Key findings include that advanced generative architectures enable high-quality 3D content creation at a high computational cost, effective multi-modal integration techniques like cross-attention and latent space alignment facilitate text-to-3D tasks, and the quality and diversity of training data combined with comprehensive evaluation metrics are critical to achieving scalable, robust 3D scene generation. 2025-06-05T15:56:28Z Miguel Silva Alexandre Valle de Carvalho http://arxiv.org/abs/2408.11721v2 Detection-Driven Object Count Optimization for Text-to-Image Diffusion Models 2025-06-05T15:25:55Z Accurately controlling object count in text-to-image generation remains a key challenge. Supervised methods often fail, as training data rarely covers all count variations. Methods that manipulate the denoising process to add or remove objects can help; however, they still require labeled data, limit robustness and image quality, and rely on a slow, iterative process. Pre-trained differentiable counting models that rely on soft object density summation exist and could steer generation, but employing them presents three main challenges: (i) they are pre-trained on clean images, making them less effective during denoising steps that operate on noisy inputs; (ii) they are not robust to viewpoint changes; and (iii) optimization is computationally expensive, requiring repeated model evaluations per image. We propose a new framework that uses pre-trained object counting techniques and object detectors to guide generation. First, we optimize a counting token using an outer-loop loss computed on fully generated images. Second, we introduce a detection-driven scaling term that corrects errors caused by viewpoint and proportion shifts, among other factors, without requiring backpropagation through the detection model. Third, we show that the optimized parameters can be reused for new prompts, removing the need for repeated optimization. 
Our method provides efficiency through token reuse, flexibility via compatibility with various detectors, and accuracy with improved counting across diverse object categories. 2024-08-21T15:51:46Z Pre-print Oz Zafar Yuval Cohen Lior Wolf Idan Schwartz http://arxiv.org/abs/2502.17327v2 AnyTop: Character Animation Diffusion with Any Topology 2025-06-05T15:23:33Z Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model's latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation and motion editing. Our webpage, https://anytop2025.github.io/Anytop-page, includes links to videos and code. 2025-02-24T17:00:36Z SIGGRAPH 2025. Video: https://www.youtube.com/watch?v=NWOdkM5hAbE, Project page: https://anytop2025.github.io/Anytop-page, Code: https://github.com/Anytop2025/Anytop Inbar Gat Sigal Raab Guy Tevet Yuval Reshef Amit H. 
Bermano Daniel Cohen-Or http://arxiv.org/abs/2406.03965v2 More Bang For Your Buck(et): Fast and Space-efficient Hardware-accelerated Coarse-granular Indexing on GPUs 2025-06-05T14:04:28Z In recent work, we have shown that NVIDIA's raytracing cores on RTX video cards can be exploited to realize hardware-accelerated lookups for GPU-resident database indexes. On a high level, the concept materializes all keys as triangles in a 3D scene and indexes them. Lookups are performed by firing rays into the scene and utilizing the index structure to detect hits in a hardware-accelerated fashion. While this approach called RTIndeX (or short RX) is indeed promising, it currently suffers from three limitations: (1) significant memory overhead per key, (2) slow range-lookups, and (3) poor updateability. In this work, we show that all three problems can be tackled by a single design change: Generalizing RX to become a coarse-granular index cgRX. Instead of indexing individual keys, cgRX indexes buckets of keys which are post-filtered after retrieval. This drastically reduces the memory overhead, leads to the generation of a smaller and more efficient index structure, and enables fast range-lookups as well as updates. We will see that representing the buckets in the 3D space such that the lookup of a key is performed both correctly and efficiently requires the careful orchestration of firing rays in a specific sequence. Our experimental evaluation shows that cgRX offers the most bang for the buck(et) by providing a throughput in relation to the memory footprint that is 1.5-3x higher than for the comparable range-lookup supporting baselines. At the same time, cgRX improves the range-lookup performance over RX by up to 2x and offers practical updateability that is up to 5.6x faster than rebuilding from scratch. 
2024-06-06T11:22:57Z Justus Henneberg Felix Schuhknecht Rosina Kharal Trevor Brown http://arxiv.org/abs/2506.04972v1 From Screen to Space: Evaluating Siemens' Cinematic Reality 2025-06-05T12:44:21Z As one of the first research teams with full access to Siemens' Cinematic Reality, we evaluate its usability and clinical potential for cinematic volume rendering on the Apple Vision Pro. We visualized venous-phase liver computed tomography and magnetic resonance cholangiopancreatography scans from the CHAOS and MRCP_DLRecon datasets. Fourteen medical experts assessed usability and anticipated clinical integration potential using the System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey. Their feedback identified feasibility, key usability strengths, and required features to catalyze the adaptation in real-world clinical workflows. The findings provide insights into the potential of immersive cinematic rendering in medical imaging. 2025-06-05T12:44:21Z 16 pages Gijs Luijten Lisle Faray de Paiva Sebastian Krueger Alexander Brost Laura Mazilescu Ana Sofia Ferreira Santos Peter Hoyer Jens Kleesiek Sophia Marie-Therese Schmitz Ulf Peter Neumann Jan Egger http://arxiv.org/abs/2411.16331v3 Sonic: Shifting Focus to Global Audio Perception in Portrait Animation 2025-06-05T11:49:59Z The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally-coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often results in degraded naturalness and temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior to adjust facial expressions and lip movements, without resorting to the interference of any visual signals.
Based on this motivation, we propose a novel paradigm, dubbed Sonic, to {s}hift f{o}cus on the exploration of global audio per{c}ept{i}o{n}. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and use both aspects jointly to enhance overall perception. For intra-clip audio perception, we propose (1) context-enhanced audio learning, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors implicitly expressed as the tone and speed of speech, and (2) a motion-decoupled controller, in which the motion of the head and expression movement are disentangled and independently controlled by intra-audio clips. Most importantly, for inter-clip audio perception, as a bridge that connects the intra-clips to achieve global perception, we propose time-aware position shift fusion, in which the global inter-clip audio information is considered and fused for long-audio inference through consecutively time-aware shifted windows. Extensive experiments demonstrate that the novel audio-driven paradigm outperforms existing SOTA methodologies in terms of video quality, temporal consistency, lip synchronization precision, and motion diversity. 2024-11-25T12:24:52Z Refer to our main page: https://jixiaozhong.github.io/Sonic/ Xiaozhong Ji Xiaobin Hu Zhihong Xu Junwei Zhu Chuming Lin Qingdong He Jiangning Zhang Donghao Luo Yi Chen Qin Lin Qinglin Lu Chengjie Wang