http://arxiv.org/abs/2601.14207v2
Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Updated: 2026-02-28T06:46:28Z
We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of a soft-Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
Published: 2026-01-20T18:12:55Z
Comments: GitHub Page: https://rotemgat.github.io/CopyTransformPaste/
Authors: Rotem Gatenyo, Ohad Fried

http://arxiv.org/abs/2603.00413v1
DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
Updated: 2026-02-28T02:21:31Z
Reconstructing transparent objects from a set of multi-view images is a challenging task due to the complicated nature and indeterminate behavior of light propagation.
Typical methods are primarily tailored to specific scenarios, such as objects following a uniform topology, exhibiting ideal transparency and surface specular reflections, or with only surface materials, which substantially constrains their practical applicability in real-world settings. In this work, we propose a differentiable rendering framework for transparent objects, dubbed DiffTrans, which allows for efficient decomposition and reconstruction of the geometry and materials of transparent objects, thereby reconstructing transparent objects accurately in intricate scenes with diverse topology and complex texture. Specifically, we first utilize FlexiCubes with dilation and smoothness regularization as the iso-surface representation to reconstruct an initial geometry efficiently from the multi-view object silhouette. Meanwhile, we employ the environment light radiance field to recover the environment of the scene. Then we devise a recursive differentiable ray tracer to further optimize the geometry, index of refraction and absorption rate simultaneously in a unified and end-to-end manner, leading to high-quality reconstruction of transparent objects in intricate scenes. A prominent advantage of the designed ray tracer is that it can be implemented in CUDA, enabling a significantly reduced computational cost. Extensive experiments on multiple benchmarks demonstrate the superior reconstruction performance of our DiffTrans compared with other methods, especially in intricate scenes involving transparent objects with diverse topology and complex texture. The code is available at https://github.com/lcp29/DiffTrans.
Published: 2026-02-28T02:21:31Z
Authors: Changpu Li, Shuang Wu, Songlin Tang, Guangming Lu, Jun Yu, Wenjie Pei

http://arxiv.org/abs/2505.12734v2
SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
Updated: 2026-02-27T23:27:56Z
Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds.
However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while the PSS effectively captures multi-level generation consistency across element, scene, and human perception. Project page: https://gisense.github.io/SounDiT-Page/
Published: 2025-05-19T05:47:13Z
Comments: 12 pages, 4 figures
Authors: Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Bing Zhou, Zhengzhong Tu, Shan Ye, Yuhao Kang

http://arxiv.org/abs/2603.00292v1
Ray Tracing using HIP
Updated: 2026-02-27T20:21:08Z
In this technical report, we introduce the basics of ray tracing and explain how to accelerate the computation of the rendering algorithm in HIP. We also show how to use HIPRT, a HIP ray tracing framework that leverages the hardware ray tracing features of AMD GPUs.
We conclude this technical report with a list of references for further reading.
Published: 2026-02-27T20:21:08Z
Authors: Atsushi Yoshimura, Kenta Eto, Daniel Meister, Takahiro Harada

http://arxiv.org/abs/2509.25094v3
Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
Updated: 2026-02-27T18:42:50Z
Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines.
Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
Published: 2025-09-29T17:28:58Z
Authors: AmirHossein Zamani, Bruno Roy, Arianna Rampini

http://arxiv.org/abs/2602.24224v1
Random-Forest-Induced Graph Neural Networks for Tabular Learning
Updated: 2026-02-27T17:51:18Z
Graphs are essential for modeling complex relationships and capturing structured interactions in data. Graph Neural Networks (GNNs) are particularly effective when such relational structure is explicitly available, but many real-world datasets, most notably tabular data, lack an inherent graph representation. To address this limitation, we propose RF-GNN, a framework that constructs instance-level graphs from tabular data using proximity measures induced by random forests. These proximities capture nonlinear feature interactions and data-adaptive similarity without imposing restrictive assumptions on feature geometry. The resulting graphs enable the direct application of GNNs to tabular learning problems. Extensive experiments on 36 benchmark datasets demonstrate that RF-GNN consistently outperforms strong classical baselines and recent graph-construction methods in terms of weighted F1-score. Additional ablation studies highlight the impact of proximity design choices and graph construction settings.
Published: 2026-02-27T17:51:18Z
Authors: Haozhe Chen, Soheila Farokhi, Kelvyn Bladen, Hamid Karimi, Kevin R. Moon

http://arxiv.org/abs/2508.06316v2
The Beauty of Anisotropic Mesh Refinement: Omnitrees for Efficient Dyadic Discretizations
Updated: 2026-02-27T06:27:22Z
Structured adaptive mesh refinement (AMR), commonly implemented via quadtrees and octrees, underpins a wide range of applications including databases, computer graphics, physics simulations, and machine learning. However, octrees enforce isotropic refinement in regions of interest, which can be especially inefficient for problems that are intrinsically anisotropic--much resolution is spent where little information is gained.
This paper presents omnitrees as an anisotropic generalization of octrees and related data structures. Omnitrees allow refining only the locally most important dimensions, providing tree structures that are less deep than bintrees and less wide than octrees. As a result, the convergence of the AMR schemes can be increased by up to a factor of the dimensionality d for very anisotropic problems, quickly offsetting their modest increase in storage overhead. We validate this finding on the problem of binary shape representation across 4,166 three-dimensional objects: Omnitrees increase the mean convergence rate by 1.5x, require less storage to achieve equivalent error bounds, and maximize the information density of the stored function faster than octrees. These advantages are projected to be even stronger for higher-dimensional problems. We provide a first validation by introducing a time-dependent rotation to create four-dimensional representations, and discuss the properties of their 4-d octree and omnitree approximations. Overall, omnitree discretizations can make existing AMR approaches more efficient, and open up new possibilities for high-dimensional applications.
Published: 2025-08-08T13:42:59Z
Comments: contains pdf animations; we recommend Okular or Firefox for viewing
Authors: Theresa Pollinger, Masado Ishii, Jens Domke

http://arxiv.org/abs/2602.23660v1
Assessment of Display Performance and Comparative Evaluation of Web Map Libraries for Extensive 3D Geospatial Data
Updated: 2026-02-27T04:01:18Z
Large-scale 3D geospatial data visualization has become increasingly critical for the development of the digital society infrastructure in Japan. This study conducted a comprehensive performance evaluation of two major WebGL-based web mapping libraries, CesiumJS and MapLibre GL JS, using large-scale 3D point-cloud data from the VIRTUAL SHIZUOKA and PLATEAU building models.
The research employs the standardized 3D Tiles 1.1 and Mapbox Vector Tiles (MVT) formats, comparing performance across different data scales (2nd and 3rd grid levels) using Core Web Vitals metrics, including First Contentful Paint (FCP), Largest Contentful Paint (LCP), Speed Index, Total Blocking Time (TBT), and Cumulative Layout Shift (CLS).
The results demonstrate that MVT-based building visualization with MapLibre GL JS achieves optimal performance (FCP 0.8s, TBT 0ms), whereas MapLibre GL JS combined with deck.gl shows superior performance for large-scale point cloud processing (TBT: 3ms, CesiumJS: 21,357ms). This study provides data-driven selection guidelines for appropriate technology choices according to use cases, establishing reproducible performance evaluation frameworks for 3D web mapping technologies during the WebGPU and OGC 3D Tiles 1.1 standardization era.
Published: 2026-02-27T04:01:18Z
Comments: 6 pages, 5 figures, 1 table
Authors: Toshikazu Seto, Yohei Shiwaku, Takayuki Miyauchi, Daisuke Yoshida, Yuichiro Nishimura

http://arxiv.org/abs/2510.12768v3
Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction
Updated: 2026-02-27T02:48:13Z
Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our approach estimates time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization.
Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.
Published: 2025-10-14T17:47:11Z
Comments: Accepted to ICLR 2026. Project page: https://tamu-visual-ai.github.io/usplat4d/
Authors: Fengzhi Guo, Chih-Chuan Hsu, Sihao Ding, Cheng Zhang

http://arxiv.org/abs/2510.03312v3
Universal Beta Splatting
Updated: 2026-02-26T22:20:36Z
We introduce Universal Beta Splatting (UBS), a unified framework that generalizes 3D Gaussian Splatting to N-dimensional anisotropic Beta kernels for explicit radiance field rendering. Unlike fixed Gaussian primitives, Beta kernels enable controllable dependency modeling across spatial, angular, and temporal dimensions within a single representation. Our unified approach captures complex light transport effects, handles anisotropic view-dependent appearance, and models scene dynamics without requiring auxiliary networks or specific color encodings. UBS maintains backward compatibility by approximating Gaussian Splatting as a special case, guaranteeing plug-in usability and lower performance bounds. The learned Beta parameters naturally decompose scene properties into interpretable components without explicit supervision: spatial (surface vs. texture), angular (diffuse vs. specular), and temporal (static vs. dynamic). Our CUDA-accelerated implementation achieves real-time rendering while consistently outperforming existing methods across static, view-dependent, and dynamic benchmarks, establishing Beta kernels as a scalable universal primitive for radiance field rendering.
Our project website is available at https://rongliu-leo.github.io/universal-beta-splatting/.
Published: 2025-09-30T22:03:22Z
Comments: ICLR 2026
Authors: Rong Liu, Zhongpai Gao, Benjamin Planche, Meida Chen, Van Nguyen Nguyen, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Yue Wang, Andrew Feng, Ziyan Wu

http://arxiv.org/abs/2409.02108v3
Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era
Updated: 2026-02-26T12:11:30Z
Shadows, formed by the occlusion of light, play an essential role in visual perception and directly influence scene understanding, image quality, and visual realism. This paper presents a unified survey and benchmark of deep-learning-based shadow detection, removal, and generation across images and videos. We introduce consistent taxonomies for architectures, supervision strategies, and learning paradigms; review major datasets and evaluation protocols; and re-train representative methods under standardized settings to enable fair comparison. Our benchmark reveals key findings, including inconsistencies in prior reports, strong dependence on model design and resolution, and limited cross-dataset generalization due to dataset bias. By synthesizing insights across the three tasks, we highlight shared illumination cues and priors that connect detection, removal, and generation. We further outline future directions involving unified all-in-one frameworks, semantics- and geometry-aware reasoning, shadow-based AIGC authenticity analysis, and the integration of physics-guided priors into multimodal foundation models. Corrected datasets, trained models, and evaluation tools are released to support reproducible research.
Published: 2024-09-03T17:59:05Z
Comments: Accepted by International Journal of Computer Vision (IJCV).
Publicly available results, trained models, and evaluation metrics at https://github.com/xw-hu/Unveiling-Deep-Shadows
Journal reference: International Journal of Computer Vision (IJCV), vol.134, article 158, 2026
Authors: Xiaowei Hu, Zhenghao Xing, Tianyu Wang, Chi-Wing Fu, Pheng-Ann Heng
DOI: 10.1007/s11263-026-02744-z

http://arxiv.org/abs/2406.09293v4
StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
Updated: 2026-02-26T08:31:53Z
We introduce StableMaterials, a novel approach for generating photorealistic physically based rendering (PBR) materials that integrates semi-supervised learning with Latent Diffusion Models (LDMs). Our method employs adversarial training to distill knowledge from existing large-scale image generation models, minimizing the reliance on annotated data and enhancing the diversity in generation. This distillation approach aligns the distribution of the generated materials with that of image textures from an SDXL model, enabling the generation of novel materials that are not present in the initial training dataset. Furthermore, we employ a diffusion-based refiner model to improve the visual quality of the samples and achieve high-resolution generation. Finally, we distill a latent consistency model for fast generation in just four steps and propose a new tileability technique that removes visual artifacts typically associated with fewer diffusion steps. We detail the architecture and training process of StableMaterials, the integration of semi-supervised training within existing LDM frameworks and show the advantages of our approach. Comparative evaluations with state-of-the-art methods show the effectiveness of StableMaterials, highlighting its potential applications in computer graphics and beyond.
StableMaterials is publicly available at https://gvecchio.com/stablematerials.
Published: 2024-06-13T16:29:46Z
Authors: Giuseppe Vecchio

http://arxiv.org/abs/2508.05115v2
RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer
Updated: 2026-02-26T08:08:56Z
Audio-driven portrait animation aims to synthesize realistic and natural talking head videos from an input audio signal and a single reference image. While existing methods achieve high-quality results by leveraging high-dimensional intermediate representations and explicitly modeling motion dynamics, their computational complexity renders them unsuitable for real-time deployment. Real-time inference imposes stringent latency and memory constraints, often necessitating the use of highly compressed latent representations. However, operating in such compact spaces hinders the preservation of fine-grained spatiotemporal details, thereby complicating audio-visual synchronization. We present RAP (Real-time Audio-driven Portrait animation), a unified framework for generating high-quality talking portraits under real-time constraints. Specifically, RAP introduces a hybrid attention mechanism for fine-grained audio control, and a static-dynamic training-inference paradigm that avoids explicit motion supervision. Through these techniques, RAP achieves precise audio-driven control, mitigates long-term temporal drift, and maintains high visual fidelity.
Extensive experiments demonstrate that RAP achieves state-of-the-art performance while operating under real-time constraints.
Published: 2025-08-07T07:47:16Z
Comments: 11 pages, 9 figures
Authors: Fangyu Du, Taiqing Li, Qian Qiao, Tan Yu, Ziwei Zhang, Dingcheng Zhen, Xu Jia, Yang Yang, Shunshun Yin, Siyuan Liu

http://arxiv.org/abs/2602.22701v1
BRepMAE: Self-Supervised Masked BRep Autoencoders for Machining Feature Recognition
Updated: 2026-02-26T07:22:32Z
We propose a masked self-supervised learning framework, called BRepMAE, for automatically extracting a valuable representation of the input computer-aided design (CAD) model to recognize its machining features. Representation learning is conducted on a large-scale, unlabeled CAD model dataset using the geometric Attributed Adjacency Graph (gAAG) representation, derived from the boundary representation (BRep). The self-supervised network is a masked graph autoencoder (MAE) that focuses on reconstructing geometries and attributes of BRep facets, rather than graph structures. After pre-training, we fine-tune a network that contains both the encoder and a task-specific classification network for machining feature recognition (MFR). In the experiments, our fine-tuned network achieves high recognition rates with only a small amount of data (e.g., 0.1% of the training data), significantly enhancing its practicality in real-world (or private) scenarios where only limited data is available.
Compared with other MFR methods, our fine-tuned network achieves a significant improvement in recognition rate with the same amount of training data, especially when the number of training samples is limited.
Published: 2026-02-26T07:22:32Z
Comments: 16 pages
Authors: Can Yao, Kang Wu, Zuheng Zheng, Siyuan Xing, Xiao-Ming Fu

http://arxiv.org/abs/2411.15468v2
SplatSDF: Boosting SDF-NeRF via Architecture-Level Fusion with Gaussian Splats
Updated: 2026-02-26T05:29:03Z
Signed distance-radiance field (SDF-NeRF) is a promising environment representation that offers both photo-realistic rendering and geometric reasoning such as proximity queries for collision avoidance. However, the slow training speed and convergence of SDF-NeRF hinder its use in practical robotic systems. We propose SplatSDF, a novel SDF-NeRF architecture that accelerates convergence using 3D Gaussian splats (3DGS), which can be quickly pre-trained. Unlike prior approaches that introduce a consistency loss between separate 3DGS and SDF-NeRF models, SplatSDF directly fuses 3DGS at an architectural level by consuming it as an input to SDF-NeRF during training. This is achieved using a novel sparse 3DGS fusion strategy that injects neural embeddings of 3DGS into SDF-NeRF around the object surface, while also permitting inference without 3DGS for minimal operation. Experimental results show SplatSDF achieves 3X faster convergence to the same geometric accuracy than the best baseline, and outperforms state-of-the-art SDF-NeRF methods in terms of chamfer distance and peak signal-to-noise ratio, unlike consistency loss-based approaches that in fact provide limited gains. We also present computational techniques for accelerating gradient and Hessian steps by 3X. We expect these improvements will contribute to deploying SDF-NeRF on practical systems.
Published: 2024-11-23T06:35:19Z
Authors: Runfa Blark Li, Keito Suzuki, Bang Du, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen