https://arxiv.org/api/GP77/QB0OjjPU2pcM1Qt/i75LIc 2026-06-22T11:23:20Z 9354 960 15 http://arxiv.org/abs/2601.21786v1 Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring 2026-01-29T14:34:01Z

Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-ofthe-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.

2026-01-29T14:34:01Z Applications of Machine Learning 2025, Proc. of SPIE Vol. 13606, 136061G 2025 Published by SPIE 0277-786X Borja Carrillo-Perez Felix Sattler Angel Bueno Rodriguez Maurice Stephan Sarah Barnes 10.1117/12.3063784 http://arxiv.org/abs/2601.21314v1 HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence 2026-01-29T06:22:26Z

High-fidelity 3D meshes can be tokenized into one-dimension (1D) sequences and directly modeled using autoregressive approaches for faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a $6\times$ improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.

2026-01-29T06:22:26Z Yanfeng Li Tao Tan Qingquan Gao Zhiwen Cao Xiaohong liu Yue Sun http://arxiv.org/abs/2412.00112v4 BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis 2026-01-29T04:10:28Z

Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes the certain motion part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal the BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.

2024-11-28T05:42:47Z 18 pages, 11 figures. Accepted to WACV 2026 (Oral) Seong-Eun Hong Soobin Lim Juyeong Hwang Minwook Chang Hyeongyeop Kang http://arxiv.org/abs/2601.21141v1 Optimization and Mobile Deployment for Anthropocene Neural Style Transfer 2026-01-29T00:50:03Z

This paper presents AnthropoCam, a mobile-based neural style transfer (NST) system optimized for the visual synthesis of Anthropocene environments. Unlike conventional artistic NST, which prioritizes painterly abstraction, stylizing human-altered landscapes demands a careful balance between amplifying material textures and preserving semantic legibility. Industrial infrastructures, waste accumulations, and modified ecosystems contain dense, repetitive patterns that are visually expressive yet highly susceptible to semantic erosion under aggressive style transfer. To address this challenge, we systematically investigate the impact of NST parameter configurations on the visual translation of Anthropocene textures, including feature layer selection, style and content loss weighting, training stability, and output resolution. Through controlled experiments, we identify an optimal parameter manifold that maximizes stylistic expression while preventing semantic erasure. Our results demonstrate that appropriate combinations of convolutional depth, loss ratios, and resolution scaling enable the faithful transformation of anthropogenic material properties into a coherent visual language. Building on these findings, we implement a low-latency, feed-forward NST pipeline deployed on mobile devices. The system integrates a React Native frontend with a Flask-based GPU backend, achieving high-resolution inference within 3-5 seconds on general mobile hardware. This enables real-time, in-situ visual intervention at the site of image capture, supporting participatory engagement with Anthropocene landscapes. By coupling domain-specific NST optimization with mobile deployment, AnthropoCam reframes neural style transfer as a practical and expressive tool for real-time environmental visualization in the Anthropocene.

2026-01-29T00:50:03Z 7 pages, 11 figures, submitted to SIGGRAPH 2026 Po-Hsun Chen Ivan C. H. Liu http://arxiv.org/abs/2601.20722v1 Rendering Portals in Virtual Reality 2026-01-28T15:56:25Z

Portals have many applications in the field of computer graphics. Recently, they have found use as a way of artificially increasing the available space in a virtual reality (VR) environment. In this paper, we will cover a technique for making the transition through a portal unnoticeable to the user. Additionally, we will measure the performance impact of rendering portals in a test scene and provide some insight into possible optimisations.

2026-01-28T15:56:25Z Milan van Zanten http://arxiv.org/abs/2601.20429v1 GRTX: Efficient Ray Tracing for 3D Gaussian-Based Rendering 2026-01-28T09:37:12Z

3D Gaussian Splatting has gained widespread adoption across diverse applications due to its exceptional rendering performance and visual quality. While most existing methods rely on rasterization to render Gaussians, recent research has started investigating ray tracing approaches to overcome the fundamental limitations inherent in rasterization. However, current Gaussian ray tracing methods suffer from inefficiencies such as bloated acceleration structures and redundant node traversals, which greatly degrade ray tracing performance. In this work, we present GRTX, a set of software and hardware optimizations that enable efficient ray tracing for 3D Gaussian-based rendering. First, we introduce a novel approach for constructing streamlined acceleration structures for Gaussian primitives. Our key insight is that anisotropic Gaussians can be treated as unit spheres through ray space transformations, which substantially reduces BVH size and traversal overhead. Second, we propose dedicated hardware support for traversal checkpointing within ray tracing units. This eliminates redundant node visits during multi-round tracing by resuming traversal from checkpointed nodes rather than restarting from the root node in each subsequent round. Our evaluation shows that GRTX significantly improves ray tracing performance compared to the baseline ray tracing method with a negligible hardware cost.

2026-01-28T09:37:12Z To appear at the 32nd International Symposium on High-Performance Computer Architecture (HPCA 2026) Junseo Lee Sangyun Jeon Jungi Lee Junyong Park Jaewoong Sim http://arxiv.org/abs/2506.01091v2 PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation 2026-01-28T00:03:21Z

Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time consuming. Generative solutions typically rely on computationally intense methods such as diffusion models which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction, manual or physics-based simulations and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to democratize VFX further. Code available at https://obsphera.github.io/promptvfx/.

2025-06-01T17:22:59Z Mert Kiray Paul Uhlenbruck Nassir Navab Benjamin Busam http://arxiv.org/abs/2601.19519v1 Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild) 2026-01-27T11:58:34Z

We introduce a novel motion capture system that reconstructs full-body 3D motion using only sparse pairwise distance (PWD) measurements from body-mounted(UWB) sensors. Using time-of-flight ranging between wireless nodes, our method eliminates the need for external cameras, enabling robust operation in uncontrolled and outdoor environments. Unlike traditional optical or inertial systems, our approach is shape-invariant and resilient to environmental constraints such as lighting and magnetic interference. At the core of our system is Wild-Poser (WiP for short), a compact, real-time Transformer-based architecture that directly predicts 3D joint positions from noisy or corrupted PWD measurements, which can later be used for joint rotation reconstruction via learned methods. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in-the-wild. Our empirical analysis highlights its potential for scalable, low-cost, and general purpose motion capture in real-world settings.

2026-01-27T11:58:34Z 14 pages, 15 figures Ofir Abramovich Ariel Shamir Andreas Aristidou http://arxiv.org/abs/2601.19425v1 It's Not Just a Phase: Creating Phase-Aligned Peripheral Metamers 2026-01-27T10:03:45Z

Novel display technologies can deliver high-quality images across a wide field of view, creating immersive experiences. While rendering for such devices is expensive, most of the content falls into peripheral vision, where human perception differs from that in the fovea. Consequently, it is critical to understand and leverage the limitations of visual perception to enable efficient rendering. A standard approach is to exploit the reduced sensitivity to spatial details in the periphery by reducing rendering resolution, so-called foveated rendering. While this strategy avoids rendering part of the content altogether, an alternative promising direction is to replace accurate and expensive rendering with inexpensive synthesis of content that is perceptually indistinguishable from the ground-truth image. In this paper, we propose such a method for the efficient generation of an image signal that substitutes the rendering of high-frequency details. The method is grounded in findings from image statistics, which show that preserving appropriate local statistics is critical for perceived image quality. Based on this insight, we extrapolate several local image statistics from foveated content into higher spatial frequency ranges that are attenuated or omitted in the rendering process. This rich set of statistics is later used to synthesize a signal that is added to the initial rendering, boosting its perceived quality. We focus on phase information, demonstrating the importance of its alignment across space and frequencies. We calibrate and compare our method with state-of-the-art strategies, showing a significant reduction in the content that must be accurately rendered at a relatively small extra cost for synthesizing the additional signal.

2026-01-27T10:03:45Z 10 pages including references and figure only pages; 2 pages Supplementary Material Sophie Kergaßner Piotr Didyk http://arxiv.org/abs/2504.13386v4 Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis 2026-01-27T08:03:40Z

In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations. The code and models will be available at https://thunder.is.tue.mpg.de/

2025-04-18T00:24:52Z Radek Daněček Carolin Schmitt Senya Polikovsky Michael J. Black http://arxiv.org/abs/2601.19310v1 ClipGS-VR: Immersive and Interactive Cinematic Visualization of Volumetric Medical Data in Mobile Virtual Reality 2026-01-27T07:48:59Z

High-fidelity cinematic medical visualization on mobile virtual reality (VR) remains challenging. Although ClipGS enables cross-sectional exploration via 3D Gaussian Splatting, it lacks arbitrary-angle slicing on consumer-grade VR headsets. To achieve real-time interactive performance, we introduce ClipGS-VR and restructure ClipGS's neural inference into a consolidated dataset, integrating high-fidelity layers from multiple pre-computed slicing states into a unified rendering structure. Our framework further supports arbitrary-angle slicing via gradient-based opacity modulation for smooth, visually coherent rendering. Evaluations confirm our approach maintains visual fidelity comparable to offline results while offering superior usability and interaction efficiency.

2026-01-27T07:48:59Z IEEE VR 2026 Posters Yuqi Tong Ruiyang Li Chengkun Li Qixuan Liu Shi Qiu Pheng-Ann Heng http://arxiv.org/abs/2601.19303v1 A Collaborative Extended Reality Prototype for 3D Surgical Planning and Visualization 2026-01-27T07:42:51Z

We present a collaborative extended reality (XR) prototype for 3D surgical planning and visualization. Our system consists of three key modules: XR-based immersive surgical planning, cloud-based data management, and coordinated stereoscopic 3D displays for interactive visualization. We describe the overall workflow, core functionalities, implementations and setups. By conducting user studies on a liver resection surgical planning case, we demonstrate the effectiveness of our prototype and provide practical insights to inspire future advances in medical XR collaboration.

2026-01-27T07:42:51Z IEEE VR 2026 Posters Shi Qiu Ruiyang Li Qixuan Liu Yuqi Tong Yue Qiu Yinqiao Wang Yan Li Chi-Wing Fu Pheng-Ann Heng http://arxiv.org/abs/2601.19294v1 Words have Weight: Comparing the use of pressure and weight as a metaphor in a User Interface in Virtual Reality 2026-01-27T07:36:37Z

This work investigates how weight and pressure can function as haptic metaphors to support user interface notifications in Virtual Reality (VR). While prior research has explored ungrounded weight simulation and pneumatic feedback, their combined role in conveying information through UI elements remains underexplored. We developed a wearable haptic device that transfers liquid and air into flexible containers mounted on the back of the user's hand, allowing us to independently manipulate weight and pressure. Through an initial evaluation using three conditions-no feedback, weight only, and weight combined with pressure-we examined how these signals affect perceived heaviness, coherence with visual cues, and the perceived urgency of notifications. Our results validate that pressure amplifies the perception of weight, but this increased heaviness does not translate into higher perceived urgency. These findings suggest that while pressure___enhanced weight can enrich haptic rendering of UI elements in VR, its contribution to communicating urgency may require further investigation, alternative pressure profiles, or different types of notifications.

2026-01-27T07:36:37Z IEEE World Haptics Conference 2025, Jul 2025, Suwon, South Korea Joffrey Guilmet ESIEA, UM Suzanne Sorli ESIEA Diego Vilela Monteiro ESIEA http://arxiv.org/abs/2601.19233v1 UniMGS: Unifying Mesh and 3D Gaussian Splatting with Single-Pass Rasterization and Proxy-Based Deformation 2026-01-27T06:05:14Z

Joint rendering and deformation of mesh and 3D Gaussian Splatting (3DGS) have significant value as both representa tions offer complementary advantages for graphics applica tions. However, due to differences in representation and ren dering pipelines, existing studies render meshes and 3DGS separately, making it difficult to accurately handle occlusions and transparency. Moreover, the deformed 3DGS still suffers from visual artifacts due to the sensitivity to the topology quality of the proxy mesh. These issues pose serious obsta cles to the joint use of 3DGS and meshes, making it diffi cult to adapt 3DGS to conventional mesh-oriented graphics pipelines. We propose UniMGS, the first unified framework for rasterizing mesh and 3DGS in a single-pass anti-aliased manner, with a novel binding strategy for 3DGS deformation based on proxy mesh. Our key insight is to blend the col ors of both triangle and Gaussian fragments by anti-aliased α-blending in a single pass, achieving visually coherent re sults with precise handling of occlusion and transparency. To improve the visual appearance of the deformed 3DGS, our Gaussian-centric binding strategy employs a proxy mesh and spatially associates Gaussians with the mesh faces, signifi cantly reducing rendering artifacts. With these two compo nents, UniMGS enables the visualization and manipulation of 3D objects represented by mesh or 3DGS within a unified framework, opening up new possibilities in embodied AI, vir tual reality, and gaming. We will release our source code to facilitate future research.

2026-01-27T06:05:14Z conference Zeyu Xiao Mingyang Sun Yimin Cong Lintao Wang Dongliang Kou Zhenyi Wu Dingkang Yang Peng Zhai Zeyu Wang Lihua Zhang http://arxiv.org/abs/2601.19036v1 The Last Mile to Production Readiness: Physics-Based Motion Refinement for Video-Based Capture 2026-01-26T23:39:10Z

High-quality motion data underpins games, film, XR, and robotics. Vision-based motion capture tools have made significant progress, offering accessible and visually convincing results, yet often fall short in the final stretch -- the last mile -- when it comes to physical realism and production readiness, due to various artifacts introduced during capture. In this paper, we summarize key issues through case studies and feedback from professional animators to set a stepping stone for future research in motion cleanup. We then present a physics-based motion refinement framework to bridge the gap, with the goal of reducing labor-intensive manual cleanup and enhancing visual quality and physical realism. Our framework supports both single- and multi-character sequences and can be integrated into animator workflows for further refinement, such as stylizing motions via keyframe editing.

2026-01-26T23:39:10Z Tianxin Tao Han Liu Hung Yu Ling