https://arxiv.org/api/GP77/QB0OjjPU2pcM1Qt/i75LIc
2026-06-22T11:23:20Z
935496015

http://arxiv.org/abs/2601.21786v1
Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring
2026-01-29T14:34:01Z
Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision or annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. 
This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.
2026-01-29T14:34:01Z
Applications of Machine Learning 2025, Proc. of SPIE Vol. 13606, 136061G 2025 Published by SPIE 0277-786X
Borja Carrillo-Perez, Felix Sattler, Angel Bueno Rodriguez, Maurice Stephan, Sarah Barnes
10.1117/12.3063784

http://arxiv.org/abs/2601.21314v1
HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence
2026-01-29T06:22:26Z
High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches for faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a $6\times$ improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. 
Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.
2026-01-29T06:22:26Z
Yanfeng Li, Tao Tan, Qingquan Gao, Zhiwen Cao, Xiaohong liu, Yue Sun

http://arxiv.org/abs/2412.00112v4
BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
2026-01-29T04:10:28Z
Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion-part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
2024-11-28T05:42:47Z
18 pages, 11 figures. 
Accepted to WACV 2026 (Oral)
Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang

http://arxiv.org/abs/2601.21141v1
Optimization and Mobile Deployment for Anthropocene Neural Style Transfer
2026-01-29T00:50:03Z
This paper presents AnthropoCam, a mobile-based neural style transfer (NST) system optimized for the visual synthesis of Anthropocene environments. Unlike conventional artistic NST, which prioritizes painterly abstraction, stylizing human-altered landscapes demands a careful balance between amplifying material textures and preserving semantic legibility. Industrial infrastructures, waste accumulations, and modified ecosystems contain dense, repetitive patterns that are visually expressive yet highly susceptible to semantic erosion under aggressive style transfer.
To address this challenge, we systematically investigate the impact of NST parameter configurations on the visual translation of Anthropocene textures, including feature layer selection, style and content loss weighting, training stability, and output resolution. Through controlled experiments, we identify an optimal parameter manifold that maximizes stylistic expression while preventing semantic erasure. Our results demonstrate that appropriate combinations of convolutional depth, loss ratios, and resolution scaling enable the faithful transformation of anthropogenic material properties into a coherent visual language.
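As an illustration of what "style and content loss weighting" and "feature layer selection" usually refer to, the following is a minimal sketch of the standard Gram-matrix NST objective. It is not taken from the AnthropoCam paper: the layer names, default weights, and function names are hypothetical, and the feature maps are assumed to be plain NumPy arrays of shape (channels, height, width) produced by some convolutional encoder.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-to-channel correlations of a (C, H, W) feature map, size-normalised."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


def content_loss(generated: np.ndarray, content: np.ndarray) -> float:
    """Mean squared difference between feature maps (preserves semantic layout)."""
    return float(np.mean((generated - content) ** 2))


def style_loss(generated: np.ndarray, style: np.ndarray) -> float:
    """Mean squared difference between Gram matrices (matches texture statistics)."""
    return float(np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2))


def total_loss(gen_feats, content_feats, style_feats,
               content_layer, style_layers,
               content_weight=1.0, style_weight=1e3):
    """Weighted sum of one content term and several style terms.

    The dictionaries map (hypothetical) layer names to feature maps. Choosing
    which layers appear in `style_layers` is the "feature layer selection" knob;
    the ratio style_weight / content_weight is the "loss weighting" knob.
    """
    lc = content_loss(gen_feats[content_layer], content_feats[content_layer])
    ls = sum(style_loss(gen_feats[name], style_feats[name]) for name in style_layers)
    return content_weight * lc + style_weight * ls
```

Under this formulation, raising `style_weight` relative to `content_weight` pushes the output toward the style textures at the cost of content structure, which is the kind of trade-off the abstract describes as balancing stylistic expression against semantic erasure.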
Building on these findings, we implement a low-latency, feed-forward NST pipeline deployed on mobile devices. The system integrates a React Native frontend with a Flask-based GPU backend, achieving high-resolution inference within 3-5 seconds on general mobile hardware. This enables real-time, in-situ visual intervention at the site of image capture, supporting participatory engagement with Anthropocene landscapes.
By coupling domain-specific NST optimization with mobile deployment, AnthropoCam reframes neural style transfer as a practical and expressive tool for real-time environmental visualization in the Anthropocene.
2026-01-29T00:50:03Z
7 pages, 11 figures, submitted to SIGGRAPH 2026
Po-Hsun Chen, Ivan C. H. Liu

http://arxiv.org/abs/2601.20722v1
Rendering Portals in Virtual Reality
2026-01-28T15:56:25Z
Portals have many applications in the field of computer graphics. Recently, they have found use as a way of artificially increasing the available space in a virtual reality (VR) environment. In this paper, we will cover a technique for making the transition through a portal unnoticeable to the user. Additionally, we will measure the performance impact of rendering portals in a test scene and provide some insight into possible optimisations.
2026-01-28T15:56:25Z
Milan van Zanten

http://arxiv.org/abs/2601.20429v1
GRTX: Efficient Ray Tracing for 3D Gaussian-Based Rendering
2026-01-28T09:37:12Z
3D Gaussian Splatting has gained widespread adoption across diverse applications due to its exceptional rendering performance and visual quality. While most existing methods rely on rasterization to render Gaussians, recent research has started investigating ray tracing approaches to overcome the fundamental limitations inherent in rasterization. However, current Gaussian ray tracing methods suffer from inefficiencies such as bloated acceleration structures and redundant node traversals, which greatly degrade ray tracing performance.
In this work, we present GRTX, a set of software and hardware optimizations that enable efficient ray tracing for 3D Gaussian-based rendering. First, we introduce a novel approach for constructing streamlined acceleration structures for Gaussian primitives. Our key insight is that anisotropic Gaussians can be treated as unit spheres through ray space transformations, which substantially reduces BVH size and traversal overhead. Second, we propose dedicated hardware support for traversal checkpointing within ray tracing units. This eliminates redundant node visits during multi-round tracing by resuming traversal from checkpointed nodes rather than restarting from the root node in each subsequent round. Our evaluation shows that GRTX significantly improves ray tracing performance compared to the baseline ray tracing method with a negligible hardware cost.
2026-01-28T09:37:12Z
To appear at the 32nd International Symposium on High-Performance Computer Architecture (HPCA 2026)
Junseo Lee, Sangyun Jeon, Jungi Lee, Junyong Park, Jaewoong Sim

http://arxiv.org/abs/2506.01091v2
PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation
2026-01-28T00:03:21Z
Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time-consuming. Generative solutions typically rely on computationally intense methods such as diffusion models, which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates color, opacity, and positions of 3D Gaussians in real time. 
This design avoids overheads such as mesh extraction and manual or physics-based simulations, and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device, even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to democratize VFX further. Code available at https://obsphera.github.io/promptvfx/.
2025-06-01T17:22:59Z
Mert Kiray, Paul Uhlenbruck, Nassir Navab, Benjamin Busam

http://arxiv.org/abs/2601.19519v1
Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild)
2026-01-27T11:58:34Z
We introduce a novel motion capture system that reconstructs full-body 3D motion using only sparse pairwise distance (PWD) measurements from body-mounted ultra-wideband (UWB) sensors. Using time-of-flight ranging between wireless nodes, our method eliminates the need for external cameras, enabling robust operation in uncontrolled and outdoor environments. Unlike traditional optical or inertial systems, our approach is shape-invariant and resilient to environmental constraints such as lighting and magnetic interference. At the core of our system is Wild-Poser (WiP for short), a compact, real-time Transformer-based architecture that directly predicts 3D joint positions from noisy or corrupted PWD measurements, which can later be used for joint rotation reconstruction via learned methods. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in the wild. 
Our empirical analysis highlights its potential for scalable, low-cost, and general purpose motion capture in real-world settings.
2026-01-27T11:58:34Z
14 pages, 15 figures
Ofir Abramovich, Ariel Shamir, Andreas Aristidou

http://arxiv.org/abs/2601.19425v1
It's Not Just a Phase: Creating Phase-Aligned Peripheral Metamers
2026-01-27T10:03:45Z
Novel display technologies can deliver high-quality images across a wide field of view, creating immersive experiences. While rendering for such devices is expensive, most of the content falls into peripheral vision, where human perception differs from that in the fovea. Consequently, it is critical to understand and leverage the limitations of visual perception to enable efficient rendering. A standard approach is to exploit the reduced sensitivity to spatial details in the periphery by reducing rendering resolution, so-called foveated rendering. While this strategy avoids rendering part of the content altogether, an alternative promising direction is to replace accurate and expensive rendering with inexpensive synthesis of content that is perceptually indistinguishable from the ground-truth image. In this paper, we propose such a method for the efficient generation of an image signal that substitutes the rendering of high-frequency details. The method is grounded in findings from image statistics, which show that preserving appropriate local statistics is critical for perceived image quality. Based on this insight, we extrapolate several local image statistics from foveated content into higher spatial frequency ranges that are attenuated or omitted in the rendering process. This rich set of statistics is later used to synthesize a signal that is added to the initial rendering, boosting its perceived quality. We focus on phase information, demonstrating the importance of its alignment across space and frequencies. 
We calibrate and compare our method with state-of-the-art strategies, showing a significant reduction in the content that must be accurately rendered at a relatively small extra cost for synthesizing the additional signal.
2026-01-27T10:03:45Z
10 pages including references and figure only pages; 2 pages Supplementary Material
Sophie Kergaßner, Piotr Didyk

http://arxiv.org/abs/2504.13386v4
Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis
2026-01-27T08:03:40Z
In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. 
Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations. The code and models will be available at https://thunder.is.tue.mpg.de/
2025-04-18T00:24:52Z
Radek Daněček, Carolin Schmitt, Senya Polikovsky, Michael J. Black

http://arxiv.org/abs/2601.19310v1
ClipGS-VR: Immersive and Interactive Cinematic Visualization of Volumetric Medical Data in Mobile Virtual Reality
2026-01-27T07:48:59Z
High-fidelity cinematic medical visualization on mobile virtual reality (VR) remains challenging. Although ClipGS enables cross-sectional exploration via 3D Gaussian Splatting, it lacks arbitrary-angle slicing on consumer-grade VR headsets. To achieve real-time interactive performance, we introduce ClipGS-VR and restructure ClipGS's neural inference into a consolidated dataset, integrating high-fidelity layers from multiple pre-computed slicing states into a unified rendering structure. Our framework further supports arbitrary-angle slicing via gradient-based opacity modulation for smooth, visually coherent rendering. Evaluations confirm our approach maintains visual fidelity comparable to offline results while offering superior usability and interaction efficiency.
2026-01-27T07:48:59Z
IEEE VR 2026 Posters
Yuqi Tong, Ruiyang Li, Chengkun Li, Qixuan Liu, Shi Qiu, Pheng-Ann Heng

http://arxiv.org/abs/2601.19303v1
A Collaborative Extended Reality Prototype for 3D Surgical Planning and Visualization
2026-01-27T07:42:51Z
We present a collaborative extended reality (XR) prototype for 3D surgical planning and visualization. Our system consists of three key modules: XR-based immersive surgical planning, cloud-based data management, and coordinated stereoscopic 3D displays for interactive visualization. We describe the overall workflow, core functionalities, implementations and setups. 
By conducting user studies on a liver resection surgical planning case, we demonstrate the effectiveness of our prototype and provide practical insights to inspire future advances in medical XR collaboration.
2026-01-27T07:42:51Z
IEEE VR 2026 Posters
Shi Qiu, Ruiyang Li, Qixuan Liu, Yuqi Tong, Yue Qiu, Yinqiao Wang, Yan Li, Chi-Wing Fu, Pheng-Ann Heng

http://arxiv.org/abs/2601.19294v1
Words have Weight: Comparing the use of pressure and weight as a metaphor in a User Interface in Virtual Reality
2026-01-27T07:36:37Z
This work investigates how weight and pressure can function as haptic metaphors to support user interface notifications in Virtual Reality (VR). While prior research has explored ungrounded weight simulation and pneumatic feedback, their combined role in conveying information through UI elements remains underexplored. We developed a wearable haptic device that transfers liquid and air into flexible containers mounted on the back of the user's hand, allowing us to independently manipulate weight and pressure. Through an initial evaluation using three conditions (no feedback, weight only, and weight combined with pressure), we examined how these signals affect perceived heaviness, coherence with visual cues, and the perceived urgency of notifications. Our results validate that pressure amplifies the perception of weight, but this increased heaviness does not translate into higher perceived urgency. 
These findings suggest that while pressure-enhanced weight can enrich haptic rendering of UI elements in VR, its contribution to communicating urgency may require further investigation, alternative pressure profiles, or different types of notifications.
2026-01-27T07:36:37Z
IEEE World Haptics Conference 2025, Jul 2025, Suwon, South Korea
Joffrey Guilmet (ESIEA, UM), Suzanne Sorli (ESIEA), Diego Vilela Monteiro (ESIEA)

http://arxiv.org/abs/2601.19233v1
UniMGS: Unifying Mesh and 3D Gaussian Splatting with Single-Pass Rasterization and Proxy-Based Deformation
2026-01-27T06:05:14Z
Joint rendering and deformation of mesh and 3D Gaussian Splatting (3DGS) have significant value as both representations offer complementary advantages for graphics applications. However, due to differences in representation and rendering pipelines, existing studies render meshes and 3DGS separately, making it difficult to accurately handle occlusions and transparency. Moreover, the deformed 3DGS still suffers from visual artifacts due to the sensitivity to the topology quality of the proxy mesh. These issues pose serious obstacles to the joint use of 3DGS and meshes, making it difficult to adapt 3DGS to conventional mesh-oriented graphics pipelines. We propose UniMGS, the first unified framework for rasterizing mesh and 3DGS in a single-pass anti-aliased manner, with a novel binding strategy for 3DGS deformation based on proxy mesh. Our key insight is to blend the colors of both triangle and Gaussian fragments by anti-aliased α-blending in a single pass, achieving visually coherent results with precise handling of occlusion and transparency. To improve the visual appearance of the deformed 3DGS, our Gaussian-centric binding strategy employs a proxy mesh and spatially associates Gaussians with the mesh faces, significantly reducing rendering artifacts. 
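To make the idea of blending triangle and Gaussian fragments in one pass more concrete, here is a minimal per-pixel sketch of ordinary front-to-back α-compositing over a mixed fragment list. It is an assumption-laden illustration rather than the UniMGS rasterizer: the `Fragment` type, the `composite` function, and the early-exit threshold are hypothetical, and the anti-aliasing part of the method is not modelled.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Color = Tuple[float, float, float]


@dataclass
class Fragment:
    depth: float   # distance from the camera along the pixel's ray
    color: Color   # RGB in [0, 1]
    alpha: float   # opacity in [0, 1]; an opaque triangle fragment has alpha 1.0
    source: str    # "triangle" or "gaussian" (informational only)


def composite(fragments: Iterable[Fragment],
              background: Color = (0.0, 0.0, 0.0)) -> Color:
    """Blend all fragments covering one pixel, nearest first.

    Both fragment kinds go through the same depth-sorted loop, so an opaque
    triangle hides Gaussians behind it and a semi-transparent Gaussian in
    front of a triangle tints it.
    """
    transmittance = 1.0  # fraction of light still passing through
    out = [0.0, 0.0, 0.0]
    for frag in sorted(fragments, key=lambda f: f.depth):
        weight = transmittance * frag.alpha
        for i in range(3):
            out[i] += weight * frag.color[i]
        transmittance *= 1.0 - frag.alpha
        if transmittance <= 1e-4:  # nothing behind this point is visible
            break
    return tuple(out[i] + transmittance * background[i] for i in range(3))
```

For example, a white Gaussian fragment with opacity 0.5 in front of an opaque blue triangle fragment contributes half of its color, and the triangle supplies the remaining half.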
With these two components, UniMGS enables the visualization and manipulation of 3D objects represented by mesh or 3DGS within a unified framework, opening up new possibilities in embodied AI, virtual reality, and gaming. We will release our source code to facilitate future research.
2026-01-27T06:05:14Z
conference
Zeyu Xiao, Mingyang Sun, Yimin Cong, Lintao Wang, Dongliang Kou, Zhenyi Wu, Dingkang Yang, Peng Zhai, Zeyu Wang, Lihua Zhang

http://arxiv.org/abs/2601.19036v1
The Last Mile to Production Readiness: Physics-Based Motion Refinement for Video-Based Capture
2026-01-26T23:39:10Z
High-quality motion data underpins games, film, XR, and robotics. Vision-based motion capture tools have made significant progress, offering accessible and visually convincing results, yet often fall short in the final stretch -- the last mile -- when it comes to physical realism and production readiness, due to various artifacts introduced during capture. In this paper, we summarize key issues through case studies and feedback from professional animators to set a stepping stone for future research in motion cleanup. We then present a physics-based motion refinement framework to bridge the gap, with the goal of reducing labor-intensive manual cleanup and enhancing visual quality and physical realism. Our framework supports both single- and multi-character sequences and can be integrated into animator workflows for further refinement, such as stylizing motions via keyframe editing.
2026-01-26T23:39:10Z
Tianxin Tao, Han Liu, Hung Yu Ling