http://arxiv.org/abs/2312.12491v2 StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation 2025-07-08T17:45:49Z

We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as the Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present Stream Batch, a novel approach that transforms the original sequential denoising into a batched denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid, high-throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. In addition, we introduce a stochastic similarity filter (SSF) to optimize power consumption. Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG runs up to 2.05x faster than conventional CFG. Combining the proposed strategies with existing mature acceleration tools, image-to-image generation reaches up to 91.07 fps on one RTX 4090, improving the throughput of the AutoPipeline developed by Diffusers by more than 59.56x. Furthermore, StreamDiffusion reduces energy consumption by 2.39x on one RTX 3060 and by 1.99x on one RTX 4090.
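As a rough illustration of why removing negative-conditional passes matters, the sketch below counts U-Net evaluations per image under conventional CFG and under the two RCFG variants described above. The function names and the cost model (every U-Net pass costs the same and nothing else is counted) are assumptions made for illustration, not the authors' implementation; `cfg_combine` is the standard CFG combination rule.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move the noise prediction away from
    the unconditional (negative) branch and toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def unet_calls(num_steps, negative_passes):
    """Count U-Net evaluations for one image.

    Assumption: one conditional pass per denoising step, plus
    `negative_passes` extra passes for the negative condition.
    """
    return num_steps + negative_passes


def call_counts(num_steps):
    """U-Net evaluations under each guidance scheme (hypothetical labels)."""
    return {
        "cfg": unet_calls(num_steps, num_steps),     # negative pass at every step
        "rcfg_one_pass": unet_calls(num_steps, 1),   # a single negative pass
        "rcfg_zero_pass": unet_calls(num_steps, 0),  # no negative pass at all
    }


counts = call_counts(num_steps=4)
# Ratio of U-Net calls between conventional CFG and the zero-pass variant.
ideal_call_ratio = counts["cfg"] / counts["rcfg_zero_pass"]
```

This model ignores everything except U-Net calls (VAE encoding and decoding, scheduling, memory traffic), so it should not be expected to reproduce the measured speedups quoted in the abstract.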

2023-12-19T18:18:33Z tech report, the code is available at https://github.com/cumulo-autumn/StreamDiffusion Akio Kodaira Chenfeng Xu Toshiki Hazama Takanori Yoshimoto Kohei Ohno Shogo Mitsuhori Soichi Sugano Hanying Cho Zhijian Liu Masayoshi Tomizuka Kurt Keutzer http://arxiv.org/abs/2507.05572v1 AnatomyCarve: A VR occlusion management technique for medical images based on segment-aware clipping 2025-07-08T01:20:07Z

Visualizing 3D medical images is challenging due to self-occlusion, where anatomical structures of interest can be obscured by surrounding tissues. Existing methods, such as slicing and interactive clipping, are limited in their ability to fully represent internal anatomy in context. In contrast, hand-drawn medical illustrations in anatomy books manage occlusion effectively by selectively removing portions based on tissue type, revealing 3D structures while preserving context. This paper introduces AnatomyCarve, a novel technique developed for a VR environment that creates high-quality illustrations similar to those in anatomy books, while remaining fast and interactive. AnatomyCarve allows users to clip selected segments from 3D medical volumes, preserving spatial relations and contextual information. This approach enhances visualization by combining advanced rendering techniques with natural user interactions in VR. Usability of AnatomyCarve was assessed through a study with non-experts, while surgical planning effectiveness was evaluated with practicing neurosurgeons and residents. The results show that AnatomyCarve enables customized anatomical visualizations, with high user satisfaction, suggesting its potential for educational and clinical applications.

2025-07-08T01:20:07Z Andrey Titov Tina N. H. Nantenaina Marta Kersten-Oertel Simon Drouin http://arxiv.org/abs/2507.05447v1 NRXR-ID: Two-Factor Authentication (2FA) in VR Using Near-Range Extended Reality and Smartphones 2025-07-07T20:00:09Z

Two-factor authentication (2FA) has become widely adopted as an efficient and secure way to validate someone's identity online. Two-factor authentication is difficult in virtual reality (VR) because users are usually wearing a head-mounted display (HMD) which does not allow them to see their real-world surroundings. We present NRXR-ID, a technique to implement two-factor authentication while using extended reality systems and smartphones. The proposed method allows users to complete an authentication challenge using their smartphones without removing their HMD. We performed a user study where we explored four types of challenges for users, including a novel checkers-style challenge. Users responded to these challenges under three different configurations, including a technique that uses the smartphone to support gaze-based selection without the use of VR controllers. A 4X3 within-subjects design allowed us to study all the variations proposed. We collected performance metrics and performed user experience questionnaires to collect subjective impressions from 30 participants. Results suggest that the checkers-style visual matching challenge was the most appropriate option, followed by entering a digital PIN challenge submitted via the smartphone and answered within the VR environment.
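To make the study design concrete, the sketch below enumerates the conditions of a 4x3 within-subjects design. Only the checkers-style and PIN challenges and the gaze-based configuration are named in the abstract; every other label is a hypothetical placeholder, and the assumption of one observation per participant per condition is ours.

```python
from itertools import product

# Labels ending in "_placeholder" are hypothetical; the abstract does not name them.
challenges = [
    "checkers_matching",
    "digital_pin",
    "challenge_3_placeholder",
    "challenge_4_placeholder",
]
configurations = [
    "smartphone_gaze_selection",
    "configuration_2_placeholder",
    "configuration_3_placeholder",
]

# Within-subjects: every participant sees every challenge/configuration pair.
conditions = list(product(challenges, configurations))

num_participants = 30  # reported in the abstract
# Assumption: one observation per participant per condition.
num_observations = num_participants * len(conditions)
```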

2025-07-07T20:00:09Z Aiur Nanzatov Lourdes Peña-Castillo Oscar Meruvia-Pastor http://arxiv.org/abs/2507.05191v1 Neuralocks: Real-Time Dynamic Neural Hair Simulation 2025-07-07T16:49:19Z

Real-time hair simulation is a vital component in creating believable virtual avatars, as it provides a sense of immersion and authenticity. The dynamic behavior of hair, such as bouncing or swaying in response to character movements like jumping or walking, plays a significant role in enhancing the overall realism and engagement of virtual experiences. Current methods for simulating hair have been constrained by two primary approaches: highly optimized physics-based systems and neural methods. However, state-of-the-art neural techniques have been limited to quasi-static solutions, failing to capture the dynamic behavior of hair. This paper introduces a novel neural method that breaks through these limitations, achieving efficient and stable dynamic hair simulation while outperforming existing approaches. We propose a fully self-supervised method that can be trained without any manual intervention or artist-generated training data, allowing it to be integrated with hair reconstruction methods to enable automatic end-to-end avatar reconstruction. Our approach harnesses the power of compact, memory-efficient neural networks to simulate hair at the strand level, allowing for the simulation of diverse hairstyles without excessive computational resources or memory requirements. We validate the effectiveness of our method through a variety of hairstyle examples, showcasing its potential for real-world applications.

2025-07-07T16:49:19Z Gene Wei-Chin Lin Egor Larionov Hsiao-yu Chen Doug Roble Tuur Stuyck http://arxiv.org/abs/2507.05304v1 Self-Attention Based Multi-Scale Graph Auto-Encoder Network of 3D Meshes 2025-07-07T07:36:03Z

3D meshes are fundamental data representations for capturing complex geometric shapes in computer vision and graphics applications. While Convolutional Neural Networks (CNNs) have excelled in structured data like images, extending them to irregular 3D meshes is challenging due to the non-Euclidean nature of the data. Graph Convolutional Networks (GCNs) offer a solution by applying convolutions to graph-structured data, but many existing methods rely on isotropic filters or spectral decomposition, limiting their ability to capture both local and global mesh features. In this paper, we introduce 3D Geometric Mesh Network (3DGeoMeshNet), a novel GCN-based framework that uses anisotropic convolution layers to effectively learn both global and local features directly in the spatial domain. Unlike previous approaches that convert meshes into intermediate representations like voxel grids or point clouds, our method preserves the original polygonal mesh format throughout the reconstruction process, enabling more accurate shape reconstruction. Our architecture features a multi-scale encoder-decoder structure, where separate global and local pathways capture both large-scale geometric structures and fine-grained local details. Extensive experiments on the COMA dataset containing human faces demonstrate the efficiency of 3DGeoMeshNet in terms of reconstruction accuracy.
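For contrast with the anisotropic layers proposed here, the sketch below shows the kind of isotropic aggregation the abstract attributes to many existing methods: each vertex averages its own feature with those of its mesh neighbors using equal weights. It is not the 3DGeoMeshNet layer; the function names and the tiny two-triangle mesh are illustrative assumptions.

```python
def mesh_adjacency(num_vertices, faces):
    """Build per-vertex neighbor sets from triangular faces."""
    neighbors = [set() for _ in range(num_vertices)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    return neighbors


def isotropic_aggregate(features, neighbors):
    """Average each vertex feature with its neighbors' features.

    Every neighbor gets the same weight, so the filter cannot distinguish
    directions on the surface; that is what "isotropic" means here.
    """
    aggregated = []
    for vertex, nbrs in enumerate(neighbors):
        group = [features[vertex]] + [features[n] for n in sorted(nbrs)]
        dim = len(features[vertex])
        aggregated.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return aggregated


# Two triangles sharing the edge (0, 2), with one scalar feature per vertex.
faces = [(0, 1, 2), (0, 2, 3)]
features = [[0.0], [3.0], [6.0], [3.0]]
neighbors = mesh_adjacency(4, faces)
smoothed = isotropic_aggregate(features, neighbors)
```

On this toy mesh every vertex ends up with the same averaged value, which illustrates how equal-weight aggregation can wash out local detail.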

2025-07-07T07:36:03Z International Joint Conference on Neural Networks, Jun 2025, Rome, Italy Saqib Nazir UNICAEN Olivier Lézoray UNICAEN Sébastien Bougleux UNICAEN http://arxiv.org/abs/2412.14453v2 Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation 2025-07-07T03:14:46Z

Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and adaptability to various body shapes. Our project page: https://shengqiliu1.github.io/SewingLDM.

2024-12-19T02:05:28Z Our project page: https://shengqiliu1.github.io/SewingLDM Shengqi Liu Yuhao Cheng Zhuo Chen Xingyu Ren Wenhan Zhu Lincheng Li Mengxiao Bi Xiaokang Yang Yichao Yan http://arxiv.org/abs/2412.16776v2 DMesh++: An Efficient Differentiable Mesh for Complex Shapes 2025-07-06T23:21:51Z

Recent probabilistic methods for 3D triangular meshes capture diverse shapes by differentiable mesh connectivity, but face high computational costs with increased shape details. We introduce a new differentiable mesh processing method that addresses this challenge and efficiently handles meshes with intricate structures. Our method reduces time complexity from O(N) to O(log N) and requires significantly less memory than previous approaches. Building on this innovation, we present a reconstruction algorithm capable of generating complex 2D and 3D shapes from point clouds or multi-view images. Visit our project page (https://sonsang.github.io/dmesh2-project) for source code and supplementary material.

2024-12-21T21:16:03Z 20 pages, 24 figures, 6 tables Sanghyun Son Matheus Gadelha Yang Zhou Matthew Fisher Zexiang Xu Yi-Ling Qiao Ming C. Lin Yi Zhou http://arxiv.org/abs/2507.04285v1 SeqTex: Generate Mesh Textures in Video Sequence 2025-07-06T07:58:36Z

Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.

2025-07-06T07:58:36Z Ze Yuan HKU Xin Yu HKU Yangtian Sun HKU Yuan-Chen Guo VAST Yan-Pei Cao VAST Ding Liang VAST Xiaojuan Qi HKU http://arxiv.org/abs/2507.04236v1 AnnoGram: An Annotative Grammar of Graphics Extension 2025-07-06T03:57:41Z

Annotations are central to effective data communication, yet most visualization tools treat them as secondary constructs -- manually defined, difficult to reuse, and loosely coupled to the underlying visualization grammar. We propose a declarative extension to Wilkinson's Grammar of Graphics that reifies annotations as first-class design elements, enabling structured specification of annotation targets, types, and positioning strategies. To demonstrate the utility of our approach, we develop a prototype extension called Vega-Lite Annotation. Through comparison with eight existing tools, we show that our approach enhances expressiveness, reduces authoring effort, and enables portable, semantically integrated annotation workflows.

2025-07-06T03:57:41Z Md Dilshadur Rahman Md Rahat-uz-Zaman Andrew McNutt Paul Rosen http://arxiv.org/abs/2404.13445v3 DMesh: A Differentiable Mesh Representation 2025-07-06T02:05:50Z

We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first obtain a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and then select triangular faces on the tetrahedra to define the final mesh. We formulate the probability that each face exists on the actual surface in a differentiable manner based on the WDT. This enables DMesh to represent meshes of various topologies in a differentiable way, and allows us to reconstruct the mesh under various observations, such as point clouds and multi-view images, using gradient-based optimization. The source code and full paper are available at: https://sonsang.github.io/dmesh-project.

2024-04-20T18:52:51Z 36 pages, 24 figures. Updated with camera-ready version Sanghyun Son Matheus Gadelha Yang Zhou Zexiang Xu Ming C. Lin Yi Zhou http://arxiv.org/abs/2507.04147v1 A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality 2025-07-05T19:55:25Z

Virtual reality (VR) significantly transforms immersive digital interfaces, greatly enhancing education, professional practices, and entertainment by increasing user engagement and opening up new possibilities in various industries. Among its numerous applications, image rendering is crucial. Nevertheless, rendering methodologies like 3D Gaussian Splatting impose high computational demands, driven predominantly by user expectations for superior visual quality. This results in notable processing delays for real-time image rendering, which greatly affects the user experience. Additionally, VR devices such as head-mounted displays (HMDs) are intricately linked to human visual behavior, leveraging knowledge from perception and cognition to improve user experience. These insights have spurred the development of foveated rendering, a technique that dynamically adjusts rendering resolution based on the user's gaze direction. The resultant solution, known as gaze-tracked foveated rendering, significantly reduces the computational burden of the rendering process. Although gaze-tracked foveated rendering can reduce rendering costs, the computational overhead of the gaze tracking process itself can sometimes outweigh the rendering savings, leading to increased processing latency. To address this issue, we propose an efficient rendering framework called A3FR, designed to minimize the latency of gaze-tracked foveated rendering via the parallelization of gaze tracking and foveated rendering processes. For the rendering algorithm, we utilize 3D Gaussian Splatting, a state-of-the-art neural rendering technique. Evaluation results demonstrate that A3FR can reduce end-to-end rendering latency by up to 2x while maintaining visual quality.
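The benefit of running gaze tracking and rendering in parallel can be illustrated with a toy latency model. The sketch below is not the A3FR scheduler: the function names are hypothetical, and it assumes the two stages overlap perfectly with no synchronization cost.

```python
def sequential_latency(gaze_ms, render_ms):
    """Gaze tracking must finish before foveated rendering starts."""
    return gaze_ms + render_ms


def overlapped_latency(gaze_ms, render_ms):
    """Both stages run concurrently, so the slower one sets the latency.

    Assumption: perfect overlap with no synchronization overhead.
    """
    return max(gaze_ms, render_ms)


def speedup(gaze_ms, render_ms):
    """Latency ratio of the sequential schedule to the overlapped one."""
    return sequential_latency(gaze_ms, render_ms) / overlapped_latency(gaze_ms, render_ms)


balanced = speedup(gaze_ms=10.0, render_ms=10.0)     # both stages cost the same
render_bound = speedup(gaze_ms=2.0, render_ms=10.0)  # rendering dominates
```

In this model the gain is largest when the two stages take equal time and shrinks toward 1 as one stage dominates.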

2025-07-05T19:55:25Z ACM International Conference on Supercomputing 2025 Shuo Xin Haiyu Wang Sai Qian Zhang 10.1145/3721145.3735112 http://arxiv.org/abs/2507.04084v1 Attention-Guided Multi-Scale Local Reconstruction for Point Clouds via Masked Autoencoder Self-Supervised Learning 2025-07-05T16:17:49Z

Self-supervised learning has emerged as a prominent research direction in point cloud processing. While existing models predominantly concentrate on reconstruction tasks at higher encoder layers, they often neglect the effective utilization of low-level local features, which are typically employed solely for activation computations rather than directly contributing to reconstruction tasks. To overcome this limitation, we introduce PointAMaLR, a novel self-supervised learning framework that enhances feature representation and processing accuracy through attention-guided multi-scale local reconstruction. PointAMaLR implements hierarchical reconstruction across multiple local regions, with lower layers focusing on fine-scale feature restoration while upper layers address coarse-scale feature reconstruction, thereby enabling complex inter-patch interactions. Furthermore, to augment feature representation capabilities, we incorporate a Local Attention (LA) module in the embedding layer to enhance semantic feature understanding. Comprehensive experiments on benchmark datasets ModelNet and ShapeNet demonstrate PointAMaLR's superior accuracy and quality in both classification and reconstruction tasks. Moreover, when evaluated on the real-world dataset ScanObjectNN and the 3D large scene segmentation dataset S3DIS, our model achieves highly competitive performance metrics. These results not only validate PointAMaLR's effectiveness in multi-scale semantic understanding but also underscore its practical applicability in real-world scenarios.

2025-07-05T16:17:49Z 22 pages Xin Cao Haoyu Wang Yuzhu Mao Xinda Liu Linzhi Su Kang Li http://arxiv.org/abs/2507.03839v1 Participatory Evolution of Artificial Life Systems via Semantic Feedback 2025-07-04T23:51:50Z

We present a semantic feedback framework that enables natural language to guide the evolution of artificial life systems. Integrating a prompt-to-parameter encoder, a CMA-ES optimizer, and CLIP-based evaluation, the system allows user intent to modulate both visual outcomes and underlying behavioral rules. Implemented in an interactive ecosystem simulation, the framework supports prompt refinement, multi-agent interaction, and emergent rule synthesis. User studies show improved semantic alignment over manual tuning and demonstrate the system's potential as a platform for participatory generative design and open-ended evolution.

2025-07-04T23:51:50Z 10 pages Shuowen Li Kexin Wang Minglu Fang Danqi Huang Ali Asadipour Haipeng Mi Yitong Sun http://arxiv.org/abs/2507.03836v1 F-Hash: Feature-Based Hash Design for Time-Varying Volume Visualization via Multi-Resolution Tesseract Encoding 2025-07-04T23:23:26Z

Interactive time-varying volume visualization is challenging due to its complex spatiotemporal features and the sheer size of the datasets. Recent works transform the original discrete time-varying volumetric data into continuous Implicit Neural Representations (INR) to address the issues of compression, rendering, and super-resolution in both spatial and temporal domains. However, training the INR takes a long time to converge, especially when handling large-scale time-varying volumetric datasets. In this work, we propose F-Hash, a novel feature-based multi-resolution Tesseract encoding architecture that greatly improves convergence speed compared with existing input encoding methods for modeling time-varying volumetric data. The proposed design incorporates multi-level collision-free hash functions that map dynamic 4D multi-resolution embedding grids without bucket waste, achieving high encoding capacity with compact encoding parameters. Our encoding method is agnostic to time-varying feature detection methods, making it a unified encoding solution for feature tracking and evolution visualization. Experiments show that F-Hash achieves state-of-the-art convergence speed in training on various time-varying volumetric datasets for diverse features. We also propose an adaptive ray marching algorithm to optimize the sample streaming for faster rendering of the time-varying neural representation.

2025-07-04T23:23:26Z Jianxin Sun David Lenz Hongfeng Yu Tom Peterka http://arxiv.org/abs/2507.03731v1 3D PixBrush: Image-Guided Local Texture Synthesis 2025-07-04T17:38:34Z

We present 3D PixBrush, a method for performing image-driven edits of local regions on 3D meshes. 3D PixBrush predicts a localization mask and a synthesized texture that faithfully portray the object in the reference image. Our predicted localizations are both globally coherent and locally precise. Globally, our method contextualizes the object in the reference image and automatically positions it onto the input mesh. Locally, our method produces masks that conform to the geometry of the reference image. Notably, our method does not require any user input (in the form of scribbles or bounding boxes) to achieve accurate localizations. Instead, our method predicts a localization mask on the 3D mesh from scratch. To achieve this, we propose a modification to the score distillation sampling technique that incorporates both the predicted localization and the reference image, referred to as localization-modulated image guidance. We demonstrate the effectiveness of our proposed technique on a wide variety of meshes and images.

2025-07-04T17:38:34Z Dale Decatur Itai Lang Kfir Aberman Rana Hanocka