https://arxiv.org/api/6qSLPk1BPYbbms58U/YEzaSug7Y 2026-06-26T13:19:09Z 9390 1560 15 http://arxiv.org/abs/2509.20858v1 ArchGPT: Understanding the World's Architectures with Large Multimodal Models 2025-09-25T07:49:43Z

Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.

2025-09-25T07:49:43Z Yuze Wang Luo Yang Junyi Wang Yue Qi http://arxiv.org/abs/2509.20824v1 ARMesh: Autoregressive Mesh Generation via Next-Level-of-Detail Prediction 2025-09-25T07:12:02Z

Directly generating 3D meshes, the default representation for 3D shapes in the graphics industry, using auto-regressive (AR) models has recently become popular, thanks to the sharpness and compactness of the generated results and the ability to represent various types of surfaces. However, AR mesh generative models typically construct meshes face by face in lexicographic order, which does not effectively capture the underlying geometry in a manner consistent with human perception. Inspired by 2D models that progressively refine images, such as the prevailing next-scale prediction AR models, we propose generating meshes auto-regressively in a progressive coarse-to-fine manner. Specifically, we view mesh simplification algorithms, which gradually merge mesh faces to build simpler meshes, as a natural fine-to-coarse process. Therefore, we generalize meshes to simplicial complexes and develop a transformer-based AR model to approximate the reverse process of simplification in the order of level of detail, constructing meshes initially from a single point and gradually adding geometric details through local remeshing, where the topology is not predefined and is alterable. Our experiments show that this novel progressive mesh generation approach not only provides intuitive control over generation quality and time consumption by early stopping the auto-regressive process but also enables applications such as mesh refinement and editing.
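The fine-to-coarse view of simplification described above can be illustrated with a minimal Python sketch. It is not the ARMesh model: it uses a plain shortest-edge collapse (an assumption chosen for brevity), it stops once no triangles remain instead of reducing to a single point, and the names `edges_of`, `collapse_shortest_edge`, and `lod_sequence` are hypothetical. The sketch records every simplified level and then reverses the list, which gives the coarse-to-fine ordering that a next-level-of-detail model would be trained to predict.

```python
import math

def edges_of(faces):
    """Collect the undirected edges of a triangle list."""
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return edges

def collapse_shortest_edge(verts, faces):
    """Merge the endpoints of the shortest edge into their midpoint."""
    u, v = min(edges_of(faces),
               key=lambda e: math.dist(verts[e[0]], verts[e[1]]))
    new_verts = dict(verts)
    new_verts[u] = tuple((p + q) / 2 for p, q in zip(verts[u], verts[v]))
    del new_verts[v]
    new_faces = []
    for face in faces:
        remapped = tuple(u if i == v else i for i in face)
        if len(set(remapped)) == 3:  # drop triangles that became degenerate
            new_faces.append(remapped)
    return new_verts, new_faces

def lod_sequence(verts, faces):
    """Simplify until no faces remain, then return the levels coarse to fine."""
    levels = [(verts, faces)]
    while faces:
        verts, faces = collapse_shortest_edge(verts, faces)
        levels.append((verts, faces))
    return levels[::-1]
```

Calling `lod_sequence` on a tetrahedron returns three levels whose face counts grow from 0 to 4; truncating the list early corresponds to the early-stopping control mentioned in the abstract.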

2025-09-25T07:12:02Z NeurIPS 2025, Project Page: https://jblei.site/proj/armesh Jiabao Lei Kewei Shi Zhihao Liang Kui Jia http://arxiv.org/abs/2506.07020v2 CrossGen: Learning and Generating Cross Fields for Quad Meshing 2025-09-25T03:35:20Z

Cross fields play a critical role in various geometry processing tasks, especially for quad mesh generation. Existing methods for cross field generation often struggle to balance computational efficiency with generation quality, relying on slow per-shape optimization. We introduce CrossGen, a novel framework that supports both feed-forward prediction and latent generative modeling of cross fields for quad meshing by unifying geometry and cross field representations within a joint latent space. Our method enables extremely fast computation of high-quality cross fields of general input shapes, typically within one second without per-shape optimization. Our method assumes a point-sampled surface, also called a point-cloud surface, as input, so we can accommodate various surface representations by a straightforward point sampling process. Using an auto-encoder network architecture, we encode input point-cloud surfaces into a sparse voxel grid with fine-grained latent spaces, which are decoded into both SDF-based surface geometry and cross fields (see the teaser figure). We also contribute a dataset of models with both high-quality signed distance field (SDF) representations and their corresponding cross fields, and use it to train our network. Once trained, the network is capable of computing a cross field of an input surface in a feed-forward manner, ensuring high geometric fidelity, noise resilience, and rapid inference. Furthermore, leveraging the same unified latent representation, we incorporate a diffusion model for computing cross fields of new shapes generated from partial input, such as sketches. To demonstrate its practical applications, we validate CrossGen on the quad mesh generation task for a large variety of surface shapes. Experimental results...
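As background on the object being predicted: a cross at a surface point is a set of four tangent directions spaced 90 degrees apart, so rotating it by a quarter turn leaves it unchanged. The Python sketch below shows one common way to encode that symmetry, mapping the angle θ of any arm to (cos 4θ, sin 4θ). It is a general illustration and is not taken from CrossGen; the function names are hypothetical and angles are assumed to be measured in a local 2D tangent frame.

```python
import math

def encode_cross(theta):
    """Map a cross angle to a vector that is invariant under 90-degree turns."""
    return (math.cos(4 * theta), math.sin(4 * theta))

def decode_cross(vec):
    """Recover the four arm angles (radians, in [0, 2*pi)) of the cross."""
    base = math.atan2(vec[1], vec[0]) / 4
    return sorted((base + k * math.pi / 2) % (2 * math.pi) for k in range(4))

def cross_difference(theta_a, theta_b):
    """Smallest rotation (radians) that aligns two crosses, in [0, pi/4]."""
    d = (theta_a - theta_b) % (math.pi / 2)
    return min(d, math.pi / 2 - d)
```

Because the encoding is continuous and ignores which of the four arms was picked, two crosses can be compared or averaged without first matching their arms.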

2025-06-08T07:01:00Z SIGGRAPH Asia 2025 Journal Track; Project page: https://qiujiedong.github.io/publications/CrossGen/ Qiujie Dong Jiepeng Wang Rui Xu Cheng Lin Yuan Liu Shiqing Xin Zichun Zhong Xin Li Changhe Tu Taku Komura Leif Kobbelt Scott Schaefer Wenping Wang 10.1145/3763299 http://arxiv.org/abs/2509.20710v1 ArtUV: Artist-style UV Unwrapping 2025-09-25T03:21:21Z

UV unwrapping is an essential task in computer graphics, enabling various visual editing operations in rendering pipelines. However, existing UV unwrapping methods suffer from long runtimes, fragmentation, a lack of semantic meaning, and irregular UV islands, limiting their practical use. An artist-style UV map must not only satisfy fundamental criteria, such as overlap-free mapping and minimal distortion, but also uphold higher-level standards, including clean boundaries, efficient space utilization, and semantic coherence. We introduce ArtUV, a fully automated, end-to-end method for generating artist-style UV unwrapping. We simulate the professional UV mapping process by dividing it into two stages: surface seam prediction and artist-style UV parameterization. In the seam prediction stage, SeamGPT is used to generate semantically meaningful cutting seams. Then, in the parameterization stage, a rough UV obtained from an optimization-based method, along with the mesh, is fed into an Auto-Encoder, which refines it into an artist-style UV map. Our method ensures semantic consistency and preserves topological structure, making the UV map ready for 2D editing. We evaluate ArtUV across multiple benchmarks and show that it serves as a versatile solution, functioning seamlessly as either a plug-in for professional rendering tools or as a standalone system for rapid, high-quality UV generation.

2025-09-25T03:21:21Z Yuguang Chen Xinhai Liu Yang Li Victor Cheung Zhuo Chen Dongyu Zhang Chunchao Guo http://arxiv.org/abs/2504.02045v4 Generating 360° Video is What You Need For a 3D Scene 2025-09-25T03:04:40Z

Generating 3D scenes is still a challenging task due to the lack of readily available scene data. Most existing methods only produce partial scenes and provide limited navigational freedom. We introduce a practical and scalable solution that uses 360° video as an intermediate scene representation, capturing the full-scene context and ensuring consistent visual content throughout the generation. We propose WorldPrompter, a generative pipeline that synthesizes traversable 3D scenes from text prompts. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model, trained with a mix of image and video data, achieves convincing spatial and temporal consistency for static scenes. This is validated by an average COLMAP matching rate of 94.6%, allowing for high-quality panoramic Gaussian splat reconstruction and improved navigation throughout the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

2025-04-02T18:04:32Z SIGGRAPH Asia 2025. Project Page: https://zhaoyangzh.github.io/projects/worldprompter/ Zhaoyang Zhang Yannick Hold-Geoffroy Miloš Hašan Ziwen Chen Fujun Luan Julie Dorsey Yiwei Hu http://arxiv.org/abs/2509.20198v1 LidarScout: Direct Out-of-Core Rendering of Massive Point Clouds 2025-09-24T14:53:52Z

Large-scale terrain scans are the basis for many important tasks, such as topographic mapping, forestry, agriculture, and infrastructure planning. The resulting point cloud data sets are so massive in size that even basic tasks like viewing take hours to days of pre-processing in order to create level-of-detail structures that allow inspecting the data set in its entirety in real time. In this paper, we propose a method that is capable of instantly visualizing massive country-sized scans with hundreds of billions of points. Upon opening the data set, we first load a sparse subsample of points and initialize an overview of the entire point cloud, immediately followed by a surface reconstruction process to generate higher-quality, hole-free heightmaps. As users start navigating towards a region of interest, we continue to prioritize the heightmap construction process to the user's viewpoint. Once a user zooms in closely, we load the full-resolution point cloud data for that region and update the corresponding heightmap textures with the full-resolution data. As users navigate elsewhere, full-resolution point data that is no longer needed is unloaded, but the updated heightmap textures are retained as a form of medium level of detail. Overall, our method constitutes a form of direct out-of-core rendering for massive point cloud data sets (terabytes, compressed) that requires no preprocessing and no additional disk space. Source code, executable, pre-trained model, and dataset are available at: https://github.com/cg-tuwien/lidarscout
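The first two steps above (a sparse subsample turned into a hole-free heightmap) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LidarScout's implementation: points are plain `(x, y, z)` tuples, the grid is square, and a simple nearest-neighbor fill stands in for the surface reconstruction step that the paper performs. The function names `build_heightmap` and `fill_holes` are hypothetical.

```python
def build_heightmap(points, size, cell):
    """Average the z of sparse (x, y, z) samples per cell; None marks a hole."""
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for x, y, z in points:
        col, row = int(x // cell), int(y // cell)
        if 0 <= row < size and 0 <= col < size:
            sums[row][col] += z
            counts[row][col] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else None
             for c in range(size)] for r in range(size)]

def fill_holes(heightmap):
    """Fill each empty cell with the height of the nearest non-empty cell."""
    size = len(heightmap)
    known = [(r, c) for r in range(size) for c in range(size)
             if heightmap[r][c] is not None]
    if not known:
        return heightmap  # nothing to interpolate from
    filled = [row[:] for row in heightmap]
    for r in range(size):
        for c in range(size):
            if filled[r][c] is None:
                nr, nc = min(known,
                             key=lambda k: (k[0] - r) ** 2 + (k[1] - c) ** 2)
                filled[r][c] = heightmap[nr][nc]
    return filled
```

When full-resolution points for a region later become available, the same `build_heightmap` call on that region's points could overwrite the corresponding cells, mirroring the texture update the abstract describes.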

2025-09-24T14:53:52Z Published at High-Performance Graphics 2025 High-Performance Graphics - Symposium Papers. The Eurographics Association, 2025 Philipp Erler Lukas Herzberger Michael Wimmer Markus Schütz 10.2312/hpg.20251170 http://arxiv.org/abs/2509.19939v1 AJAHR: Amputated Joint Aware 3D Human Mesh Recovery 2025-09-24T09:46:10Z

Existing human mesh recovery methods assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss. This assumption introduces bias when applied to individuals with amputations - a limitation further exacerbated by the scarcity of suitable datasets. To address this gap, we propose Amputated Joint Aware 3D Human Mesh Recovery (AJAHR), which is an adaptive pose estimation framework that improves mesh reconstruction for individuals with limb loss. Our model integrates a body-part amputation classifier, jointly trained with the mesh recovery network, to detect potential amputations. We also introduce Amputee 3D (A3D), which is a synthetic dataset offering a wide range of amputee poses for robust training. While maintaining competitive performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals. Additional materials can be found at the project webpage.

2025-09-24T09:46:10Z 8 pages, Project Page: https://chojinie.github.io/project_AJAHR/ Hyunjin Cho Giyun Choi Jongwon Choi http://arxiv.org/abs/2509.08643v2 X-Part: high fidelity and structure coherent shape decomposition 2025-09-24T02:57:21Z

Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and struggle to produce semantically meaningful decompositions. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part uses bounding boxes as prompts for part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Codes will be released for public research.

2025-09-10T14:37:02Z Tech Report, Project Page: https://yanxinhao.github.io/Projects/X-Part/ Xinhao Yan Jiachen Xu Yang Li Changfeng Ma Yunhan Yang Chunshi Wang Zibo Zhao Zeqiang Lai Yunfei Zhao Zhuo Chen Chunchao Guo http://arxiv.org/abs/2509.20400v1 SeHDR: Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing 2025-09-23T18:28:13Z

This paper presents SeHDR, a novel high dynamic range 3D Gaussian Splatting (HDR-3DGS) approach for generating HDR novel views given multi-view LDR images. Unlike existing methods that typically require the multi-view LDR input images to be captured from different exposures, which are tedious to capture and more likely to suffer from errors (e.g., object motion blur and calibration/alignment inaccuracies), our approach learns the HDR scene representation from multi-view LDR images of a single exposure. Our key insight into this ill-posed problem is that by first estimating Bracketed 3D Gaussians (i.e., with different exposures) from single-exposure multi-view LDR images, we may then be able to merge these bracketed 3D Gaussians into an HDR scene representation. Specifically, SeHDR first learns base 3D Gaussians from single-exposure LDR inputs, where the spherical harmonics parameterize colors in a linear color space. We then estimate multiple 3D Gaussians with identical geometry but varying linear colors conditioned on exposure manipulations. Finally, we propose the Differentiable Neural Exposure Fusion (NeEF) to integrate the base and estimated 3D Gaussians into HDR Gaussians for novel view rendering. Extensive experiments demonstrate that SeHDR outperforms existing methods as well as carefully designed baselines.
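To make the idea of merging bracketed exposures concrete, here is a minimal Python sketch of classic weighted exposure merging in a linear color space. It is not NeEF, which is a learned, differentiable fusion; the hat-shaped weight, the scalar per-pixel values, and the function names `expose`, `hat_weight`, and `merge_brackets` are all assumptions made for illustration.

```python
def expose(radiance, gain):
    """Simulate an LDR observation: scale linear radiance and clip to [0, 1]."""
    return min(max(radiance * gain, 0.0), 1.0)

def hat_weight(value):
    """Trust mid-range values most; clipped or black values get zero weight."""
    return 1.0 - abs(2.0 * value - 1.0)

def merge_brackets(observations, gains):
    """Weighted average of the per-exposure radiance estimates (value / gain)."""
    total = sum(hat_weight(v) for v in observations)
    if total == 0.0:
        # Every bracket is fully clipped or black; fall back to a plain mean.
        return sum(v / g for v, g in zip(observations, gains)) / len(gains)
    return sum(hat_weight(v) * v / g
               for v, g in zip(observations, gains)) / total
```

With gains of 0.25, 1, and 4, a radiance of 2.0 saturates the two brighter brackets but is still recovered from the darkest one, which is the reason several exposures are needed to cover a high dynamic range.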

2025-09-23T18:28:13Z ICCV 2025 accepted paper Yiyu Li Haoyuan Wang Ke Xu Gerhard Petrus Hancke Rynson W. H. Lau http://arxiv.org/abs/2509.19296v1 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation 2025-09-23T17:58:01Z

The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.

2025-09-23T17:58:01Z Project Page: https://research.nvidia.com/labs/toronto-ai/lyra/ Sherwin Bahmani Tianchang Shen Jiawei Ren Jiahui Huang Yifeng Jiang Haithem Turki Andrea Tagliasacchi David B. Lindell Zan Gojcic Sanja Fidler Huan Ling Jun Gao Xuanchi Ren http://arxiv.org/abs/2509.19412v1 EngravingGNN: A Hybrid Graph Neural Network for End-to-End Piano Score Engraving 2025-09-23T14:48:35Z

This paper focuses on automatic music engraving, i.e., the creation of a human-readable musical score from musical content. This step is fundamental for all applications that include a human player, but it remains a mostly unexplored topic in symbolic music processing. In this work, we formalize the problem as a collection of interdependent subtasks, and propose a unified graph neural network (GNN) framework that targets the case of piano music and quantized symbolic input. Our method employs a multi-task GNN to jointly predict voice connections, staff assignments, pitch spelling, key signature, stem direction, octave shifts, and clef signs. A dedicated postprocessing pipeline generates print-ready MusicXML/MEI outputs. Comprehensive evaluation on two diverse piano corpora (J-Pop and DCML Romantic) demonstrates that our unified model achieves good accuracy across all subtasks, compared to existing systems that only specialize in specific subtasks. These results indicate that a shared GNN encoder with lightweight task-specific decoders in a multi-task setting offers a scalable and effective solution for automatic music engraving.

2025-09-23T14:48:35Z Accepted at the International Conference on Technologies for Music Notation and Representation (TENOR) 2025 Emmanouil Karystinaios Francesco Foscarin Gerhard Widmer http://arxiv.org/abs/2509.18948v1 One-shot Embroidery Customization via Contrastive LoRA Modulation 2025-09-23T12:58:15Z

Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricate interplay of diverse stitch patterns and material properties, which poses unique challenges for existing style transfer methods. To explore the customization for such fine-grained features, we propose a novel contrastive learning framework that disentangles fine-grained style and content features with a single reference image, building on the classic concept of image analogy. We first construct an image pair to define the target style, and then adopt a similarity metric based on the decoupled representations of pretrained diffusion models for style-content separation. Subsequently, we propose a two-stage contrastive LoRA modulation technique to capture fine-grained style features. In the first stage, we iteratively update the whole LoRA and the selected style blocks to initially separate style from content. In the second stage, we design a contrastive learning strategy to further decouple style and content through self-knowledge distillation. Finally, we build an inference pipeline to handle image or text inputs with only the style blocks. To evaluate our method on fine-grained style transfer, we build a benchmark for embroidery customization. Our approach surpasses prior methods on this task and further demonstrates strong generalization to three additional domains: artistic style transfer, sketch colorization, and appearance transfer.

2025-09-23T12:58:15Z Accepted to ACM Transactions on Graphics (TOG), SIGGRAPH Asia 2025 Jun Ma Qian He Gaofeng He Huang Chen Chen Liu Xiaogang Jin Huamin Wang http://arxiv.org/abs/2507.14920v2 Time Series Information Visualization -- A Review of Approaches and Tools 2025-09-23T08:53:17Z

Time series data are prevalent across various domains and often encompass large datasets containing multiple time-dependent features in each sample. Exploring time-varying data is critical for data science practitioners aiming to understand dynamic behaviors and discover periodic patterns and trends. However, the analysis of such data often requires sophisticated procedures and tools. Information visualization is a communication channel that leverages human perceptual abilities to transform abstract data into visual representations. Visualization techniques have been successfully applied in the context of time series to enhance interpretability by graphically representing the temporal evolution of data. The challenge for information visualization developers lies in integrating a wide range of analytical tools into rich visualization systems that can summarize complex datasets while clearly describing the impacts of the temporal component. Such systems enable data scientists to turn raw data into understandable and potentially useful knowledge. This review examines techniques and approaches designed for handling time series data, guiding users through knowledge discovery processes based on visual analysis. We also provide readers with theoretical insights and design guidelines to consider when developing comprehensive information visualization approaches for time series, with a particular focus on time series with multiple features. As a result, we highlight the challenges and future research directions to address open questions in the visualization of time-dependent data.

2025-07-20T11:28:47Z This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2025.3609404 Evandro S. Ortigossa Fábio F. Dias Diego C. Nascimento Luis Gustavo Nonato 10.1109/ACCESS.2025.3609404 http://arxiv.org/abs/2509.18498v1 null2: Boundary-Dissolving Bodies and Architecture towards Digital Nature 2025-09-23T01:02:43Z

This paper presents a case study of the thematic pavilion null2 at Expo 2025 Osaka-Kansai, contrasting it with the static Jomon motifs of Taro Okamoto's Tower of the Sun from Expo 1970. The study discusses the Yayoi-inspired mirror motifs and the dynamically transforming, interactive spatial configuration of null2, where visitors become integrated as experiential content. The shift from static representation to a new ontological and aesthetic model, characterized by the visitor's body merging in real-time with architectural space at installation scale, is analyzed. Referencing the philosophical context of the Expo 1970 theme 'Progress and Harmony for Mankind,' this research reconsiders the worldview articulated by null2 in Expo 2025, in which computation is naturalized and ubiquitous, through its intersection with Eastern philosophical traditions. It investigates how immersive experiences within the pavilion, grounded in the philosophical framework of Digital Nature, reinterpret traditional spatial and structural motifs of the tea room, positioning them within contemporary digital art discourse. The aim is to contextualize and document null2 as an important contemporary case study from Expo practices, considering the historical and social background in Japan from the 19th to 21st century, during which world expositions served as pivotal points for the birth of the modern Japanese concept of 'fine art,' symbolic milestones of economic development, and key moments in urban and media culture formation. Furthermore, this paper academically organizes architectural techniques, computer graphics methodologies, media art practices, and theoretical backgrounds utilized in null2, highlighting the scholarly significance of preserving these as an archival document for future generations.

2025-09-23T01:02:43Z 12 pages Yoichi Ochiai http://arxiv.org/abs/2509.18461v1 Zero-Shot Visual Deepfake Detection: Can AI Predict and Prevent Fake Content Before It's Created? 2025-09-22T22:33:16Z

Generative adversarial networks (GANs) and diffusion models have dramatically advanced deepfake technology, and its threats to digital security, media integrity, and public trust have increased rapidly. This research explored zero-shot deepfake detection, an emerging approach that aims to detect deepfakes even when the model has never seen a particular deepfake variation. In this work, we studied self-supervised learning, transformer-based zero-shot classifiers, generative model fingerprinting, and meta-learning techniques that better adapt to the ever-evolving deepfake threat. In addition, we suggested AI-driven prevention strategies that targeted the underlying generation pipeline of deepfakes before they occurred. They consisted of adversarial perturbations for creating deepfake generators, digital watermarking for content authenticity verification, real-time AI monitoring for content creation pipelines, and blockchain-based content verification frameworks. Despite these advancements, zero-shot detection and prevention faced critical challenges such as adversarial attacks, scalability constraints, ethical dilemmas, and the absence of standardized evaluation benchmarks. These limitations were addressed by discussing future research directions on explainable AI for deepfake detection, multimodal fusion based on image, audio, and text analysis, quantum AI for enhanced security, and federated learning for privacy-preserving deepfake detection. This further highlighted the need for an integrated defense framework for digital authenticity that utilized zero-shot learning in combination with preventive deepfake mechanisms. Finally, we highlighted the important role of interdisciplinary collaboration between AI researchers, cybersecurity experts, and policymakers to create resilient defenses against the rising tide of deepfake attacks.

2025-09-22T22:33:16Z Published in Foundations and Trends in Signal Processing (#1 in Signal Processing, #3 in Computer Science) Foundations and Trends in Signal Processing (2025) Ayan Sar Sampurna Roy Tanupriya Choudhury Ajith Abraham 10.1561/2000000136