http://arxiv.org/abs/2606.05124v1 Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting 2026-06-03T17:29:36Z

After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show, by training with complete ground-truth texture and geometry information, that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time. We also propose a simple solution that adds a single geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and that complex scenes with transparent objects benefit especially from our method.
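
The decoupling described above can be illustrated with a minimal sketch. It assumes standard front-to-back alpha compositing along a single ray and gives each splat two opacities, one for appearance and one for geometry. The function name, the variable names, and the toy scene are hypothetical and are not taken from the paper.

```python
import numpy as np

def composite(values, alphas):
    """Front-to-back alpha compositing of per-splat values along one ray."""
    values = np.asarray(values, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance before each splat: product of (1 - alpha) of the splats in front.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return float(np.sum(weights * values))

# Toy scene (assumed): a glass pane at depth 1 in front of an opaque wall at depth 3.
depths = [1.0, 3.0]
colors = [0.2, 0.9]              # single-channel intensity for brevity
appearance_alpha = [0.1, 1.0]    # the glass is nearly see-through in color
geometry_alpha = [1.0, 1.0]      # but fully solid as a surface

color = composite(colors, appearance_alpha)
depth_shared = composite(depths, appearance_alpha)    # one opacity for both
depth_decoupled = composite(depths, geometry_alpha)   # separate geometry opacity
```

With a single shared opacity, the rendered depth is pulled toward the wall behind the glass; with a separate geometry opacity, the depth lands on the glass while the color is unchanged.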

2026-06-03T17:29:36Z Hongyu Zhou Zorah Lähner http://arxiv.org/abs/2606.05268v1 Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation 2026-06-03T16:50:49Z

We present a pipeline for building and aggregating task-specific, LLM-generated weak (imperfect) verifiers into a strong verifier for spatial layout domains. Given a task description, our pipeline asks an LLM to synthesize a collection of verifier programs using a layout verification DSL. Each individual LLM-generated verifier usually provides an imperfect check for a match between the layout and the corresponding task description. We show that by aggregating the responses of many such verifiers we can produce a stronger verifier. Moreover, by applying techniques from weak learning, our pipeline can learn how to aggregate the weak verifiers from a very sparse set of human-labeled example layouts (about 10). We find that the strong verifiers produced by our pipeline outperform the status-quo approach of using a set of LLM judges to directly check whether a layout matches a task description, raising F1-scores by up to 7X across a variety of 3D room layout and 2D poster design tasks. We also demonstrate that verifier-guided layout generation using natural language feedback from our strong verifiers improves layout quality of a base layout generator by up to 66.2% according to a human evaluator.
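
The abstract does not state the aggregation rule, so the following is only a sketch of one common weak-learning choice: a weighted vote in which each verifier's weight is the log-odds of its accuracy on the few labeled layouts. The functions `fit_weights` and `aggregate` and the toy data are hypothetical.

```python
import math

def fit_weights(votes, labels, smoothing=1.0):
    """Estimate one log-odds weight per verifier from a few labeled layouts.

    votes[i][j] is verifier j's verdict (True = pass) on layout i.
    """
    n_verifiers = len(votes[0])
    weights = []
    for j in range(n_verifiers):
        correct = sum(v[j] == y for v, y in zip(votes, labels))
        # Laplace smoothing keeps the accuracy strictly inside (0, 1).
        acc = (correct + smoothing) / (len(labels) + 2 * smoothing)
        weights.append(math.log(acc / (1 - acc)))
    return weights

def aggregate(vote, weights):
    """Weighted vote: each verifier contributes +w for pass and -w for fail."""
    score = sum(w if v else -w for v, w in zip(vote, weights))
    return score > 0

# Ten labeled layouts (assumed data): verifier 0 is always right,
# verifier 1 always says "pass", verifier 2 is always wrong.
labels = [True, False] * 5
votes = [(y, True, not y) for y in labels]
weights = fit_weights(votes, labels)
```

A verifier that is consistently wrong receives a negative weight, so its verdict is effectively inverted rather than discarded, and an uninformative verifier receives a weight of zero.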

2026-06-03T16:50:49Z Sharon Zhang R. Kenny Jones Jiajun Wu Maneesh Agrawala http://arxiv.org/abs/2606.05255v1 Oklch+: A Three-Parameter Extension of Oklab for Improved Color Difference Prediction 2026-06-03T15:43:24Z

Oklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD -- 3,813 suprathreshold color difference pairs spanning six independent experimental datasets -- Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.
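
The structure of the transform can be sketched as follows. The parameterization is an assumption: one exponent `gamma` for the lightness power law and two parameters `n` and `sigma` for a Naka-Rushton function of the form C^n / (C^n + sigma^n), with hue left unchanged. The numeric values are placeholders for illustration, not the fitted values from the paper.

```python
import math

# Placeholder parameters for illustration only; not the fitted values.
GAMMA, N, SIGMA = 1.0, 1.0, 0.2

def transform(L, a, b, gamma=GAMMA, n=N, sigma=SIGMA):
    """Map Oklab (L, a, b) to transformed coordinates (L', a', b')."""
    L_t = L ** gamma                    # power transformation on lightness
    C = math.hypot(a, b)                # Oklch chroma
    h = math.atan2(b, a)                # Oklch hue angle
    C_t = C**n / (C**n + sigma**n)      # Naka-Rushton compression, in [0, 1)
    return L_t, C_t * math.cos(h), C_t * math.sin(h)

def delta_e(lab1, lab2, **params):
    """Euclidean distance between two Oklab colors after the transform."""
    return math.dist(transform(*lab1, **params), transform(*lab2, **params))
```

Because the compressed chroma saturates below 1, equal steps in Oklab chroma produce progressively smaller distances at high chroma, which is the behavior the abstract attributes to the Naka-Rushton term.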

2026-06-03T15:43:24Z 3 figures, 8 tables. Submitted to Color Research & Application Naoyuki Uchida http://arxiv.org/abs/2603.28762v2 On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers 2026-06-03T15:05:21Z

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.

2026-03-30T17:59:13Z SIGGRAPH 2026. Project page: https://contextual-repulsion.github.io/ Omer Dahary Benaya Koren Daniel Garibi Daniel Cohen-Or http://arxiv.org/abs/2606.04621v1 MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer 2026-06-03T08:57:15Z

We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow

2026-06-03T08:57:15Z CVPR2026 Highlight, Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow Weiyu Li Antoine Toisoul Tom Monnier Roman Shapovalov Rakesh Ranjan Ping Tan Andrea Vedaldi http://arxiv.org/abs/2606.04527v1 Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation 2026-06-03T07:09:01Z

We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3 M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.

2026-06-03T07:09:01Z Website: https://echo-team-joy-future-academy-jd.github.io/Echo-Infinity/ Yuxuan Bian Zeyue Xue Songchun Zhang Shiyi Zhang Weiyang Jin Yaowei Li Junhao Zhuang Haoran Li Jie Huang Haoyang Huang Nan Duan Qiang Xu http://arxiv.org/abs/2606.04464v1 Homology-Preserving Dimensionality Reduction via Adaptive Mapper and Landmark Isomap 2026-06-03T05:20:21Z

As data becomes increasingly central across engineering and scientific disciplines, effective visualization is essential for interpreting complex, high-dimensional structures. Dimensionality reduction techniques project high-dimensional data into lower dimensions while aiming to preserve structural properties such as pairwise distances and local neighborhoods. In this paper, we focus on improving homological preservation, that is, the retention of topological features such as connected components and loops, which is critical for maintaining global shape and continuity. We first introduce AdaMapper, a Mapper-based algorithm that leverages persistence diagrams to guide both skeleton construction and landmark selection. AdaMapper incorporates an adaptive refinement strategy that automatically increases cover resolution in regions exhibiting topological loops. We then propose AdaHIsomap, which extends landmark Isomap by incorporating homology-informed landmark selection and augmenting it with random anchor points to better balance distance and homology preservation. We evaluate both methods on a diverse set of datasets, including high-dimensional point clouds, scientific simulations, networks, and image data, and benchmark their performance against state-of-the-art approaches.

2026-06-03T05:20:21Z Shakiba Khourashahi Ilia Jahanshahi Bei Wang Lin Yan http://arxiv.org/abs/2606.03746v2 Qwen-Image-Flash: Beyond Objective Design 2026-06-03T05:16:34Z

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.

2026-06-02T15:00:22Z Tianhe Wu Kun Yan Zikai Zhou Lihan Jiang Jiahao Li Jie Zhang Kaiyuan Gao Ningyuan Tang Shengming Yin Xiaoyue Chen Xiao Xu Yilei Chen Yuxiang Chen Yan Shu Yixian Xu Yanran Zhang Zihao Liu Zhendong Wang Zekai Zhang Deqing Li Liang Peng Yi Wang Jingren Zhou Chenfei Wu http://arxiv.org/abs/2606.04319v1 PureLight: Learning Complex Luminaires with Light Tracing 2026-06-03T00:48:27Z

We propose a neural formulation for estimating the appearance of complex luminaires. We focus on challenging luminaires with complex light transport (e.g., small emitters enclosed by multiple specular layers) that are difficult for (bidirectional) path tracing. To this end, we use light tracing to construct paths from emitters to the exit surfaces and formulate appearance estimation as a distribution learning problem. Specifically, we model the probability density function (pdf) of outgoing radiance on the exit surfaces using a large normalizing flow network, and recover the outgoing radiance as the product of the estimated pdf and flux. To enable efficient inference, we distill the learned appearance into a lightweight MLP that directly estimates radiance on the exit surfaces. We additionally train a sampling network for effective direct illumination computation from the luminaire, and a blending network to composite the luminaire into the scene. Our formulation makes it feasible to render challenging luminaires using low sample counts in arbitrary scenes.

2026-06-03T00:48:27Z 9 pages, 10 figures Pedro Figueiredo Zixuan Li Beibei Wang Miloš Hašan Nima Khademi Kalantari http://arxiv.org/abs/2606.06525v1 Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems 2026-06-02T20:34:11Z

Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of story counts encodes the vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions; and code translation agents generate executable SAP2000 scripts. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.
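
The plan-plus-stories representation can be sketched as a small function that expands gridlines and a per-cell story count into 3D joint coordinates. This is an illustration under assumptions: a uniform story height, cells indexed by their lower gridline, and a hypothetical function name `frame_nodes`. It is not the paper's node agent.

```python
def frame_nodes(x_grid, y_grid, stories, story_height):
    """Return the sorted (x, y, z) joints of a frame extruded from a plan grid.

    stories[i][j] is the number of stories over the cell bounded by
    x_grid[i]..x_grid[i + 1] and y_grid[j]..y_grid[j + 1].
    """
    nodes = set()
    for i, row in enumerate(stories):
        for j, n in enumerate(row):
            if n == 0:
                continue                        # empty cell: nothing is built here
            corners = [(x_grid[a], y_grid[b]) for a in (i, i + 1) for b in (j, j + 1)]
            for level in range(n + 1):          # level 0 is the base
                for x, y in corners:
                    nodes.add((x, y, level * story_height))
    return sorted(nodes)

# Two bays along x, one bay along y; the left bay has 2 stories, the right bay 1.
nodes = frame_nodes([0.0, 6.0, 12.0], [0.0, 5.0], [[2], [1]], 3.0)
```

Collecting joints in a set merges the nodes that neighboring cells share along a common gridline, so the step-back between the two bays produces no duplicates.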

2026-06-02T20:34:11Z Ziheng Geng Ian Franklin Santiago Martinez Jiachen Liu Yunhe Zhao Minghui Cheng http://arxiv.org/abs/2606.04108v1 SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation 2026-06-02T18:11:41Z

Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: even subtle violations of symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
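
Velocity symmetrization can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in: the state is a single 2D vector instead of voxel latents, the group is a cyclic rotation group acting by exact rotation matrices instead of a learned latent mapper, and `toy_velocity` is an arbitrary function instead of the flow model.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def symmetrize(velocity, x, fold):
    """Average g^-1 . v(g . x) over the `fold` rotations of the cyclic group."""
    total = np.zeros_like(x, dtype=float)
    for k in range(fold):
        g = rotation(2 * np.pi * k / fold)
        total += g.T @ velocity(g @ x)      # g.T is the inverse of a rotation
    return total / fold

def toy_velocity(x):
    """An arbitrary, non-symmetric stand-in for a predicted velocity."""
    return np.array([x[0] ** 2 - x[1], 0.5 * x[0] + 1.0])

x = np.array([0.3, -0.7])
h = rotation(2 * np.pi / 4)
lhs = symmetrize(toy_velocity, h @ x, 4)    # velocity at the rotated state
rhs = h @ symmetrize(toy_velocity, x, 4)    # rotated velocity at the original state
```

The averaged field commutes with every group element, so `lhs` and `rhs` agree up to floating-point error even though `toy_velocity` itself has no symmetry. Integrating such a field from a symmetric starting state keeps the state symmetric at every ODE step.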

2026-06-02T18:11:41Z Guangda Ji Qimin Chen Qinchan Li Mingrui Zhao Kai Wang Hao Zhang http://arxiv.org/abs/2606.06520v1 Applying Deep Learning for cockpit segmentation in the context of mixed reality 2026-06-02T16:43:24Z

Computer vision is a continuously growing field. With the advance of first-person-view technologies, new development opportunities have emerged within it. Mixed reality provides virtual environments in which objects from the physical world are shown in real time. This requires attention to the user's immersion in the simulated environment, bringing it increasingly closer to the desired reality. This paper proposes an image processing approach that segments images into foreground and background in order to facilitate the union of virtual and real images. To this end, the present work captures real images of a user operating the CAT793F off-highway truck simulator through a camera and segments them with artificial intelligence techniques. The convolutional neural network architectures "U-net" and "DeepLabV3+" are applied to perform image segmentation. As a result, metrics with around 90% accuracy were obtained and the best model was determined.

2026-06-02T16:43:24Z XXV Congresso Brasileiro de Automática - CBA 2024 Alexandre Leles Sousa Pedro de Oliveira Nielson Erick Oliveira Rodrigues Rafael Francisco dos Santos Giovani Bernardes Vitor 10.20906/CBA2024/4844 http://arxiv.org/abs/2606.03857v1 A Novel Procedural Generation for Level Design of Mansions and Dungeons 2026-06-02T16:32:27Z

Procedural Content Generation (PCG) has become an essential technique in game development due to its ability to reduce production time and cost while increasing replayability and variety. However, when not aligned with level design principles, PCG can lead to incoherent spatial structures and poor gameplay experiences. Objective: This work proposes a PCG method guided by level design principles to generate structured indoor environments - such as houses, mansions, and dungeons - aiming to ensure both architectural coherence and navigability. Methodology: The method is divided into three main stages: segmentation of the space using Binary Space Partitioning (BSP); logical connection of rooms based on graph traversal to prevent redundant links; and a post-processing stage responsible for cleaning structural artifacts and improving visual cohesion. The methodology allows parameterization of room area and shape, with randomness controlled via seeds for reproducibility. Results: Two experiments were conducted. The first demonstrated the flexibility of the methodology under different seeds and parameter configurations. The second evaluated the navigability of generated maps by verifying connectivity using Breadth-First Search (BFS). In this test, 100,000 maps were generated, and with suitable parameters, over 91% of them achieved complete connectivity.
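
The connectivity test in the second experiment can be sketched with a standard Breadth-First Search. The tile-grid encoding below ('.' for floor, '#' for wall) and the function name `fully_connected` are assumptions for illustration; the paper may run BFS on a room graph instead.

```python
from collections import deque

def fully_connected(grid):
    """Return True if every floor tile ('.') is reachable from every other one."""
    floor = {(r, c) for r, row in enumerate(grid) for c, ch in enumerate(row) if ch == "."}
    if not floor:
        return True
    start = next(iter(floor))
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        # Visit the four orthogonal neighbors that are floor tiles.
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in floor and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == floor

connected_map = ["#####", "#.#.#", "#...#", "#####"]
split_map = ["#####", "#.#.#", "#####"]
```

Here `fully_connected(connected_map)` is true because the corridor in the third row joins both rooms, while `split_map` fails because the wall between its two tiles has no opening.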

2026-06-02T16:32:27Z SBGAMES 2025 Isaac Fiuza Vieira Kathya Silvia Collazos Linares Esteban Walter Gonzalez Clua Érick Oliveira Rodrigues 10.5753/sbgames.2025.10089 http://arxiv.org/abs/2603.07664v3 Ref-DGS: Reflective Dual Gaussian Splatting 2026-06-02T16:23:29Z

The reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware specular adaptive mixing shader that fuses global and local specular features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.

2026-03-08T14:54:15Z Project page: https://njfan.github.io/Ref-DGS/ Ningjing Fan Yiqun Wang Dong-Ming Yan Peter Wonka http://arxiv.org/abs/2605.26006v2 MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control 2026-06-02T14:18:24Z

Enabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Project page: https://binlee26.github.io/MIND_page.

2026-05-25T16:23:10Z Bin Li Ruichi Zhang Han Liang Jingyan Zhang Juze Zhang Xin Chen Jingya Wang