https://arxiv.org/api/G9GwyFNgHmKvwzhyNIHWZngFmv42026-06-09T23:27:55Z93014515http://arxiv.org/abs/2606.05124v1Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting2026-06-03T17:29:36ZAfter the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inheritedly unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of dataset, and especially complex scenes with transparent objects benefit significantly from our method.2026-06-03T17:29:36ZHongyu ZhouZorah Lähnerhttp://arxiv.org/abs/2606.05268v1Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation2026-06-03T16:50:49ZWe present a pipeline for building and aggregating task-specific, LLM-generated weak (imperfect) verifiers into a strong verifier for spatial layout domains. Given a task description, our pipeline asks an LLM to synthesize a collection of verifier programs using a layout verification DSL. Each individual LLM-generated verifier usually provides an imperfect check for a match between the layout and the corresponding task description. We show that by aggregating the responses of many such verifiers we can produce a stronger verifier. Moreover, by applying techniques from weak learning, our pipeline can learn how to aggregate the weak verifiers from a very sparse set of human labeled example layouts (about 10). We find that the strong verifiers produced by our pipeline outperform the status-quo approach of using a set of LLM judges to directly check whether a layout matches a task description, raising F1-scores by up to 7X across a variety of 3D room layout and 2D poster design tasks. We also demonstrate that verifier-guided layout generation using natural language feedback from our strong verifiers improves layout quality of a base layout generator by up to 66.2% according to a human evaluator.2026-06-03T16:50:49ZSharon ZhangR. Kenny JonesJiajun WuManeesh Agrawalahttp://arxiv.org/abs/2606.05255v1Oklch+: A Three-Parameter Extension of Oklab for Improved Color Difference Prediction2026-06-03T15:43:24ZOklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD -- 3,813 suprathreshold color difference pairs spanning six independent experimental datasets -- Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.2026-06-03T15:43:24Z3 figures, 8 tables. Submitted to Color Research & ApplicationNaoyuki Uchidahttp://arxiv.org/abs/2603.28762v2On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers2026-06-03T15:05:21ZModern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.2026-03-30T17:59:13ZSIGGRAPH 2026. Project page: https://contextual-repulsion.github.io/Omer DaharyBenaya KorenDaniel GaribiDaniel Cohen-Orhttp://arxiv.org/abs/2606.04621v1MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer2026-06-03T08:57:15ZWe present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow2026-06-03T08:57:15ZCVPR2026 Highlight, Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflowWeiyu LiAntoine ToisoulTom MonnierRoman ShapovalovRakesh RanjanPing TanAndrea Vedaldihttp://arxiv.org/abs/2606.04527v1Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation2026-06-03T07:09:01ZWe present Echo Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Query, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3 M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.2026-06-03T07:09:01ZWebsite: https://echo-team-joy-future-academy-jd.github.io/Echo-Infinity/Yuxuan BianZeyue XueSongchun ZhangShiyi ZhangWeiyang JinYaowei LiJunhao ZhuangHaoran LiJie HuangHaoyang HuangNan DuanQiang Xuhttp://arxiv.org/abs/2606.04464v1Homology-Preserving Dimensionality Reduction via Adaptive Mapper and Landmark Isomap2026-06-03T05:20:21ZAs data becomes increasingly central across engineering and scientific disciplines, effective visualization is essential for interpreting complex, high-dimensional structures. Dimensionality reduction techniques project high-dimensional data into lower dimensions while aiming to preserve structural properties such as pairwise distances and local neighborhoods. In this paper, we focus on improving homological preservation, that is, the retention of topological features such as connected components and loops, which is critical for maintaining global shape and continuity. We first introduce AdaMapper, a Mapper-based algorithm that leverages persistence diagrams to guide both skeleton construction and landmark selection. AdaMapper incorporates an adaptive refinement strategy that automatically increases cover resolution in regions exhibiting topological loops. We then propose AdaHIsomap, which extends landmark Isomap by incorporating homology-informed landmark selection and augmenting it with random anchor points to better balance distance and homology preservation. We evaluate both methods on a diverse set of datasets, including high-dimensional point clouds, scientific simulations, networks, and image data, and benchmark their performance against state-of-the-art approaches.2026-06-03T05:20:21ZShakiba KhourashahiIlia JahanshahiBei WangLin Yanhttp://arxiv.org/abs/2606.03746v2Qwen-Image-Flash: Beyond Objective Design2026-06-03T05:16:34ZFew-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.2026-06-02T15:00:22ZTianhe WuKun YanZikai ZhouLihan JiangJiahao LiJie ZhangKaiyuan GaoNingyuan TangShengming YinXiaoyue ChenXiao XuYilei ChenYuxiang ChenYan ShuYixian XuYanran ZhangZihao LiuZhendong WangZekai ZhangDeqing LiLiang PengYi WangJingren ZhouChenfei Wuhttp://arxiv.org/abs/2606.04319v1PureLight: Learning Complex Luminaires with Light Tracing2026-06-03T00:48:27ZWe propose a neural formulation for estimating the appearance of complex luminaires. We focus on challenging luminaires with complex light transport (e.g., small emitters enclosed by multiple specular layers) that are difficult for (bidirectional) path tracing. To this end, we use light tracing to construct paths from emitters to the exit surfaces and formulate appearance estimation as a distribution learning problem. Specifically, we model the probability density function (pdf) of outgoing radiance on the exit surfaces using a large normalizing flow network, and recover the outgoing radiance as the product of the estimated pdf and flux. To enable efficient inference, we distill the learned appearance into a lightweight MLP that directly estimates radiance on the exit surfaces. We additionally train a sampling network for effective direct illumination computation from the luminaire, and a blending network to composite the luminaire into the scene. Our formulation makes it feasible to render challenging luminaires using low sample counts in arbitrary scenes.2026-06-03T00:48:27Z9 pages, 10 figuresPedro FigueiredoZixuan LiBeibei WangMiloš HašanNima Khademi Kalantarihttp://arxiv.org/abs/2606.06525v1Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems2026-06-02T20:34:11ZLarge language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of number of stories encodes vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions, and code translation agents generate executable SAP2000 script. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.2026-06-02T20:34:11ZZiheng GengIan FranklinSantiago MartinezJiachen LiuYunhe ZhaoMinghui Chenghttp://arxiv.org/abs/2606.04108v1SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation2026-06-02T18:11:41ZSingle-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.2026-06-02T18:11:41ZGuangda JiQimin ChenQinchan LiMingrui ZhaoKai WangHao Zhanghttp://arxiv.org/abs/2606.06520v1Applying Deep Learning for cockpit segmentation in the context of mixed reality2026-06-02T16:43:24ZComputer vision is an area that has been growing continuously. With the advance of technologies with a first-person view, new development opportunities have emerged inside the area. Mixed reality promotes virtual environments with objects from the physical world shown in real time. For that, it's necessary to be concerned with the immersion of the user in this simulated environment, increasingly seeking to bring it closer to a possible desired reality. This paper proposes the development of image processing in order to perform the segmentation of images to identify what is foreground and background in order to facilitate the union of virtual and real images. Thus, the present work obtain real images of the user using the off-highway truck simulator CAT793F, through a camera, to be able to perform the segmentation of such images with artificial intelligence techniques.The convolutional neural network architectures "U-net" and "DeepLabV3+" are applied to perform image segmentation. As a result, metrics with around 90% accuracy were presented and and the best model was determined.2026-06-02T16:43:24ZXXV Congresso Brasileiro de Automática - CBA 2024Alexandre Leles SousaPedro de Oliveira NielsonErick Oliveira RodriguesRafael Francisco dos SantosGiovani Bernardes Vitor10.20906/CBA2024/4844http://arxiv.org/abs/2606.03857v1A Novel Procedural Generation for Level Design of Mansions and Dungeons2026-06-02T16:32:27ZProcedural Content Generation (PCG) has become an essential technique in game development due to its ability to reduce production time and cost while increasing replayability and variety. However, when not aligned with level design principles, PCG can lead to incoherent spatial structures and poor gameplay experiences. Objective: This work proposes a PCG method guided by level design principles to generate structured indoor environments - such as houses, mansions, and dungeons - aiming to ensure both architectural coherence and navigability. Methodology: The method is divided into three main stages: segmentation of the space using Binary Space Partitioning (BSP); logical connection of rooms based on graph traversal to prevent redundant links; and a post-processing stage responsible for cleaning structural artifacts and improving visual cohesion. The methodology allows parameterization of room area and shape, with randomness controlled via seeds for reproducibility. Results: Two experiments were conducted. The first demonstrated the flexibility of the methodology under different seeds and parameter configurations. The second evaluated the navigability of generated maps by verifying connectivity using Breadth-First Search (BFS). In this test, 100,000 maps were generated, and with suitable parameters, over 91% of them achieved complete connectivity.2026-06-02T16:32:27ZSBGAMES 2025Isaac Fiuza VieiraKathya Silvia Collazos LinaresEsteban Walter Gonzalez CluaÉrick Oliveira Rodrigues10.5753/sbgames.2025.10089http://arxiv.org/abs/2603.07664v3Ref-DGS: Reflective Dual Gaussian Splatting2026-06-02T16:23:29ZThe reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present \textbf{Ref-DGS}, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware specular adaptive mixing shader that fuses global and local specular features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.2026-03-08T14:54:15ZProject page: https://njfan.github.io/Ref-DGS/Ningjing FanYiqun WangDong-Ming YanPeter Wonkahttp://arxiv.org/abs/2605.26006v2MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control2026-06-02T14:18:24ZEnabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Project page: https://binlee26.github.io/MIND_page.2026-05-25T16:23:10ZBin LiRuichi ZhangHan LiangJingyan ZhangJuze ZhangXin ChenJingya Wang