https://arxiv.org/api/QgQSMEb1yyM34EibJYFmU/LtLRg2026-06-14T07:22:22Z19571631515http://arxiv.org/abs/2606.11446v13D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling2026-06-09T20:57:03ZThis research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent 'semantic gap' in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored.
We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework's efficacy, achieving a concept prediction accuracy of 88.8\% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically-steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.2026-06-09T20:57:03ZAhmad Al-Kabbanyhttp://arxiv.org/abs/2605.20436v2Lighting-aware Unified Model for Instance Segmentation2026-06-09T20:31:56ZFoundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing \textit{Lighting Convolutional-Attention (\lca{})}, an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. \lca{} employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize \lca{} through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.2026-05-19T19:35:51ZQisai LiuAlloy DasZhanhong JiangJoshua R. WaiteAditya BaluAdarsh KrishnamurthySoumik Sarkarhttp://arxiv.org/abs/2606.11390v1A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting2026-06-09T19:15:39ZGaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.2026-06-09T19:15:39Z14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary materialMatthew CongFrancis WilliamsJonathan SwartzMark HarrisSanja FidlerKen Musethhttp://arxiv.org/abs/2606.11385v1DeceptionX: Explainable Deception Detection with Multimodal Large Language Models2026-06-09T19:08:00ZDeception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination~(DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.2026-06-09T19:08:00ZJiayu ZhangShuo YeJiajian HuangYawen CuiTaorui WangWei XiaZeheng WangHaowen TangHui MaZitong Yuhttp://arxiv.org/abs/2606.11381v1From Simulation to Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting2026-06-09T19:04:52ZRobotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing 6D pose estimation methods have therefore relied solely on synthetic data that lacks scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Nevertheless, our experiments reveal that a significant sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work. The real-world dataset will be made available upon acceptance.2026-06-09T19:04:52Z7 pages, 6 figures, 1 tableWoojung SonDepartment of Agricultural and Biological Engineering, University of FloridaWon Suk LeeDepartment of Agricultural and Biological Engineering, University of FloridaZijing HuangDepartment of Agricultural and Biological Engineering, University of FloridaDaeun ChoiDepartment of Agricultural and Biological Engineering, University of FloridaCatia SilvaDepartment of Electrical and Computer Engineering, University of FloridaYu SheEdwardson School of Industrial Engineering, Purdue UniversityYan GuSchool of Mechanical Engineering, Purdue Universityhttp://arxiv.org/abs/2412.09023v3STEAM: Squeeze and Transform Enhanced Attention Module2026-06-09T18:50:54ZChannel and spatial attention mechanisms introduced in earlier work enhance the representational capabilities of deep convolutional neural networks (CNNs) but often increase parameter and computational costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, \textit{STEAM: Squeeze and Transform Enhanced Attention Module}, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce \textit{Output Guided Pooling} (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a \(2\%\) increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms the leading modules, ECA and GCT, in terms of accuracy while achieving a threefold reduction in GFLOPs. The code will be made available upon acceptance.2024-12-12T07:38:10ZRishabh SabharwalRam Samarth B BParikshit Singh RathorePunit Rathorehttp://arxiv.org/abs/2606.11363v1NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization2026-06-09T18:43:29ZVector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128$\times$128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100\% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.2026-06-09T18:43:29ZHao LuYongxin GuoOnur KoyunZhengjie ZhuAbbas AliliMetin N. Gurcanhttp://arxiv.org/abs/2512.16415v3CountZES: Counting via Zero-Shot Exemplar Selection2026-06-09T18:40:25ZObject counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimation. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES superior performance among ZOC methods while generalizing effectively across domains.2025-12-18T11:12:50ZMuhammad Ibraheem SiddiquiMuhammad Haris Khanhttp://arxiv.org/abs/2606.11326v1DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax2026-06-09T18:08:51ZRecent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.2026-06-09T18:08:51ZProject Page: https://darkvggt.github.ioMinseong KweonWenyuan ZhaoNuo ChenLulin LiuHuiwen HanZihao ZhuSrinivas ShakkottaiChao TianZhiwen Fanhttp://arxiv.org/abs/2606.11320v1Semantic Segmentation of Node and Edge Diagrams for Assistive Technology2026-06-09T18:02:12ZIn this paper, we present a novel set of related models for semantic segmentation of node-link diagrams. These diagrams are frequently used to represent mathematical graphs, relationships between concepts, and flowcharts. Such diagrams are difficult to access non-visually; while some assistive interfaces have been designed for node-link diagrams, they rely upon a machine-readable representation of the diagram, whereas such diagrams will generally be made available as bitmap images. Our compact deep learning models show excellent quantitative and qualitative performance on a large synthetic dataset of node-link diagrams, reaching per-pixel accuracy over 93\%.2026-06-09T18:02:12Z8 pages, 6 figures, 1 table. In Proceedings of the 23rd Conference on Robots and Vision (2026)Michael CormierYichun ZhaoLaura PaulCameron SwiftDuc Tri DangMiguel Nacentahttp://arxiv.org/abs/2606.11314v1TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions2026-06-09T18:01:02ZWe introduce TRON, a rendering framework that combines 3D Gaussian ray tracing with neural rendering to enable realistic and controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing. Prior approaches that rely solely on physically based rendering (PBR) of Gaussian representations struggle to achieve realistic relighting due to imperfections in reconstructed geometry, material estimates, and light transport estimation. At the same time, neural rendering methods often lack an explicit scene representation, limiting their ability to support interactive editing with fine-grained manipulation. TRON bridges these two paradigms. We use intrinsic decomposition priors from a learned inverse rendering model to regularize the material properties of a Gaussian field, and repurpose a ray tracer to provide radiometric guidance rather than final pixels. By treating this output as a structured 3D scaffold, we empower a lightweight neural renderer to bridge the domain gap between shading-model constrained estimates and photorealistic output. Our key insight is that the combination of explicit 3D knowledge with robust material priors provides speed and controllability, while neural rendering enables the synthesis of photorealistic images. To support real-world scenarios, we train our neural renderer with a multi-stage strategy consisting of large-scale pretraining and targeted fine-tuning on a newly constructed dataset of 2.1M rendered synthetic and real-world frames from 3D reconstructions. TRON outperforms Gaussian-based relighting methods in realism, and prior neural renderers in editability and speed. To the best of our knowledge, TRON is the first method to enable practical interactive applications in captured 3D environments, offering realistic appearance under dynamic geometric, lighting and material conditions.2026-06-09T18:01:02ZProject page: https://research.nvidia.com/labs/sil/projects/tron/Or PerelHassan Abu AlhaijaZian WangJacob MunkbergMatan AtzmonSanja FidlerMasha Shugrinahttp://arxiv.org/abs/2606.11188v1ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations2026-06-09T17:59:28ZThis paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.2026-06-09T17:59:28Ztechnical reportJunke WangXiao WangJiacheng PanXuefeng HuFeng LiJingxiang SunChaorui DengZilong ChenYunpeng ChenKaibin TianMatthew GwilliamHao ChenDanhui GuanKun XuWeilin HuangZuxuan WuHaoqi FanYu-Gang JiangZhenheng Yanghttp://arxiv.org/abs/2606.11187v1Next Forcing: Causal World Modeling with Multi-Chunk Prediction2026-06-09T17:59:22ZAutoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.2026-06-09T17:59:22ZProject page: https://gangweix.github.io/next-forcing/Gangwei XuQihang ZhangJiaming ZhouXing ZhuYujun ShenXin YangYinghao Xuhttp://arxiv.org/abs/2606.11186v1AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference2026-06-09T17:59:05ZLow-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, in this work, we propose AMNet, a unified multimodal framework for LLVE, to support flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support the robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet could handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions. Code and models are available on the project page.2026-06-09T17:59:05ZAccepted at ICML 2026; Project page and code: https://lhfgghc.github.io/LLVE-AMNetHangfeng LiangYutao HuYanhan HuXiaohan WuWenqi ShaoYing Fuhttp://arxiv.org/abs/2606.11289v1i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models2026-06-09T17:58:10ZDiffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.2026-06-09T17:58:10ZProject page at https://zlab-princeton.github.io/i1Boya ZengTianze LuoShu PuJucheng ShenTaiming LuGabriel SarchZhuang Liu