https://arxiv.org/api/e/KbUdFrNlQJbP+YMTePaHROiNY 2026-06-09T20:03:07Z 9301 0 15 http://arxiv.org/abs/2606.09803v1 Echo-Memory: A Controlled Study of Memory in Action World Models 2026-06-08T17:54:10Z We present \textbf{Echo-Memory}, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: \emph{capacity}, \emph{compression}, \emph{read-out}, and \emph{recurrence}. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics. 2026-06-08T17:54:10Z 9 figures and 28 pages, Code at \href{https://github.com/Echo-Team-Joy-Future-Academy-JD/Echo-Memory}{this URL} Wayne King Zeyue Xue Yuxuan Bian Jie Huang Haoran Li Yaowei Li Yaofeng Su Yuming Li Haoyu Wang Shiyi Zhang Songchun Zhang Yuwei Niu Sihan Xu Junhao Zhuang Haoyang Huang Nan Duan http://arxiv.org/abs/2606.09794v1 Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction 2026-06-08T17:50:41Z View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on SH. However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from the experiment, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function that enables efficient modeling and learning of high-frequency appearance effects while maintaining compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks. 2026-06-08T17:50:41Z 19 pages, 11 figures Ewa Miazga Jorge Condor Piotr Didyk http://arxiv.org/abs/2605.16223v2 Evaluating Design Video Generation: Metrics for Compositional Fidelity 2026-06-08T17:19:01Z Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field. We release the code and dataset here: https://github.com/purvanshi/lica-bench. 2026-05-15T17:34:05Z ICML 2026 Workshop on Human-AI Co-Creativity Adrienne Deganutti Dingning Cao Jaejung Seol Elad Hirsch Purvanshi Mehta http://arxiv.org/abs/2502.06819v2 AccioScene: Compositional 3D Scene Generation via Graph Diffusion and Interaction-driven Critics 2026-06-08T17:06:41Z This paper presents a framework for generating 3D indoor scenes from text prompts. Existing methods often formulate scene synthesis as an object layout prediction problem conditioned on a single input modality, such as a text description, room shape, or scene graph. This design can lead to object collisions and limited functional plausibility, reducing its practical applicability. To address these limitations, we introduce a multi-stage pipeline that better reflects practical scene creation scenarios. Given a text prompt describing partial scene content, our method first uses graph diffusion to produce a contextually coherent scene graph and then predicts a realistic object layout. In addition, we incorporate lightweight human-object interaction priors to encourage human-centric and functional arrangements, with explicit spatial constraints to reduce interpenetration. Our approach generates coherent 3D scenes with viable layouts that better support human interaction. Experiments on the 3D-FRONT dataset demonstrate that our method achieves competitive or state-of-the-art performance compared with existing approaches, while improving the physical plausibility of generated scenes. 2025-02-05T04:00:24Z Yao Wei Matteo Toso Pietro Morerio Changjae Oh Michael Ying Yang Alessio Del Bue http://arxiv.org/abs/2606.09741v1 bbsolver: A Unified Error-Bounded Spatiotemporal Optimization Solver for Key Timing and Topology-Consistent Vector Paths 2026-06-08T17:04:33Z Dense sampling records what an animation system actually evaluated, but it produces a poor final representation: every sampled frame can become a key, edit handles become noisy, and animated vector paths remain hard to adjust. Existing reducers usually treat the two axes separately: animation-curve reducers reduce key timing, while curve and path simplifiers reduce geometry. When applied independently to animated paths, these methods can break point identity across frames, change vertex structure over time, or provide no single error budget that covers both timing and shape. bbsolver frames the task as tolerance-bounded spatiotemporal reduction. A host application, such as After Effects or Blender, samples temporal and spatial animation into a documented JSON bundle; the standalone solver chooses sparse keys, interpolation metadata, and path representation; and the output is accepted only if replayed samples remain within the requested worst-case error. The same solver core can be used by any application that can export samples and write back returned keys or paths. In After Effects validation, solved keys written back into AE and re-sampled from AE playback reduce a DUIK humanoid walk cycle from 12,684 samples to 540 keys at epsilon=1, a 23.5x reduction, and an ant rig from 11,956 samples to 653 keys, an 18.3x reduction, with maximum errors below 1 px and 1 degree. A Blender-sampled FBX mocap retarget reaches 214 keys from 13,455 samples at epsilon=3; baselines tuned to matched measured accuracy require 4.5x to 27.5x more scalar key entries. For vector paths, bbsolver supports reduction when vertex identity/order is constant over time and diagnostics for variable-vertex-count streams, including a 6.7x After Effects-compatible procedural-path compression and exact transition-timing recovery in a diagnostic case. 2026-06-08T17:04:33Z 18 pages, 11 figures; source code and reproducibility artifact at https://github.com/ivg-design/bbsolver Ilya Gusinski IVG Design http://arxiv.org/abs/2606.09606v1 Path-Traced Inverse Rendering with Global Illumination in 3D Gaussian Fields 2026-06-08T15:15:07Z Ray tracing enables 3D Gaussian fields to serve as a representation for physically based light transport. Faithful inverse rendering requires forward rendering and backward optimization to be defined within a consistent light-transport pipeline. Existing inverse rendering methods estimate G-buffers via splatting and optimize materials in screen space, tying the recovered properties to a rasterization-based pipeline. This pipeline mismatch, together with simplified rendering equations that neglect indirect illumination, often leads to inconsistent shading, visible artifacts, and inaccurate material-lighting estimation under path-traced rendering. Therefore, we propose a splatting-free path-traced inverse rendering framework for 3D Gaussian fields, where forward light transport and backward gradient propagation are defined within a unified ray-tracing pipeline. Our key idea is to define a path-space equivalent interaction model for overlapping Gaussian primitives, under which Monte-Carlo-based path tracing is unbiased for the induced light-transport integral, while pathwise gradients are replayed over the same ray-traced interactions rather than splatting-derived screen-space buffers. The framework optimizes materials and a compact Spherical-Gaussian environment under the full rendering equation with ray-traced visibility and multi-bounce light transport. Extensive experiments demonstrate competitive material inversion and improved path-traced rendering quality, producing more plausible shadows, reflections, and relighting results under global illumination. 2026-06-08T15:15:07Z Junke Zhu Hao Zhang Yutian Zhu Ang Li Chenxiao Hu Meng Gai Fei Zhu Zhangjin Huang Sheng Li http://arxiv.org/abs/2606.06497v2 Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers 2026-06-08T13:17:08Z Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space. 2026-04-24T12:40:58Z 5 pages, 4 figures. Accepted to ACM Creativity & Cognition XAIxArts Workshop 2026 Adam Cole Rebecca Fiebrink Mick Grierson http://arxiv.org/abs/2604.16512v2 Medial Axis Aware Learning of Signed Distance Functions 2026-06-08T13:14:17Z We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method. 2026-04-15T08:55:07Z Samuel Weidemaier Christoph Norden-Smoch Martin Rumpf http://arxiv.org/abs/2604.25781v3 Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects 2026-06-08T09:16:44Z Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at https://arlo-yang.github.io/Sketch2Arti. 2026-04-28T15:47:30Z Project page: https://arlo-yang.github.io/Sketch2Arti Yi Yang Hao Pan Yijing Cui Alla Sheffer Changjian Li http://arxiv.org/abs/2606.09134v1 From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs 2026-06-08T07:32:06Z Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%. 2026-06-08T07:32:06Z Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026 Jiangtao Shuai Zongxiong Chen Manfred Hauswirth Sonja Schimmler http://arxiv.org/abs/2512.07248v2 Distinguishing Imitation Error from Intrinsic Motion Learning Difficulty 2026-06-08T06:49:19Z Physics-based motion imitation is central to humanoid control, yet current evaluation metrics (e.g., MPJPE) only quantify imitation outcomes, not their underlying causes. This conflation obscures a critical diagnostic question: when imitation error occurs, does it stem from policy limitations or the intrinsic learning difficulty of the target motion? To resolve this ambiguity, we propose the Torque Variation Score (TVS), a physics-grounded metric that quantifies the inherent learning difficulty of a motion independently of any policy's performance. TVS measures the magnitude of torque variation required to correct small pose perturbations, directly capturing how dynamical properties shape the reinforcement learning landscape. We establish that high-TV motions induce flat reward landscapes and vanishing policy gradients, explaining persistent imitation failures. Extensive experiments with state-of-the-art methods (UHC, PHC+) confirm TVS strongly correlates with imitation error and enables principled error attribution: high error on low-TV motions indicates policy deficiency, while high error on high-TV motions reflects fundamental learning constraints. Beyond error diagnosis, TVS facilitates three practical applications: Maximum Imitable Difficulty (MID) for policy capability assessment, Difficulty-Stratified Joint Error (DSJE) for granular performance profiling, and Flawed Motion Detection for identifying segments with abnormally high learning difficulty to support mocap data curation and quality control. TVS provides a rigorous lens to distinguish policy-induced errors from motion-inherent challenges and enhances motion dataset reliability. 2025-12-08T07:45:24Z Zhaorui Meng Lu Yin Xinrui Chen Chengxu Zuo Anjun Chen Shihui Guo Yipeng Qin http://arxiv.org/abs/2604.22482v2 Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond 2026-06-08T05:20:56Z While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available. 2026-04-24T12:03:27Z Datasets Link: https://github.com/Jou719/Holo360D Jing Ou Zidong Cao Yinrui Ren Zhuoxiao Li Jinjing Zhu Tongyan Hua Shuai Zhang Hui Xiong Wufan Zhao http://arxiv.org/abs/2606.09018v1 MaterialClusterGS: Palette-Based Material Decomposition and Physically-Based Relighting with 2D Gaussian Splatting 2026-06-08T04:30:03Z We present MaterialClusterGS, a palette-based material decomposition framework for 2D Gaussian Splatting that enables physically based relighting and material editing. Existing Gaussian inverse rendering methods typically assign independent BRDF parameters to individual primitives. While flexible, this local fitting strategy makes material recovery highly under-constrained: shadows, indirect illumination, geometric errors, and visibility residuals can be absorbed into thousands of slightly different local material estimates. Meanwhile, recent palette-based appearance methods operate solely in RGB space without modeling physical materials or illumination. To bridge this gap, we represent scene materials using a compact global palette of shared BRDF prototypes assigned via a continuous spatial material field. Without shared material structure, editing one region does not propagate consistently to others of the same material, making per-primitive decompositions impractical for editing. We jointly optimize the material field, palette prototypes, and environment lighting under a physically based rendering objective. The resulting framework recovers compact, spatially coherent attributes directly usable for material editing, relighting, and transfer. 2026-06-08T04:30:03Z Hao Zhang Ang Li Boyan Du Junke Zhu Fei Zhu Meng Gai Zhangjin Huang Guoping Wang Sheng Li http://arxiv.org/abs/2606.08739v1 The Minimal Retroreflective Microfacet Model 2026-06-07T17:20:21Z We present the Minimal Retroreflective Microfacet (MRM) model, which turns any existing microfacet BSDF into a physically plausible retroreflective one by a single substitution: replacing the view direction with its reflection about the surface normal before evaluating the standard model. Based on the previously published back-vector formulation, MRM requires only minimal code changes and has been adopted in the OpenPBR and MaterialX material standards. We prove reciprocity and energy conservation under the assumption of a reflection-symmetric normal distribution function (NDF), which holds for all commonly used distributions, and validate the model against measured retroreflective material data. 2026-06-07T17:20:21Z 16 pages, 7 figures, 1 code listing. Author's version. Published in the Journal of Computer Graphics Techniques (JCGT), Vol. 15, No. 1, 2026. Article URL: http://jcgt.org/published/0015/01/04/ Journal of Computer Graphics Techniques (JCGT), Vol. 15, No. 1, pp. 60-75, 2026 Jamie Portsmouth Autodesk Matthias Raab NVIDIA Laurent Belcour Intel Francis Liu NVIDIA http://arxiv.org/abs/2605.27852v3 ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation 2026-06-07T10:58:14Z Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- under a single model and achieves approximately $4$--$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts. Project Page: https://yucrazing.github.io/clothtransformer/ 2026-05-27T02:10:58Z Yu Zhang Yidi Shao Wenqi Ouyang Yushi Lan Zhexin Liang Chengrui Wu Xudong Xu Xingang Pan