https://arxiv.org/api/e/KbUdFrNlQJbP+YMTePaHROiNY
2026-06-09T20:03:07Z
9301015

http://arxiv.org/abs/2606.09803v1
Echo-Memory: A Controlled Study of Memory in Action World Models
2026-06-08T17:54:10Z
We present \textbf{Echo-Memory}, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: \emph{capacity}, \emph{compression}, \emph{read-out}, and \emph{recurrence}. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it.
These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.
2026-06-08T17:54:10Z
9 figures and 28 pages, Code at \href{https://github.com/Echo-Team-Joy-Future-Academy-JD/Echo-Memory}{this URL}
Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

http://arxiv.org/abs/2606.09794v1
Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction
2026-06-08T17:50:41Z
View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from the experiment, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, that enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate.
We validate its performance in radiance-field reconstruction tasks.
2026-06-08T17:50:41Z
19 pages, 11 figures
Ewa Miazga, Jorge Condor, Piotr Didyk

http://arxiv.org/abs/2605.16223v2
Evaluating Design Video Generation: Metrics for Compositional Fidelity
2026-06-08T17:19:01Z
Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field. We release the code and dataset here: https://github.com/purvanshi/lica-bench.
2026-05-15T17:34:05Z
ICML 2026 Workshop on Human-AI Co-Creativity
Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

http://arxiv.org/abs/2502.06819v2
AccioScene: Compositional 3D Scene Generation via Graph Diffusion and Interaction-driven Critics
2026-06-08T17:06:41Z
This paper presents a framework for generating 3D indoor scenes from text prompts. Existing methods often formulate scene synthesis as an object layout prediction problem conditioned on a single input modality, such as a text description, room shape, or scene graph. This design can lead to object collisions and limited functional plausibility, reducing its practical applicability. To address these limitations, we introduce a multi-stage pipeline that better reflects practical scene creation scenarios.
Given a text prompt describing partial scene content, our method first uses graph diffusion to produce a contextually coherent scene graph and then predicts a realistic object layout. In addition, we incorporate lightweight human-object interaction priors to encourage human-centric and functional arrangements, with explicit spatial constraints to reduce interpenetration. Our approach generates coherent 3D scenes with viable layouts that better support human interaction. Experiments on the 3D-FRONT dataset demonstrate that our method achieves competitive or state-of-the-art performance compared with existing approaches, while improving the physical plausibility of generated scenes.
2025-02-05T04:00:24Z
Yao Wei, Matteo Toso, Pietro Morerio, Changjae Oh, Michael Ying Yang, Alessio Del Bue

http://arxiv.org/abs/2606.09741v1
bbsolver: A Unified Error-Bounded Spatiotemporal Optimization Solver for Key Timing and Topology-Consistent Vector Paths
2026-06-08T17:04:33Z
Dense sampling records what an animation system actually evaluated, but it produces a poor final representation: every sampled frame can become a key, edit handles become noisy, and animated vector paths remain hard to adjust. Existing reducers usually treat the two axes separately: animation-curve reducers reduce key timing, while curve and path simplifiers reduce geometry. When applied independently to animated paths, these methods can break point identity across frames, change vertex structure over time, or provide no single error budget that covers both timing and shape. bbsolver frames the task as tolerance-bounded spatiotemporal reduction. A host application, such as After Effects or Blender, samples temporal and spatial animation into a documented JSON bundle; the standalone solver chooses sparse keys, interpolation metadata, and path representation; and the output is accepted only if replayed samples remain within the requested worst-case error.
The same solver core can be used by any application that can export samples and write back returned keys or paths. In After Effects validation, solved keys written back into AE and re-sampled from AE playback reduce a DUIK humanoid walk cycle from 12,684 samples to 540 keys at epsilon=1, a 23.5x reduction, and an ant rig from 11,956 samples to 653 keys, an 18.3x reduction, with maximum errors below 1 px and 1 degree. A Blender-sampled FBX mocap retarget reaches 214 keys from 13,455 samples at epsilon=3; baselines tuned to matched measured accuracy require 4.5x to 27.5x more scalar key entries. For vector paths, bbsolver supports reduction when vertex identity/order is constant over time and diagnostics for variable-vertex-count streams, including a 6.7x After Effects-compatible procedural-path compression and exact transition-timing recovery in a diagnostic case.
2026-06-08T17:04:33Z
18 pages, 11 figures; source code and reproducibility artifact at https://github.com/ivg-design/bbsolver
Ilya Gusinski (IVG Design)

http://arxiv.org/abs/2606.09606v1
Path-Traced Inverse Rendering with Global Illumination in 3D Gaussian Fields
2026-06-08T15:15:07Z
Ray tracing enables 3D Gaussian fields to serve as a representation for physically based light transport. Faithful inverse rendering requires forward rendering and backward optimization to be defined within a consistent light-transport pipeline. Existing inverse rendering methods estimate G-buffers via splatting and optimize materials in screen space, tying the recovered properties to a rasterization-based pipeline. This pipeline mismatch, together with simplified rendering equations that neglect indirect illumination, often leads to inconsistent shading, visible artifacts, and inaccurate material-lighting estimation under path-traced rendering.
Therefore, we propose a splatting-free path-traced inverse rendering framework for 3D Gaussian fields, where forward light transport and backward gradient propagation are defined within a unified ray-tracing pipeline. Our key idea is to define a path-space equivalent interaction model for overlapping Gaussian primitives, under which Monte-Carlo-based path tracing is unbiased for the induced light-transport integral, while pathwise gradients are replayed over the same ray-traced interactions rather than splatting-derived screen-space buffers. The framework optimizes materials and a compact Spherical-Gaussian environment under the full rendering equation with ray-traced visibility and multi-bounce light transport. Extensive experiments demonstrate competitive material inversion and improved path-traced rendering quality, producing more plausible shadows, reflections, and relighting results under global illumination.
2026-06-08T15:15:07Z
Junke Zhu, Hao Zhang, Yutian Zhu, Ang Li, Chenxiao Hu, Meng Gai, Fei Zhu, Zhangjin Huang, Sheng Li

http://arxiv.org/abs/2606.06497v2
Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers
2026-06-08T13:17:08Z
Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons.
The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.
2026-04-24T12:40:58Z
5 pages, 4 figures. Accepted to ACM Creativity & Cognition XAIxArts Workshop 2026
Adam Cole, Rebecca Fiebrink, Mick Grierson

http://arxiv.org/abs/2604.16512v2
Medial Axis Aware Learning of Signed Distance Functions
2026-06-08T13:14:17Z
We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally.
Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.
2026-04-15T08:55:07Z
Samuel Weidemaier, Christoph Norden-Smoch, Martin Rumpf

http://arxiv.org/abs/2604.25781v3
Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects
2026-06-08T09:16:44Z
Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti.
The code, dataset, and the prototype system are at https://arlo-yang.github.io/Sketch2Arti.
2026-04-28T15:47:30Z
Project page: https://arlo-yang.github.io/Sketch2Arti
Yi Yang, Hao Pan, Yijing Cui, Alla Sheffer, Changjian Li

http://arxiv.org/abs/2606.09134v1
From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs
2026-06-08T07:32:06Z
Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.
2026-06-08T07:32:06Z
Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026
Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

http://arxiv.org/abs/2512.07248v2
Distinguishing Imitation Error from Intrinsic Motion Learning Difficulty
2026-06-08T06:49:19Z
Physics-based motion imitation is central to humanoid control, yet current evaluation metrics (e.g., MPJPE) only quantify imitation outcomes, not their underlying causes.
This conflation obscures a critical diagnostic question: when imitation error occurs, does it stem from policy limitations or the intrinsic learning difficulty of the target motion? To resolve this ambiguity, we propose the Torque Variation Score (TVS), a physics-grounded metric that quantifies the inherent learning difficulty of a motion independently of any policy's performance. TVS measures the magnitude of torque variation required to correct small pose perturbations, directly capturing how dynamical properties shape the reinforcement learning landscape. We establish that high-TV motions induce flat reward landscapes and vanishing policy gradients, explaining persistent imitation failures. Extensive experiments with state-of-the-art methods (UHC, PHC+) confirm TVS strongly correlates with imitation error and enables principled error attribution: high error on low-TV motions indicates policy deficiency, while high error on high-TV motions reflects fundamental learning constraints. Beyond error diagnosis, TVS facilitates three practical applications: Maximum Imitable Difficulty (MID) for policy capability assessment, Difficulty-Stratified Joint Error (DSJE) for granular performance profiling, and Flawed Motion Detection for identifying segments with abnormally high learning difficulty to support mocap data curation and quality control. TVS provides a rigorous lens to distinguish policy-induced errors from motion-inherent challenges and enhances motion dataset reliability.
2025-12-08T07:45:24Z
Zhaorui Meng, Lu Yin, Xinrui Chen, Chengxu Zuo, Anjun Chen, Shihui Guo, Yipeng Qin

http://arxiv.org/abs/2604.22482v2
Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
2026-06-08T05:20:56Z
While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions.
Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.
2026-04-24T12:03:27Z
Datasets Link: https://github.com/Jou719/Holo360D
Jing Ou, Zidong Cao, Yinrui Ren, Zhuoxiao Li, Jinjing Zhu, Tongyan Hua, Shuai Zhang, Hui Xiong, Wufan Zhao

http://arxiv.org/abs/2606.09018v1
MaterialClusterGS: Palette-Based Material Decomposition and Physically-Based Relighting with 2D Gaussian Splatting
2026-06-08T04:30:03Z
We present MaterialClusterGS, a palette-based material decomposition framework for 2D Gaussian Splatting that enables physically based relighting and material editing. Existing Gaussian inverse rendering methods typically assign independent BRDF parameters to individual primitives.
While flexible, this local fitting strategy makes material recovery highly under-constrained: shadows, indirect illumination, geometric errors, and visibility residuals can be absorbed into thousands of slightly different local material estimates. Meanwhile, recent palette-based appearance methods operate solely in RGB space without modeling physical materials or illumination. To bridge this gap, we represent scene materials using a compact global palette of shared BRDF prototypes assigned via a continuous spatial material field. Without shared material structure, editing one region does not propagate consistently to others of the same material, making per-primitive decompositions impractical for editing. We jointly optimize the material field, palette prototypes, and environment lighting under a physically based rendering objective. The resulting framework recovers compact, spatially coherent attributes directly usable for material editing, relighting, and transfer.
2026-06-08T04:30:03Z
Hao Zhang, Ang Li, Boyan Du, Junke Zhu, Fei Zhu, Meng Gai, Zhangjin Huang, Guoping Wang, Sheng Li

http://arxiv.org/abs/2606.08739v1
The Minimal Retroreflective Microfacet Model
2026-06-07T17:20:21Z
We present the Minimal Retroreflective Microfacet (MRM) model, which turns any existing microfacet BSDF into a physically plausible retroreflective one by a single substitution: replacing the view direction with its reflection about the surface normal before evaluating the standard model. Based on the previously published back-vector formulation, MRM requires only minimal code changes and has been adopted in the OpenPBR and MaterialX material standards. We prove reciprocity and energy conservation under the assumption of a reflection-symmetric normal distribution function (NDF), which holds for all commonly used distributions, and validate the model against measured retroreflective material data.
2026-06-07T17:20:21Z
16 pages, 7 figures, 1 code listing. Author's version.
Published in the Journal of Computer Graphics Techniques (JCGT), Vol. 15, No. 1, 2026. Article URL: http://jcgt.org/published/0015/01/04/
Journal of Computer Graphics Techniques (JCGT), Vol. 15, No. 1, pp. 60-75, 2026
Jamie Portsmouth (Autodesk), Matthias Raab (NVIDIA), Laurent Belcour (Intel), Francis Liu (NVIDIA)

http://arxiv.org/abs/2605.27852v3
ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation
2026-06-07T10:58:14Z
Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- under a single model and achieves approximately $4$--$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.
Project Page: https://yucrazing.github.io/clothtransformer/
2026-05-27T02:10:58Z
Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan