https://arxiv.org/api/c/p8WVjOgmJvJoFeT+4Xdsu7dxY
2026-03-24T09:58:00Z

http://arxiv.org/abs/2603.14925v2
Workflow-Aware Structured Layer Decomposition for Illustration Production
Updated: 2026-03-18T13:18:12Z
Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulates the standard anime production workflow. Experiments demonstrate that our method achieves accurate and visually coherent layer decompositions. We believe that the resulting layered representation further enables downstream tasks such as recoloring and embedding texture, supporting content creation and illustration editing. Code is available at: https://github.com/zty0304/Anime-layer-decomposition
Published: 2026-03-16T07:28:37Z
Comment: 17 pages, 15 figures
Authors: Tianyu Zhang, Dongchi Li, Keiichi Sawada, Haoran Xie

http://arxiv.org/abs/2509.25857v2
Vector sketch animation generation with differentiable motion trajectories
Updated: 2026-03-18T10:19:37Z
Sketching is a direct and inexpensive means of visual expression.
Though image-based sketching has been well studied, video-based sketch animation generation is still very challenging due to the temporal coherence requirement. In this paper, we propose a novel end-to-end automatic generation approach for vector sketch animation. To solve the flickering issue, we introduce a Differentiable Motion Trajectory (DMT) representation that describes the frame-wise movement of stroke control points using differentiable polynomial-based trajectories. DMT enables global semantic gradient propagation across multiple frames, significantly improving the semantic consistency and temporal coherence, and producing high-framerate output. DMT employs a Bernstein basis to balance the sensitivity of polynomial parameters, thus achieving more stable optimization. Instead of implicit fields, we introduce sparse track points for explicit spatial modeling, which improves efficiency and supports long-duration video processing. Evaluations on DAVIS and LVOS datasets demonstrate the superiority of our approach over SOTA methods. Cross-domain validation on 3D models and text-to-video data confirms the robustness and compatibility of our approach.
Published: 2025-09-30T06:53:04Z
Comment: 14 pages, 12 figures
Authors: Xinding Zhu, Xinye Yang, Shuyang Zheng, Zhexin Zhang, Fei Gao, Jing Huang, Jiazhou Chen

http://arxiv.org/abs/2603.20284v1
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Updated: 2026-03-18T06:36:46Z
Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency.
In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency.
Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
Published: 2026-03-18T06:36:46Z
Comment: 10 pages, 6 figures. Accepted by CVPR 2026
Authors: Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu

http://arxiv.org/abs/2603.17337v1
Scale-Aware Navigation of Astronomical Survey Imagery Data on High Resolution Immersive Displays
Updated: 2026-03-18T04:04:39Z
Upcoming astronomical surveys produce imagery that spans many orders of magnitude in spatial scale, requiring scientists to reason fluidly between global structure and local detail. Data from the Vera C. Rubin Observatory exemplifies this challenge, as traditional desktop-based workflows often rely on discrete views or static cutouts that fragment context during exploration. This paper presents a design-oriented framework for scale-aware navigation of astronomical survey imagery in high-resolution immersive display environments. We illustrate these principles through representative usage scenarios using Vera Rubin Observatory and Milky Way survey imagery deployed in room-scale immersive environments, including tiled high-resolution displays and curved immersive systems. Our goal is to contribute design insights that inform the development of immersive interaction paradigms for exploratory analysis of extreme-scale scientific imagery.
Published: 2026-03-18T04:04:39Z
Comment: 4 pages, 2 figures, to appear in IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (IEEE VRW 2026)
Authors: Ava Nederlander (Stony Brook University), Zainab Aamir (Stony Brook University), Arie E. Kaufman (Stony Brook University)

http://arxiv.org/abs/2603.16866v1
ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Updated: 2026-03-17T17:59:49Z
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities.
However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready, semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.
Published: 2026-03-17T17:59:49Z
Comment: Website: https://manitwin.github.io/
Authors: Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo

http://arxiv.org/abs/2603.16853v1
BrickSim: A Physics-Based Simulator for Manipulating Interlocking Brick Assemblies
Updated: 2026-03-17T17:56:53Z
Interlocking brick assemblies provide a standardized yet challenging testbed for contact-rich and long-horizon robotic manipulation, but existing rigid-body simulators do not faithfully capture snap-fit mechanics. We present BrickSim, the first real-time physics-based simulator for interlocking brick assemblies. BrickSim introduces a compact force-based mechanics model for snap-fit connections and solves the resulting internal force distribution using a structured convex quadratic program.
Combined with a hybrid architecture that delegates rigid-body dynamics to the underlying physics engine while handling snap-fit mechanics separately, BrickSim enables real-time, high-fidelity simulation of assembly, disassembly, and structural collapse. On 150 real-world assemblies, BrickSim achieves 100% accuracy in static stability prediction with an average solve time of 5 ms. In dynamic drop tests, it also faithfully reproduces real-world structural collapse, precisely mirroring both the occurrence of breakage and the specific breakage locations. Built on Isaac Sim, BrickSim further supports seamless integration with a wide variety of robots and existing pipelines. We demonstrate robotic construction of brick assemblies using BrickSim, highlighting its potential as a foundation for research in dexterous, long-horizon robotic manipulation. BrickSim is open-source, and the code is available at https://github.com/intelligent-control-lab/BrickSim.
Published: 2026-03-17T17:56:53Z
Comment: 9 pages, 9 figures
Authors: Haowei Wen, Ruixuan Liu, Weiyi Piao, Siyu Li, Changliu Liu

http://arxiv.org/abs/2603.16801v1
A low-data, low-cost, and open-source workflow for 3D printing lithographs for digital accessibility of microscopy images
Updated: 2026-03-17T17:05:28Z
Describe an animal without using the verb look. Can you effectively provide an alternative method for interpreting complex microscopy images while preserving the length scale? The world is filled with features too small for our eyes to see: the setae on a gecko's feet, the cuticles covering a rat's whisker, or the fuzziness of a bat's wing. Furthermore, these structures are non-homogeneous, often shifting from stiff to soft. We provide a workflow for producing low-data, low-cost, and open-source lithograph files, allowing tactile accessibility in microscopy images. The lithographs made with this workflow can be printed on a 350 USD 3D printer using 3D files under 100 Mb, for a total cost per print of 0.75 USD.
This work seeks to leverage advanced 3D printing to create tactile graphics and art that make science more accessible and enable tactile exploration of biological structures. The framework in this text is aligned with a GitHub repository that will be constantly updated, allowing tactile media to be created as 3D printing and lithography become more streamlined in the years to come.
Published: 2026-03-17T17:05:28Z
Comment: 3 figures, Abigale Stangl and Andrew K. Schulz are co-corresponding authors
Authors: Robert Faulkner, Natalia Gonzalez-Vazquez, Victoria Gamez, Karly E. Cohen, Gunther Richter, Abigale Stangl, Andrew K. Schulz

http://arxiv.org/abs/2603.16612v1
Retrieval-Augmented Sketch-Guided 3D Building Generation
Updated: 2026-03-17T14:51:52Z
In the early design stage of Japanese detached houses, the lack of a unified design representation among clients, sales representatives, and designers leads to design drift and inefficient feedback. Usually, sketches handed off by sales representatives may lose details for quick drawing, which reduces the fidelity of subsequent 3D generation using generative AI models. The generated 3D model typically takes the form of a single unified mesh, preventing component-level editing. To solve these issues, we propose a multi-stage 3D generative design framework capable of producing architectural models from rough design sketches. The framework combines generative and retrieval-based methods to enable component-level editing and personalized customization. It adopts a multimodal representation for 3D model generation, applies component segmentation to localize architectural components such as windows and doors, and uses retrieval to support targeted replacement of components. Experiments show that the work enables modular customization which is thought to be suitable for personalized architectural design.
This work introduces a multi-stage sketch-to-3D framework for Japanese detached houses, provides facade and component datasets, and shows effectiveness through quantitative and expert evaluations.
Published: 2026-03-17T14:51:52Z
Comment: 10 pages, 4 figures, Proceedings of CAADRIA 2026
Authors: Zhengyang Wang, Nuttapong Rochanavibhata, Yuxiao Ren, Xusheng Du, Ye Zhang, Haoran Xie

http://arxiv.org/abs/2603.16566v1
VideoMatGen: PBR Materials through Joint Generative Modeling
Updated: 2026-03-17T14:24:20Z
We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.
Published: 2026-03-17T14:24:20Z
Authors: Jon Hasselgren, Zheng Zeng, Milos Hasan, Jacob Munkberg

http://arxiv.org/abs/2512.11237v2
WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
Updated: 2026-03-17T13:12:15Z
Existing methods achieve high-quality facial albedo capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial albedo capture from a smartphone video recorded in the wild. To disentangle high-quality albedo from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. We first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering.
However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for the albedo map and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Other reflectance maps are then predicted from the albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin.
Published: 2025-12-12T02:37:03Z
Comment: CVPR 2026. project page: https://yxuhan.github.io/WildCap/index.html; code: https://github.com/yxuhan/WildCap
Authors: Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu

http://arxiv.org/abs/2603.16478v1
Fast and Reliable Gradients for Deformables Across Frictional Contact Regimes
Updated: 2026-03-17T13:04:23Z
Differentiable simulation establishes the mathematical foundation for solving challenging inverse problems in computer graphics and robotics, such as physical system identification and inverse dynamics control. However, rigor in frictional contact remains the "elephant in the room." Current frameworks often avoid contact singularities via non-Markovian position approximations or heuristic gradients. This lack of mathematical consistency distorts gradients, causing optimization stagnation or failure in complex frictional contact and large-deformation scenarios.
We introduce our unified, fully GPU-accelerated differentiable simulator, which establishes a rigorous theoretical paradigm through: Long-Horizon Consistency: enforcing strict Markovian dynamics on a coupled position-velocity manifold to prevent gradient collapse; Unified Contact Stability: employing a mass-aligned preconditioner and soft Fischer-Burmeister operator for smooth frictional optimization; Robust Material Identification: resolving FEM singularities via a derived "Within-block Commutation" condition. Our experiments demonstrate our solver's efficacy in bridging the Sim-to-Real gap, delivering precise, low-noise gradients in contact-rich tasks like dexterous manipulation and cloth folding. By mitigating the gradient instability issues common in conventional approaches, our framework significantly enhances the fidelity of physical system identification and control.
Published: 2026-03-17T13:04:23Z
Authors: Ziqiu Zeng, Gang Yang, Zhenhao Huang, Yulin Li, Jason Pho, Siyuan Luo, Fan Shi

http://arxiv.org/abs/2603.16447v1
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
Updated: 2026-03-17T12:30:27Z
In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths.
ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.
Published: 2026-03-17T12:30:27Z
Comment: Accepted to CVPR 2026, Project page: https://ustc3dv.github.io/ProgressiveAvatars/
Authors: Kaiwen Song, Jinkai Cui, Juyong Zhang

http://arxiv.org/abs/2504.05296v3
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
Updated: 2026-03-17T12:14:12Z
3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism, while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance prior. This allows for photorealistic refinements while ensuring the dynamics remain plausible. Our framework enables scene-wide dynamic weather effects, including snowfall, rainfall, fog, and sandstorms, with physically plausible motion. Experiments demonstrate our physics-guided approach significantly outperforms baselines, with ablations confirming this joint refinement is essential for generating coherent, high-fidelity dynamics.
Published: 2025-04-07T17:51:21Z
Comment: Accepted to CVPR 2026.
Project webpage: https://galfiebelman.github.io/let-it-snow/
Authors: Gal Fiebelman, Hadar Averbuch-Elor, Sagie Benaim

http://arxiv.org/abs/2511.02580v2
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Updated: 2026-03-17T10:21:32Z
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for layer-wise image generation that requires neither fine-tuning nor additional data. TAUE embeds global structural information from intermediate denoising latents into the initial noise to preserve spatial coherence, and integrates semantic cues through cross-layer attention sharing to maintain contextual and visual consistency across layers. Extensive experiments demonstrate that TAUE achieves state-of-the-art performance among training-free methods, delivering image quality comparable to fine-tuned models while improving inter-layer consistency. Moreover, it enables new applications, such as layout-aware editing, multi-object composition, and background replacement, indicating potential for interactive, layer-separated generation systems in real-world creative workflows.
Published: 2025-11-04T13:56:39Z
Comment: Accepted to CVPR 2026 Findings. The first two authors contributed equally.
Project Page: https://iyatomilab.github.io/TAUE
Authors: Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi

http://arxiv.org/abs/2603.16103v1
NanoGS: Training-Free Gaussian Splat Simplification
Updated: 2026-03-17T03:58:02Z
3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass-preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at https://saliteta.github.io/NanoGS/.
Published: 2026-03-17T03:58:02Z
Authors: Butian Xiong, Rong Liu, Tiantian Zhou, Meida Chen, Zhiwen Fan, Andrew Feng
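
Illustrative note on the NanoGS entry: the abstract describes approximating a pair of Gaussians with a single primitive via mass-preserved moment matching. The sketch below shows the textbook moment-matching merge of two weighted Gaussians, which keeps the total weight, the mean, and the second moment of the two-component mixture. It is not taken from the NanoGS code. The function name `merge_gaussians` and the use of a scalar weight as the "mass" are assumptions made for illustration, and the abstract does not say how NanoGS defines mass or handles opacity and color.

```python
from typing import List, Tuple

Vec = List[float]
Mat = List[List[float]]


def merge_gaussians(w1: float, mu1: Vec, cov1: Mat,
                    w2: float, mu2: Vec, cov2: Mat) -> Tuple[float, Vec, Mat]:
    """Merge two weighted Gaussians into one by moment matching.

    Hypothetical helper: the scalar weights stand in for "mass".
    The result has the same total weight, mean, and covariance
    as the two-component mixture it replaces.
    """
    w = w1 + w2  # total mass is preserved
    if w <= 0.0:
        raise ValueError("combined weight must be positive")
    a1, a2 = w1 / w, w2 / w  # normalized mixing proportions
    n = len(mu1)

    # First moment: weighted average of the two means.
    mu = [a1 * mu1[i] + a2 * mu2[i] for i in range(n)]

    # Offsets of each component mean from the merged mean.
    d1 = [mu1[i] - mu[i] for i in range(n)]
    d2 = [mu2[i] - mu[i] for i in range(n)]

    # Second central moment: within-component covariance plus the
    # spread contributed by the separation of the two means.
    cov = [[a1 * (cov1[i][j] + d1[i] * d1[j]) +
            a2 * (cov2[i][j] + d2[i] * d2[j])
            for j in range(n)] for i in range(n)]
    return w, mu, cov


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    w, mu, cov = merge_gaussians(1.0, [0.0, 0.0, 0.0], identity,
                                 1.0, [2.0, 0.0, 0.0], identity)
    print(w, mu, cov[0][0])
```

In this example the merged covariance grows along the x axis only, because that is the direction in which the two means are separated. A merge cost like the one the abstract mentions would then compare this single Gaussian with the original pair, but the abstract does not give its form.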