https://arxiv.org/api/96jwcTuUMtAd4NRbnGwvrOJmZwU
2026-06-14T19:28:16Z

http://arxiv.org/abs/2605.13856v1
Image-aware Layout Generation with User Constraints for Poster Design
Updated: 2026-04-08T11:38:12Z
Graphic layout is essential in poster generation. Professionals often need to design different layouts for a product image, to ensure they meet specific user requirements. This paper focuses on utilizing a deep-learning model to automatically generate image-aware layouts with user-defined constraints, including layout attributes and partial layouts. Layout attribute constraints require generated layouts to include and exclude elements of specified classes, such as text, logos, underlays, and embellishments. Our model represents different attributes by sampling multidimensional Gaussian noise with different means, and we propose an attribute-consistent loss and an attribute-disentangled loss to ensure that the generated layout satisfies the specified attribute. Partial layout constraints provide our model with incomplete layout information to guide the generation of the remaining elements. We design a partial-constraint loss to incorporate the provided partial layout. Furthermore, we introduce a random mask to diversify the partial layout constraints, which can encourage the model to learn more general latent representations of the provided partial layouts. Both quantitative and qualitative evaluations demonstrate that our model can generate different image-aware layouts according to various user constraints while achieving state-of-the-art performance.
Published: 2026-04-08T11:38:12Z
Authors: Chenchen Xu, Kaixin Han, Weiwei Xu

http://arxiv.org/abs/2512.07527v3
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images
Updated: 2026-04-08T08:18:16Z
City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax.
This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation. Project page can be found at https://pku-vcl-geometry.github.io/Orbit2Ground/.
Published: 2025-12-08T13:01:12Z
Comment: Accepted by CVPR 2026 Findings.
Project page: https://pku-vcl-geometry.github.io/Orbit2Ground/
Authors: Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen

http://arxiv.org/abs/2603.20284v2
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Updated: 2026-04-08T07:51:15Z
Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal variants of VGGT address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
Published: 2026-03-18T06:36:46Z
Comment: 10 pages, 6 figures. Accepted by CVPR 2026.
This version includes supplementary material
Authors: Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu

http://arxiv.org/abs/2604.06494v1
DesigNet: Learning to Draw Vector Graphics as Designers Do
Updated: 2026-04-07T21:55:20Z
AI-driven content generation has made remarkable progress in recent years. However, neural networks and human designers operate in fundamentally different ways, making collaboration between them challenging. We address this gap for Scalable Vector Graphics (SVG) by equipping neural networks with tools commonly used by designers, such as axis alignment and explicit continuity control at command junctions. We introduce DesigNet, a hierarchical Transformer-VAE that operates directly on SVG sequences with a continuous command parameterization. Our main contributions are two differentiable modules: a continuity self-refinement module that predicts $C^0$, $G^1$, and $C^1$ continuity for each curve point and enforces it by modifying Bézier control points, and an alignment self-refinement module with snapping capabilities for horizontal or vertical lines. DesigNet produces editable outlines and achieves competitive results against state-of-the-art methods, with notably higher accuracy in continuity and alignment. These properties ensure the outputs are easier to refine and integrate into professional design workflows. Source Code: https://github.com/TomasGuija/DesigNet.
Published: 2026-04-07T21:55:20Z
Authors: Tomas Guija-Valiente, Iago Suárez

http://arxiv.org/abs/2604.06358v1
GS-Surrogate: Deformable Gaussian Splatting for Parameter Space Exploration of Ensemble Simulations
Updated: 2026-04-07T18:37:15Z
Exploring ensemble simulations is increasingly important across many scientific domains. However, supporting flexible post-hoc exploration remains challenging due to the trade-off between storing the expensive raw data and flexibly adjusting visualization settings.
Existing visualization surrogate models have improved this workflow, but they either operate in image space without an explicit 3D representation or rely on neural radiance fields that are computationally expensive for interactive exploration and encode all parameter-driven variations within a single implicit field. In this work, we introduce GS-Surrogate, a deformable Gaussian Splatting-based visualization surrogate for parameter-space exploration. Our method first constructs a canonical Gaussian field as a base 3D representation and adapts it through sequential parameter-conditioned deformations. By separating simulation-related variations from visualization-specific changes, this explicit formulation enables efficient and controllable adaptation to different visualization tasks, such as isosurface extraction and transfer function editing. We evaluate our framework on a range of simulation datasets, demonstrating that GS-Surrogate enables real-time and flexible exploration across both simulation and visualization parameter spaces.
Published: 2026-04-07T18:37:15Z
Authors: Ziwei Li, Rumali Perera, Angus Forbes, Ken Moreland, Dave Pugmire, Scott Klasky, Wei-Lun Chao, Han-Wei Shen

http://arxiv.org/abs/2604.02320v2
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Updated: 2026-04-07T17:25:24Z
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities.
To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
Published: 2026-04-02T17:58:40Z
Comment: Accepted in CVPR2026. Website: https://junxuan-li.github.io/lca
Authors: Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito

http://arxiv.org/abs/2506.18601v2
BulletGen: Improving 4D Reconstruction with Bullet-Time Generation
Updated: 2026-04-07T16:31:10Z
Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation.
In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen "bullet-time" step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis and 2D/3D tracking tasks.
Published: 2025-06-23T13:03:42Z
Comment: Accepted at CVPR 2026 Workshop "4D World Models: Bridging Generation and Reconstruction"
Authors: Denis Rozumny, Jonathon Luiten, Numair Khan, Johannes Schönberger, Peter Kontschieder

http://arxiv.org/abs/2601.18336v2
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
Updated: 2026-04-07T14:58:50Z
Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves state-of-the-art performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available.
The source code is available at: https://github.com/nv-tlabs/ppisp
Published: 2026-01-26T10:23:43Z
Comment: For more details and updates, please visit our project website: https://research.nvidia.com/labs/sil/projects/ppisp/
Authors: Isaac Deutsch, Nicolas Moënne-Loccoz, Gavriel State, Zan Gojcic

http://arxiv.org/abs/2604.05794v1
EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
Updated: 2026-04-07T12:30:19Z
Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy.
On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.
Published: 2026-04-07T12:30:19Z
Comment: 10 pages, 6 figures, conference
Authors: Da Li, Dominik Engel, Deng Luo, Ivan Viola

http://arxiv.org/abs/2504.14135v3
Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering
Updated: 2026-04-07T08:59:35Z
High-fidelity simulation is essential for robotics research, enabling safe and efficient testing of perception, control, and navigation algorithms. However, achieving both photorealistic rendering and accurate physics modeling remains a challenge. This paper presents a novel simulation framework, the Unreal Robotics Lab (URL), that integrates the advanced rendering capabilities of the Unreal Engine with MuJoCo's high-precision physics simulation. Our approach enables realistic robotic perception while maintaining accurate physical interactions, facilitating benchmarking and dataset generation for vision-based robotics applications. The system supports complex environmental effects, such as smoke, fire, and water dynamics, which are critical to evaluating robotic performance under adverse conditions. We benchmark visual navigation and SLAM methods within our framework, demonstrating its utility for testing real-world robustness in controlled yet diverse scenarios. By bridging the gap between physics accuracy and photorealistic rendering, our framework provides a powerful tool for advancing robotics research and sim-to-real transfer.
Our open-source framework is available at https://unrealroboticslab.github.io/.
Published: 2025-04-19T01:54:45Z
Authors: Jonathan Embley-Riches, Jianwei Liu, Simon Julier, Dimitrios Kanoulas

http://arxiv.org/abs/2604.05547v1
COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration
Updated: 2026-04-07T07:45:17Z
Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.
Published: 2026-04-07T07:45:17Z
Comment: 10 pages, 3 figures, preprint paper
Authors: Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang

http://arxiv.org/abs/2604.05525v1
CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation
Updated: 2026-04-07T07:24:21Z
Crowds do not merely move; they decide.
Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges (limited agent-centric supervision in crowd datasets, unstable per-frame control, and success-biased datasets) through: (i) agent-centric visual supervision via semantically reconstructed environments and Low-Rank Adaptation (LoRA) fine-tuning of a pretrained vision-language model, (ii) a motion skill action space that bridges symbolic decision making and continuous locomotion, and (iii) exploration-based question answering that exposes agents to counterfactual actions and their outcomes through simulation rollouts. Our results shift crowd simulation from motion-centric synthesis toward perception-driven, consequence-aware decision making, enabling crowds that move not just realistically, but meaningfully.
Published: 2026-04-07T07:24:21Z
Authors: Juyeong Hwang, Seong-Eun Hong, Jinhyun Kim, JaeYoung Seon, Giljoo Nam, Hanyoung Jang, HyeongYeop Kang

http://arxiv.org/abs/2605.13855v1
SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method
Updated: 2026-04-07T06:04:37Z
3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials.
To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods proposes to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the interdependence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of Gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering. Project page:
Published: 2026-04-07T06:04:37Z
Authors: Wentao Yang, Fanzhen Kong, Zejian Kang, Xiangru Huang

http://arxiv.org/abs/2601.03323v3
Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
Updated: 2026-04-07T04:22:26Z
Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation.
We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. The project page is available at https://oranduanstudy.github.io/LRCM/.
Published: 2026-01-06T14:59:22Z
Comment: 12 pages, 13 figures
Authors: Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu

http://arxiv.org/abs/2604.05394v1
Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters
Updated: 2026-04-07T03:47:14Z
Physics-based character animation has become a fundamental approach for synthesizing realistic, physically plausible motions. While current data-driven deep reinforcement learning (DRL) methods can synthesize complex skills, they struggle to reproduce exaggerated, stylized motions, such as instantaneous dashes or mid-air trajectory changes, which are required in animation but violate standard physical laws. The primary limitation stems from modeling the character as an underactuated floating-base system, in which internal joint torques and momentum conservation strictly govern motion. Direct attempts to enforce such motions via external wrenches often lead to training instability, as velocity discontinuities produce sparse, high-magnitude force spikes that prevent policy convergence. We propose Assistive Impulse Neural Control, a framework that reformulates external assistance in impulse space rather than force space to ensure numerical stability.
We decompose the assistive signal into an analytic high-frequency component derived from Inverse Dynamics and a learned low-frequency residual correction, governed by a hybrid neural policy. We demonstrate that our method enables robust tracking of highly agile, dynamically infeasible maneuvers that were previously intractable for physics-based methods.
Published: 2026-04-07T03:47:14Z
Authors: Zhiquan Wang, Bedrich Benes