https://arxiv.org/api///WeAW8itRpBlsFjPjaDzozYo0I 2026-06-13T19:41:06Z 9323 180 15 http://arxiv.org/abs/2605.23508v1 DrawVideo: Generating Long Video from Storyboard Keyframe Sketches 2026-05-22T11:16:05Z Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation. 
2026-05-22T11:16:05Z 45 pages, 19 figures Chuanzhi Xu Huiqi Liang Bang Shi Huiming Zhang Yifan Xiao Guangcheng Lin Haodong Chen Qiang Qu Zhicheng Lu Weidong Cai http://arxiv.org/abs/2605.23462v1 Closing Trajectories: Equation-Free Cyclic Animation via Koopman Surrogates 2026-05-22T10:23:07Z Cyclic animation is widely used in computer graphics and interactive content. It supports seamless playback in games, VR, and interactive simulation, where short clips must repeat smoothly over long durations. Achieving physically plausible cyclic synthesis from an input sequence is challenging because the endpoint states of the observed sequence rarely match exactly, and the governing equations of the underlying system are often unavailable. We therefore propose an equation-free framework that identifies a Koopman surrogate from the observed trajectory and computes a cyclic trajectory by applying a Fourier-parameterized, time-varying control force under a hard temporal periodicity constraint. The resulting formulation reduces cyclic synthesis to a linearly constrained quadratic program that can be solved efficiently through a structured KKT system. Our method is applicable to a diverse range of examples, including N-body systems, cloth, deformable objects, and shallow water. 2026-05-22T10:23:07Z Shixun Huang Siyuan Chen Yue Chang Zhecheng Wang Peter Yichen Chen http://arxiv.org/abs/2605.26137v1 AssetGen: Deployable 3D Asset Generation at Interactive Speed 2026-05-22T04:58:06Z While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. 
The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows. 2026-05-22T04:58:06Z Dilin Wang Xiaoyu Xiang Kihyuk Sohn Tom Monnier Yu-Ying Yeh Thu Nguyen-Phuoc Jiawen Zhang Yuchen Fan Antoine Toisoul Hyunyoung Jung Prithviraj Dhar Michael Bunnell Nikolaos Sarafianos Chuhang Zou Roman Shapovalov Andrea Vedaldi Rakesh Ranjan http://arxiv.org/abs/2605.23088v1 YASPS: A Symbolic Framework for Extensible, High-Performance IPC Simulation 2026-05-21T22:39:21Z Incremental Potential Contact (IPC) enables robust, contact-rich simulation by casting elasticity and contact as a single energy minimization problem, but high-performance IPC pipelines are typically built from specialized kernels and assembly logic tied to fixed energies, primitive types, and parameterizations, making extensions costly and combinatorial. We present YASPS, a GPU-oriented framework that removes this extensibility bottleneck by making structure explicit in a differentiable intermediate representation. 
YASPS introduces two first-class relational operators: JOIN, which composes dependent quantities across user-declared relations (e.g., element-to-vertex connectivity), and UNION, which represents alternative parameterizations within a relation (e.g., mixing free vertices with affine-body or other parameterizations without fragmenting the program). Because JOIN and UNION are part of the symbolic program, YASPS differentiates through them using dedicated rules and an efficient second-order procedure that reuses intermediate Jacobians and reduces Hessian-projection cost. From the same relational description, YASPS derives the global gradient/Hessian sparsity and block layout, enabling structure-aware block-sparse storage and compression, and JIT-compiles CUDA kernels for evaluation, derivatives, assembly, and solving. Across IPC-style examples, including layered cloth-on-bunny, mixed rigid/deformable bunnies, and a caged deformation model, YASPS supports rapid front-end extensions with minimal back-end changes while achieving competitive end-to-end performance; its Hessian compression yields near 10x faster CG iterations in our benchmarks. 2026-05-21T22:39:21Z Accepted to Siggraph 2026 Xuan Tang Kemeng Huang Gilbert Bernstein Minchen LI Tzumao Li 10.1145/3811327 http://arxiv.org/abs/2511.07820v3 SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control 2026-05-21T17:26:49Z Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. 
We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control. 2025-11-11T04:37:40Z Project page: https://nvlabs.github.io/SONIC/ Zhengyi Luo Ye Yuan Tingwu Wang Chenran Li Fernando Castañeda Sirui Chen Zi-Ang Cao Jiefeng Li David Minor Qingwei Ben Jinhyung Park David Sami Zi Wang Xingye Da Runyu Ding Cyrus Hogg Lina Song Edy Lim Eugene Jeong Tairan He Haoru Xue Wenli Xiao Simon Yuen Jan Kautz Yan Chang Umar Iqbal Linxi "Jim" Fan Yuke Zhu http://arxiv.org/abs/2605.22597v1 MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy 2026-05-21T15:13:49Z Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. 
Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/. 2026-05-21T15:13:49Z International Conference on Machine Learning 2026 Jiaxu Wang Junhao He Jingkai Sun Yi Gu Yunyang Mo Jiahang Cao Qiang Zhang Renjing Xu http://arxiv.org/abs/2605.22284v1 moveEZ: An R Package for Animated Biplots 2026-05-21T10:37:30Z The moveEZ (pronounced move easy) R package provides tools for constructing animated PCA biplots that reveal how multivariate structure evolves across the ordered levels of a categorical variable. 
Built as an extension to the biplotEZ package, moveEZ offers three animation frameworks of increasing methodological complexity: a fixed variable frame, in which variable vectors remain constant and only sample positions are animated; and two dynamic frames, in which both sample positions and variable vectors are recomputed and animated at each level. The dynamic frames support Procrustes alignment and reflection to ensure visual continuity across levels, and are compatible with high-dimensional datasets including grouped structures. The package integrates with gganimate to produce high-quality animations suitable for publications and presentations, and supports both animated and static faceted displays via a single argument. Although originally motivated by tracking shifts in African climate indicators, moveEZ is domain-agnostic and applicable wherever multivariate measurements are recorded repeatedly across an ordered categorical variable, including economic, ecological, and biological settings. 2026-05-21T10:37:30Z R package Raeesa Ganey Johané Nienkemper-Swanepoel http://arxiv.org/abs/2605.22013v1 PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought 2026-05-21T05:19:51Z Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. 
Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios. 2026-05-21T05:19:51Z Chaoqi Chen Qile Xu Wenjun Zhou Hui Huang http://arxiv.org/abs/2604.17623v3 ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes 2026-05-20T22:21:05Z Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. 
Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies. 2026-04-19T21:21:11Z Project page: https://honglin-c.github.io/vips/ Honglin Chen Karran Pandey Rundi Wu Matheus Gadelha Yannick Hold-Geoffroy Ayush Tewari Niloy J. Mitra Changxi Zheng Paul Guerrero http://arxiv.org/abs/2605.21766v1 BodyReLux: Temporally Consistent Full-Body Video Relighting 2026-05-20T21:57:31Z Relighting human performances is a fundamental task for post-production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances, and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high-quality videos. 
To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances. 2026-05-20T21:57:31Z Siggraph 2026 Journal Track. Project page: https://eyeline-labs.github.io/bodyrelux/ Li Ma Mingming He Xueming Yu David M. George Ahmet Levent Taşel Paul Debevec Julien Philip 10.1145/3811352 http://arxiv.org/abs/2605.17855v2 Accelerating 3D Gaussian Splatting using Tensor Cores 2026-05-20T21:11:27Z 3D Gaussian Splatting (3DGS) has become a leading technique for real-time neural rendering and 3D scene reconstruction, but its rendering cost remains too high for many latency-sensitive scenarios. In particular, the rasterization stage in 3DGS dominates end-to-end rendering time, during which the renderer repeatedly evaluates each Gaussian's contribution to each covered pixel, making this stage compute-bound. At the same time, modern GPUs provide high-throughput Tensor Cores for low-precision matrix operations, yet existing 3DGS systems execute rasterization entirely on CUDA cores and leave Tensor Cores idle. We find that 3DGS rendering can be executed in FP16 with negligible quality degradation, suggesting a promising opportunity for Tensor Core acceleration. However, exploiting Tensor Cores for 3DGS is non-trivial because rasterization does not naturally match their execution model. Existing 3DGS rasterization is expressed as irregular per-pixel scalar operations, whereas Tensor Cores require dense, regular, and reuse-rich matrix workloads. Moreover, conventional tile-by-tile execution fails to exploit Gaussian reuse across neighboring tiles, resulting in repeated data loading and thus high data movement overhead. 
To this end, we present TensorGS, a 3DGS acceleration framework using Tensor Cores. TensorGS tensorizes the dominant rasterization computation into Tensor-Core-compatible matrix operations and introduces cross-tile grouping to improve Gaussian reuse, amortize overhead, and increase Tensor Core utilization. Experimental results show that TensorGS improves end-to-end rendering performance by 1.65$\times$ while preserving image quality. 2026-05-18T04:53:03Z Sheng Li Yang Sui Yue Wu Zhuoran Song Bo Yuan Xulong Tang Yue Dai http://arxiv.org/abs/2402.06795v2 Squidgets: Sketch-based Widget Design for Scene Manipulation 2026-05-20T19:37:45Z People naturally sketch strokes over graphical scenes to convey scene changes. We propose automatically interpreting these strokes to execute scene changes with squidgets (sketch-widgets), a novel sketch-based UI framework for direct scene manipulation. Squidgets are motivated by the observation that curves resulting from visually abstracting scene elements provide natural handles for the direct manipulation of scene parameters. Additional curves can be defined by users to author custom handles associated with scene attributes. Users manipulate a scene by simply drawing strokes, partially matched against scene curves to select a squidget and interactively control associated parameters. We present an implementation of squidgets within the 3D animation system Maya, showing 2D/3D stroke input to manipulate 2D/3D scenes. We report on a controlled experiment evaluating squidgets on 2D object translation and deformation tasks, and a broader informal study on squidget creation and manipulation. 
2024-02-09T21:40:23Z Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology 2025 Joonho Kim Fanny Chevalier Karan Singh 10.1145/3746059.3747690 http://arxiv.org/abs/2605.21478v1 Latent Dynamics for Full Body Avatar Animation 2026-05-20T17:58:03Z Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. 
Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines. 2026-05-20T17:58:03Z Supplementary video: https://youtu.be/xjnr3YM0yIE Shichong Peng Chengxiang Yin Fei Jiang Zhongshi Jiang Lingchen Yang Qingyang Tan Amin Jourabloo Jason Saragih Ke Li Christian Häne http://arxiv.org/abs/2605.15305v4 WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer 2026-05-20T14:25:31Z A unified simulator that can model diverse physical phenomena without solver-specific redesign is a long-standing goal across simulation science. We present a learning-based particle simulator built on a single transformer architecture to model cloth, elastic solids, Newtonian and non-Newtonian fluids, granular materials, and molecular dynamics. Our model follows a prediction-correction design on a shared Lagrangian particle representation. An explicit predictor first advances particles under the known external forces, producing an intermediate state that captures externally driven motion but not inter-particle interactions. A learned corrector then predicts the residual position and velocity updates through three stages: a particle tokenizer that encodes local particle-particle, particle-boundary, and topology-guided interactions; a super-token encoder that hierarchically merges particle tokens into a compact set of super tokens via alternating self-attention and token merging; and a super-token decoder that lifts these super tokens back to particle resolution through cross-attention to predict per-particle position and velocity corrections. Progressive token merging reduces the attention cost at successive encoder layers by halving the token count at each level, and the decoder communicates through the compact super-token set rather than full particle-to-particle attention. 
Across the six dynamics categories, the same architecture generalizes to unseen materials, boundary configurations, initial conditions, and external forces. We further demonstrate downstream interactive control, inverse design, and learning from real-world manipulation data, reducing the need for per-phenomenon solver engineering. 2026-05-14T18:18:12Z Caoliwen Wang Minghao Guo Siyuan Chen Heng Zhang Mengdi Wang Xingyu Ni Hanson Sun Kunyi Wang Zherong Pan Kui Wu Lingjie Liu Yin Yang Chenfanfu Jiang Taku Komura Wojciech Matusik Peter Yichen Chen http://arxiv.org/abs/2605.21121v1 ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation 2026-05-20T12:50:52Z Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. 
These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements. 2026-05-20T12:50:52Z Hanxiao Sun Mingxin Yang Shuhui Yang Zebin He Xintong Han Hongbo Fu Chunchao Guo Wenhan Luo
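
As an illustration of the token-wise view routing mentioned in the ROAR-3D abstract above, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the router scores views, so the dot-product scoring, the hard argmax assignment, and the names `route_tokens`, `tokens`, and `view_features` are all assumptions made for this example.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_tokens(
    tokens: Sequence[Sequence[float]],
    view_features: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each latent token to the index of its highest-scoring view.

    Assumption: relevance is a dot product between a token and one pooled
    feature vector per view. Ties go to the lowest view index.
    """
    if not view_features:
        raise ValueError("need at least one view")
    assignments = []
    for tok in tokens:
        scores = [dot(tok, view) for view in view_features]
        best_view = max(range(len(scores)), key=scores.__getitem__)
        assignments.append(best_view)
    return assignments


# Hypothetical toy data: three 2-D latent tokens and two views.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.6]]
views = [[1.0, 0.0], [0.0, 1.0]]
assignments = route_tokens(tokens, views)
print(assignments)
```

Because the assignment depends only on token-to-view scores, the same function accepts any number of views, which is in the spirit of the arbitrary-view conditioning the abstract describes. A trained router would presumably use learned projections rather than raw dot products.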