https://arxiv.org/api///WeAW8itRpBlsFjPjaDzozYo0I2026-06-13T19:41:06Z932318015http://arxiv.org/abs/2605.23508v1DrawVideo: Generating Long Video from Storyboard Keyframe Sketches2026-05-22T11:16:05ZLong video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.2026-05-22T11:16:05Z45 pages, 19 figuresChuanzhi XuHuiqi LiangBang ShiHuiming ZhangYifan XiaoGuangcheng LinHaodong ChenQiang QuZhicheng LuWeidong Caihttp://arxiv.org/abs/2605.23462v1Closing Trajectories: Equation-Free Cyclic Animation via Koopman Surrogates2026-05-22T10:23:07ZCyclic animation is widely used in computer graphics and interactive content.It supports seamless playback in games, VR, and interactive simulation,where short clips must repeat smoothly over long durations. Achievingphysically plausible cyclic synthesis from an input sequence is challengingbecause the endpoint states of the observed sequence rarely match exactly,and the governing equations of the underlying system are often unavailable.We therefore propose an equation-free framework that identiffes a Koopmansurrogate from the observed trajectory and computes a cyclic trajectory byapplying a Fourier-parameterized, time-varying control force under a hardtemporal periodicity constraint. The resulting formulation reduces cyclicsynthesis to a linearly constrained quadratic program that can be solvedefffciently through a structured KKT system. Our method is applicable toa diverse range of examples, including N-body systems, cloth, deformableobjects, shallow water, etc.2026-05-22T10:23:07ZShixun HuangSiyuan ChenYue ChangZhecheng WangPeter Yichen Chenhttp://arxiv.org/abs/2605.26137v1AssetGen: Deployable 3D Asset Generation at Interactive Speed2026-05-22T04:58:06ZWhile 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.2026-05-22T04:58:06ZDilin WangXiaoyu XiangKihyuk SohnTom MonnierYu-Ying YehThu Nguyen-PhuocJiawen ZhangYuchen FanAntoine ToisoulHyunyoung JungPrithviraj DharMichael BunnellNikolaos SarafianosChuhang ZouRoman ShapovalovAndrea VedaldiRakesh Ranjanhttp://arxiv.org/abs/2605.23088v1YASPS: A Symbolic Framework for Extensible, High-Performance IPC Simulation2026-05-21T22:39:21ZIncremental Potential Contact (IPC) enables robust, contact-rich simulation by casting elasticity and contact as a single energy minimization problem, but high-performance IPC pipelines are typically built from specialized kernels and assembly logic tied to fixed energies, primitive types, and parameterizations, making extensions costly and combinatorial. We present YASPS, a GPU-oriented framework that removes this extensibility bottleneck by making structure explicit in a differentiable intermediate representation. YASPS introduces two first-class relational operators: JOIN, which composes dependent quantities across user-declared relations (e.g., element-to-vertex connectivity), and UNION, which represents alternative parameterizations within a relation (e.g., mixing free vertices with affine-body or other parameterizations without fragmenting the program). Because JOIN and UNION are part of the symbolic program, YASPS differentiates through them using dedicated rules and an efficient second-order procedure that reuses intermediate Jacobians and reduces Hessian-projection cost. From the same relational description, YASPS derives the global gradient/Hessian sparsity and block layout, enabling structure-aware block-sparse storage and compression, and JIT-compiles CUDA kernels for evaluation, derivatives, assembly, and solving. Across IPC-style examples, including layered cloth-on-bunny, mixed rigid/deformable bunnies, and a caged deformation model, YASPS supports rapid front-end extensions with minimal back-end changes while achieving competitive end-to-end performance; its Hessian compression yields near 10x faster CG iterations in our benchmarks.2026-05-21T22:39:21ZAccepted to Siggraph 2026Xuan TangKemeng HuangGilbert BernsteinMinchen LITzumao Li10.1145/3811327http://arxiv.org/abs/2511.07820v3SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control2026-05-21T17:26:49ZDespite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.2025-11-11T04:37:40ZProject page: https://nvlabs.github.io/SONIC/Zhengyi LuoYe YuanTingwu WangChenran LiFernando CastañedaSirui ChenZi-Ang CaoJiefeng LiDavid MinorQingwei BenJinhyung ParkDavid SamiZi WangXingye DaRunyu DingCyrus HoggLina SongEdy LimEugene JeongTairan HeHaoru XueWenli XiaoSimon YuenJan KautzYan ChangUmar IqbalLinxi "Jim" FanYuke Zhuhttp://arxiv.org/abs/2605.22597v1MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy2026-05-21T15:13:49ZLearning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/.2026-05-21T15:13:49ZInternational Conference on Machine Learning 2026Jiaxu WangJunhao HeJingkai SunYi GuYunyang MoJiahang CaoQiang ZhangRenjing Xuhttp://arxiv.org/abs/2605.22284v1moveEZ: An R Package for Animated Biplots2026-05-21T10:37:30ZThe moveEZ (pronounced move easy) R package provides tools for constructing animated PCA biplots that reveal how multivariate structure evolves across the ordered levels of a categorical variable. Built as an extension to the biplotEZ package, moveEZ offers three animation frameworks of increasing methodological complexity: a fixed variable frame, in which variable vectors remain constant and only sample positions are animated; and two dynamic frames, in which both sample positions and variable vectors are recomputed and animated at each level. The dynamic frames support Procrustes alignment and reflection to ensure visual continuity across levels, and are compatible with high-dimensional datasets including grouped structures. The package integrates with gganimate to produce high-quality animations suitable for publications and presentations, and supports both animated and static faceted displays via a single argument. Although originally motivated by tracking shifts in African climate indicators, moveEZ is domain-agnostic and applicable wherever multivariate measurements are recorded repeatedly across an ordered categorical variable, including economic, ecological, and biological settings.2026-05-21T10:37:30ZR packageRaeesa GaneyJohané Nienkemper-Swanepoelhttp://arxiv.org/abs/2605.22013v1PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought2026-05-21T05:19:51ZUnderstanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.2026-05-21T05:19:51ZChaoqi ChenQile XuWenjun ZhouHui Huanghttp://arxiv.org/abs/2604.17623v3ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes2026-05-20T22:21:05ZKinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.2026-04-19T21:21:11ZProject page: https://honglin-c.github.io/vips/Honglin ChenKarran PandeyRundi WuMatheus GadelhaYannick Hold-GeoffroyAyush TewariNiloy J. MitraChangxi ZhengPaul Guerrerohttp://arxiv.org/abs/2605.21766v1BodyReLux: Temporally Consistent Full-Body Video Relighting2026-05-20T21:57:31ZBeing able to relight human performance is a fundamental task for post production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.2026-05-20T21:57:31ZSiggraph 2026 Journal Track. Project page: https://eyeline-labs.github.io/bodyrelux/Li MaMingming HeXueming YuDavid M. GeorgeAhmet Levent TaşelPaul DebevecJulien Philip10.1145/3811352http://arxiv.org/abs/2605.17855v2Accelerating 3D Gaussian Splatting using Tensor Cores2026-05-20T21:11:27Z3D Gaussian Splatting (3DGS) has become a leading technique for real-time neural rendering and 3D scene reconstruction, but its rendering cost remains too high for many latency-sensitive scenarios. In particular, the rasterization stage in 3DGS dominates end-to-end rendering time, during which the renderer repeatedly evaluates each Gaussian's contribution to each covered pixel, making this stage compute-bound. At the same time, modern GPUs provide high-throughput Tensor Cores for low-precision matrix operations, yet existing 3DGS systems execute rasterization entirely on CUDA cores and leave Tensor Cores idle. We find that 3DGS rendering can be executed in FP16 with negligible quality degradation, suggesting a promising opportunity for Tensor Core acceleration. However, exploiting Tensor Cores for 3DGS is non-trivial because rasterization does not naturally match their execution model. Existing 3DGS rasterization is expressed as irregular per-pixel scalar operations, whereas Tensor Cores require dense, regular, and reuse-rich matrix workloads. Moreover, conventional tile-by-tile execution fails to exploit Gaussian reuse across neighboring tiles, resulting in repeated data loading and thus high data movement overhead. To this end, we present TensorGS, a 3DGS acceleration framework using Tensor Cores. TensorGS tensorizes the dominant rasterization computation into Tensor-Core-compatible matrix operations and introduces cross-tile grouping to improve Gaussian reuse, amortize overhead, and increase Tensor Core utilization. Experimental results show that TensorGS improves end-to-end rendering performance by 1.65$\times$ while preserving image quality.2026-05-18T04:53:03ZSheng LiYang SuiYue WuZhuoran SongBo YuanXulong TangYue Daihttp://arxiv.org/abs/2402.06795v2Squidgets: Sketch-based Widget Design for Scene Manipulation2026-05-20T19:37:45ZPeople naturally sketch strokes over graphical scenes to convey scene changes. We propose automatically interpreting these strokes to execute scene changes with squidgets (sketch-widgets), a novel sketch-based UI framework for direct scene manipulation. Squidgets are motivated by the observation that curves resulting from visually abstracting scene elements provide natural handles for the direct manipulation of scene parameters. Additional curves can be defined by users to author custom handles associated with scene attributes. Users manipulate a scene by simply drawing strokes, partially matched against scene curves to select a squidget and interactively control associated parameters. We present an implementation of squidgets within the 3D animation system Maya, showing 2D/3D stroke input to manipulate 2D/3D scenes. We report on a controlled experiment evaluating squidgets on 2D object translation and deformation tasks, and a broader informal study on squidget creation and manipulation.2024-02-09T21:40:23ZProceedings of the 38th Annual ACM Symposium on User Interface Software and Technology 2025Joonho KimFanny ChevalierKaran Singh10.1145/3746059.3747690http://arxiv.org/abs/2605.21478v1Latent Dynamics for Full Body Avatar Animation2026-05-20T17:58:03ZPose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.2026-05-20T17:58:03ZSupplementary video: https://youtu.be/xjnr3YM0yIEShichong PengChengxiang YinFei JiangZhongshi JiangLingchen YangQingyang TanAmin JourablooJason SaragihKe LiChristian Hänehttp://arxiv.org/abs/2605.15305v4WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer2026-05-20T14:25:31ZA unified simulator that can model diverse physical phenomena without solver-specific redesign is a long-standing goal across simulation science. We present a learning-based particle simulator built on a single transformer architecture to model cloth, elastic solds, Newtonian and non-Newtonian fluids, granular materials, and molecular dynamics. Our model follows a prediction-correction design on a shared Lagrangian particle representation. An explicit predictor first advances particles under the known external forces, producing an intermediate state that captures externally driven motion but not inter-particle interactions. A learned corrector then predicts the residual position and velocity updates through three stages: a particle tokenizer that encodes local particle-particle, particle-boundary, and topology-guided interactions; a super-token encoder that hierarchically merges particle tokens into a compact set of super tokens via alternating self-attention and token merging; and a super-token decoder that lifts these super tokens back to particle resolution through cross-attention to predict per-particle position and velocity corrections. Progressive token merging reduces the attention cost at successive encoder layers by halving the token count at each level, and the decoder communicates through the compact super-token set rather than full particle-to-particle attention. Across the six dynamics categories, the same architecture generalizes to unseen materials, boundary configurations, initial conditions, and external forces. We further demonstrate downstream interactive control, inverse design, and learning from real-world manipulation data, reducing the need for per-phenomenon solver engineering.2026-05-14T18:18:12ZCaoliwen WangMinghao GuoSiyuan ChenHeng ZhangMengdi WangXingyu NiHanson SunKunyi WangZherong PanKui WuLingjie LiuYin YangChenfanfu JiangTaku KomuraWojciech MatusikPeter Yichen Chenhttp://arxiv.org/abs/2605.21121v1ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation2026-05-20T12:50:52ZSingle-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.2026-05-20T12:50:52ZHanxiao SunMingxin YangShuhui YangZebin HeXintong HanHongbo FuChunchao GuoWenhan Luo