https://arxiv.org/api/V41TmOoy8qM/d6pM8NWmmTQ+ipU 2026-06-14T04:17:21Z 9323 300 15 http://arxiv.org/abs/2606.02586v1 Fewer, Better Frames: A Compute-Normalized Proof of Concept for Coherence-First World-Model Rendering with Model-Guided FSR4 Frame Generation 2026-05-11T16:42:10Z World models are often evaluated by native frame cadence, but higher nominal frame rate can trade away long-horizon scene stability. This article reports an independent proof of concept implemented using Overworld's Waypoint-1.5 family and WorldEngine runtime on a Windows fallback stack with ONNX Runtime + DirectML and an FSR4 DX12 bridge. The tested coherence-first branch generates higher-context anchor frames at a 15 FPS presentation-timeline cadence and reconstructs presentation to 30 FPS using latent-delta motion guidance and synthesized depth. It is compared against a lower-context cadence-first baseline that generates about 30 FPS natively under the same seed, route, control script, target presentation duration, and local time-scaling regime. Across forest, sword, desert, and snow scenes, the coherence-first branch preserves path geometry, object identity, large silhouettes, and depth layering longer, while the baseline degrades earlier into brightness drift and geometric distortion. Lightweight temporal metrics and paired videos support the visual comparison, with LPIPS favoring the coherence-first branch across all tested scenes. Here compute-normalized means approximately matched same-GPU, same-timescale operating points, not exact FLOP parity or measured realtime throughput. A separate heavier sword-scene probe suggests local non-monotonicity: more context and denoising did not automatically improve quality. These results support coherence-first allocation as a practical proof-of-concept strategy under limited inference budget, not as a finished realtime renderer. 
2026-05-11T16:42:10Z 19 pages, 8 figures, independent systems proof of concept Paweł Katarzyński http://arxiv.org/abs/2601.22143v2 JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion 2026-05-11T12:30:46Z Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines. 
2026-01-29T18:57:13Z Project webpage available at https://justdubit.github.io Anthony Chen Naomi Ken Korem Gal Zeevi Tavi Halperin Matan Ben Yosef Urska Jelercic Ofir Bibi Or Patashnik Daniel Cohen-Or http://arxiv.org/abs/2605.10457v1 Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation 2026-05-11T12:28:49Z Real-time Light Detection And Ranging (LiDAR) simulation must find, per emitted ray, the closest intersecting triangle even in dynamic scenes containing large numbers of moving and deformable objects. Dominant acceleration-structure approaches require rebuilding each frame for dynamic geometry -- a cost that compounds directly with scene dynamics and cannot be amortized regardless of how little actually changed. This paper presents the Gajmer Ray-Casting Algorithm (GRCA), which inverts the question: instead of asking what does each ray hit? it asks which rays can each triangle possibly hit? GRCA geometrically models spinning LiDAR emitters as rotation-traced cones or planes and uses each triangle's emitter-centric apparent area to cull, per triangle, which channels and the rays within those channels can possibly reach it -- without any acceleration structure. GRCA is compute-based and vendor-agnostic by design, targeting highly dynamic, high-resolution simultaneous multi-sensor simulation. At its core, GRCA is a general-purpose ray-casting algorithm: the emitter-centric inversion applies to any setting where rays originate from a known position, not only LiDAR. Benchmarks evaluate 2-8 simultaneous 128x4096-ray LiDARs (360deg/180deg) over complex dynamic scenes -- with just two sensors casting ~1M rays per frame. With range culling inactive, GRCA reaches up to 7.97x over hardware-accelerated OptiX (GPU) and 14.55x over Embree (CPU). 
Two independent extensions further boost performance even in the most complex scene (~22M triangles, ~9M of which are dynamic, 8 LiDARs): range culling at realistic deployment ranges (10-100m) reaches up to 7.02x GPU and 9.33x CPU; a hybrid pipeline -- GRCA for dynamic geometry, OptiX/Embree for static -- reaches up to 10.5x GPU and 19.2x CPU. 2026-05-11T12:28:49Z 21 pages, 20 figures Rabin Gajmer Joonas Haapala Zoltan Beck http://arxiv.org/abs/2605.10307v1 PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction 2026-05-11T10:06:41Z Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. 
Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing. 2026-05-11T10:06:41Z Accepted by TCSVT. Project Url: https://pamosplat.github.io Yinan Deng Jianyu Dou Jiahui Wang Jingyu Zhao Yi Yang Yufeng Yue 10.1109/TCSVT.2026.3691475 http://arxiv.org/abs/2605.10014v1 Elemental Alchemist: A Generative Interface for Semantic Control of Particle Systems Across Dynamic Levels of Abstraction 2026-05-11T05:40:34Z Editing particle-system visual effects (VFX) is vital for digital storytelling, but achieving controllable, art-directable results remains challenging due to their multi-dimensional nature. Given a large collection of parameters, users must find the ones relevant to their creative goals -- a task that requires a systematic understanding of the particle system and how parameters map to high-level intents, such as making a fire look angry. Elemental Alchemist is a generative interface that transforms user intent into contextualized controls for semantic editing of particle systems. The system introduces two components: a contextual brush palette that generates tools based on scene context, and a generative control panel that surfaces relevant technical parameters and abstracts them to generate mid-level semantic attributes and high-level conceptual controls. An evaluation with 10 novice and 5 expert VFX practitioners shows the system supported users in translating high-level creative goals into particle system parameters. 2026-05-11T05:40:34Z 23 pages including appendix, 14 figures. 
Accepted at ACM DIS 2026 Kyzyl Monteiro Evan Atherton George Fitzmaurice Qian Zhou http://arxiv.org/abs/2605.09699v1 A Real-Calibrated Synthetic-First Data Engine 2026-05-10T18:34:43Z Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation. 
2026-05-10T18:34:43Z 7 pages, 6 figures Yukang Shen http://arxiv.org/abs/2605.00548v2 Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation 2026-05-10T17:09:06Z Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs. 2026-05-01T10:02:14Z SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/ Nadav Z. Cohen Ofir Abramovich Ariel Shamir 10.1145/3799902.3811104 http://arxiv.org/abs/2605.09362v1 FrameTwin: Curve-Anchored Gaussian Alignment from Sparse Views for Adaptive Wireframe 3D Printing 2026-05-10T06:21:35Z We present FrameTwin, a curve-anchored Gaussian alignment framework that uses sparse-view images to close the control loop for adaptive wireframe 3D printing. 
Our key idea is to capture the deformation of thin wireframe structures from sparse-view images using Gaussian kernels anchored to parametric curves, yielding a compact and geometry-aware encoding that explicitly captures strut topology. Driven by a differentiable rendering pipeline, FrameTwin estimates a neural deformation field that aligns the partially printed target model with the deformed structure observed during fabrication, where the optimized curve-Gaussian representation serves as a digital twin of the evolving wireframe. Unlike general Gaussian-splatting approaches, our formulation constrains kernel placement along parametric curves, substantially reducing the ambiguity inherent in sparse-view observations of thin structures. The resultant deformation-field alignment enforces global consistency across all struts. By using the estimated deformation field to blend the distorted printed geometry with the remaining unprinted geometry, FrameTwin enables adaptive updates to future printing trajectories. We demonstrate that FrameTwin can robustly capture and compensate for deformation in wireframe models fabricated using a robotized 3D printing system. 2026-05-10T06:21:35Z Wenting Wang Zhuo Huang Kun Qian Neelotpal Dutta Yuhu Guo Yingjun Tian Yeung Yam Charlie C. L. Wang http://arxiv.org/abs/2505.23617v3 One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory 2026-05-10T04:23:42Z Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. 
Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall on average on the video-text retrieval task with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks with 4x faster training and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution. 2025-05-29T16:25:35Z ICCV 2025 Chenhao Zheng Jieyu Zhang Mohammadreza Salehi Ziqi Gao Vishnu Iyengar Norimasa Kobori Quan Kong Ranjay Krishna http://arxiv.org/abs/2605.09299v1 LagrangianSplats: Divergence-Free Transport of Gaussian Primitives for Fluid Reconstruction 2026-05-10T03:45:50Z Reconstructing 3D fluid velocity fields from sparse 2D video observations is a highly ill-posed inverse problem, demanding both transport consistency with observed motion and physical validity under fluid laws. Existing methods typically impose these constraints through soft penalties, often leading to compromised accuracy and convergence issues. We introduce a reconstruction framework that structurally enforces both constraints. Specifically, we parameterize the reconstructed velocity using a continuous Divergence-Free Kernel representation, driving the advection of a Lagrangian 3D Gaussian Splatting representation. 
This formulation intrinsically guarantees both flow incompressibility and long-range transport coherence by construction. To enable the efficient optimization of such a constrained system, we introduce a novel Sliding Window scheme that propagates gradients over meaningful temporal horizons while maintaining tractable training costs. Experiments on synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art baselines in both transport consistency and physical accuracy, enabling applications such as high-quality re-simulation and flow analysis. 2026-05-10T03:45:50Z Ningxiao Tao Baoquan Chen Mengyu Chu http://arxiv.org/abs/2605.09279v1 CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting 2026-05-10T03:08:24Z Volumetric video (VV) streaming enables real-time, immersive access to remote 3D environments, powering telepresence, ecological monitoring, and robotic teleoperation. These applications turn VV streaming into a real-time interface to remote physical environments, imposing new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. 3D Gaussian Splatting (3DGS) has been widely used for real-time photorealistic rendering, offering superior visual quality and rendering performance, but it faces challenges due to bandwidth consumption. Furthermore, as the foundation of adaptive VV streaming, existing Levels of Detail (LoD) methods based on density are not well-suited to Gaussian representations, leading to visible gaps and severe quality degradation. Recent studies have also explored attribute compression techniques to reduce bandwidth consumption. Our preliminary studies reveal that aggressive attribute compression primarily causes color distortion, which can be effectively corrected in the rendered image using a reference image. 
Motivated by these findings, we propose a novel Color-Adaptive scheme for adaptive VV streaming that uses vector quantization (VQ) to establish LoDs and correct color distortions with low-resolution reference images. We further present CAGS, an adaptive VV streaming system compatible with diverse Gaussian representations, which integrates the Color-Adaptive scheme by rendering reference images on the streaming server and performing color restoration on the client. Extensive experiments on our prototype system demonstrate that CAGS outperforms the existing adaptive streaming systems in PSNR by 5$\sim$20 dB under fluctuating bandwidth, operates significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations. 2026-05-10T03:08:24Z SIGGRAPH 2026 Conference Paper. Code is available at https://github.com/yindaheng98/ColorAdaptiveGaussianSplatting ACM SIGGRAPH 2026 Daheng Yin Yili Jin Jianxin Shi Isaac Ding Miao Zhang Fangxin Wang Zhaowu Huang Cong Zhang Jiangchuan Liu Fang Dong 10.1145/3799902.3811058 http://arxiv.org/abs/2605.09196v1 RigidFormer: Learning Rigid Dynamics using Transformers 2026-05-09T22:31:09Z Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations, therefore, remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. 
RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components. 2026-05-09T22:31:09Z Project Page: https://people.csail.mit.edu/frankzydou/projects/RigidFormer/index.html Zhiyang Dou Minghao Guo Haixu Wu Doug Roble Tuur Stuyck Wojciech Matusik http://arxiv.org/abs/2605.06063v2 Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures 2026-05-09T20:43:51Z The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture generation still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled evidence remains limited. 
We address this gap with controlled evaluations of co-speech gestures across motion sources, spanning seven representative avatar renderings used in contemporary research and application pipelines. Our results show that avatar and face presentation systematically shift perceptual judgments, and we provide recommendations for benchmarking gesture synthesis as well as for deploying virtual humans in human-facing applications. 2026-05-07T11:46:15Z Haoyang Du Yinghan Xu John Dingliana Brian Keegan Rachel McDonnell Cathy Ennis http://arxiv.org/abs/2605.09024v1 Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination 2026-05-09T16:04:06Z Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. 
Using mipmaps, these directly sample the background texture in image space - implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders. 2026-05-09T16:04:06Z Adrian Azzarelli Nantheera Anantrasirichai James Pollock David R. Bull http://arxiv.org/abs/2605.10996v1 Towards Scalable Persistence-Based Topological Optimization 2026-05-09T15:47:20Z Persistence-based topological optimization deforms a point cloud $X \subset \mathbb{R}^d$ by minimizing objectives of the form $L(X) = \ell(\mathrm{Dgm}(X))$, where $\mathrm{Dgm}(X)$ is a persistence diagram. In practice, optimization is limited by two coupled issues: persistent homology is typically computed on subsamples, and the resulting topological gradients are highly sparse, with only a few anchor points receiving nonzero updates. Motivated by diffeomorphic interpolation, which extends sparse gradients to smooth ambient vector fields via Reproducing Kernel Hilbert Space (RKHS) interpolation, we propose a more scalable pipeline that improves both subsampling and gradient extension. We introduce subsampling via random slicing, a lightweight scheme that promotes iteration-wise geometric coverage and mitigates density bias. We further replace the costly kernel solve with a fast Nadaraya-Watson (NW) Gaussian convolution, producing a globally defined smooth update field at a fraction of the computational cost, while being more suited for topological optimization tasks. 
We provide theoretical guarantees for NW smoothing, including anchor approximation bounds and global Lipschitz estimates. Experiments in $2$D and $3$D show that combining random slicing with NW smoothing yields consistent speedups and improved objective values over other baselines on common persistence losses. 2026-05-09T15:47:20Z Abderrahim Bendahi Alexandre Duplessis Arnaud Fickinger
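The Nadaraya-Watson (NW) Gaussian smoothing named in the last entry can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `nw_smooth`, the bandwidth `sigma`, and the toy data are assumptions chosen for illustration. The sketch takes gradients known only at a few anchor points and spreads them to every point as a Gaussian-weighted average, which yields a field defined everywhere without solving a kernel linear system.

```python
import math


def nw_smooth(points, anchors, anchor_grads, sigma=0.5):
    """Extend sparse anchor gradients to all points (hypothetical helper).

    Each output vector is the Gaussian-weighted average of the anchor
    gradients: sum_i w_i * g_i / sum_i w_i, with
    w_i = exp(-||p - a_i||^2 / (2 * sigma^2)).
    """
    field = []
    for p in points:
        # Gaussian weight of every anchor relative to the query point p.
        weights = [
            math.exp(-sum((pk - ak) ** 2 for pk, ak in zip(p, a)) / (2 * sigma ** 2))
            for a in anchors
        ]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed: p is far from every anchor.
            field.append([0.0] * len(p))
            continue
        field.append(
            [
                sum(w * g[k] for w, g in zip(weights, anchor_grads)) / total
                for k in range(len(p))
            ]
        )
    return field


# Toy data (assumed): two anchors with opposing gradients along x.
points = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
anchors = [[0.0, 0.0], [1.0, 0.0]]
anchor_grads = [[1.0, 0.0], [-1.0, 0.0]]
field = nw_smooth(points, anchors, anchor_grads)
```

In this sketch the smoothed value at an anchor is a weighted average over all anchors, so it only approximates the anchor's own gradient; a smaller `sigma` tightens that approximation at the cost of a less smooth field.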