https://arxiv.org/api/p6bftHH9OyhHoIrVSSqaWqiurno 2026-06-25T19:21:55Z 9383 1320 15 http://arxiv.org/abs/2511.08633v1 Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising 2025-11-09T22:47:50Z Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/. 
2025-11-09T22:47:50Z Assaf Singer Noam Rotstein Amir Mann Ron Kimmel Or Litany http://arxiv.org/abs/2511.06112v1 A computational framework for evaluating an edge-integrated, multi-ramp construction model of the Great Pyramid of Giza 2025-11-08T19:35:13Z Despite decades of study, a quantitative, integrated framework to evaluate minute-scale throughput, geometric control, and a zero external footprint for Khufu's pyramid has been lacking. We test the Integrated Edge-Ramp (IER) model, a helical path formed by omitting and backfilling perimeter courses, using a unified, end-to-end pipeline coupling parametric geometry, discrete-event logistics, and staged finite-element analysis (FEA). An adaptive multi-ramp strategy can sustain 4-6-minute dispatches and yields a median on-site duration of 13.8-20.6 years (95% CI); including quarrying, river transport, and seasonal pauses gives 20-27 years. FEA indicates that stresses and settlements remain within plausible limits for Old Kingdom limestone under self-weight. The model's geometry is also consistent with internal voids identified by muon imaging (a hypothesis-generating result). The IER helps reconcile throughput, survey access, and zero-footprint closure, and produces falsifiable predictions (edge-fill signatures, corner wear). Our study provides a transferable, open-data/code framework for testing construction hypotheses for ancient megastructures. 2025-11-08T19:35:13Z How could Khufu's pyramid be built without external ramps?
This open and reproducible framework provides a quantitative answer by testing an edge-integrated, multi-ramp construction model (preprint under peer review at npj Heritage Science, Nature Portfolio) Vicente Luis Rosell Roig 10.21203/rs.3.rs-7331924/v1 http://arxiv.org/abs/2509.23852v4 SIG-Chat: Spatial Intent-Guided Conversational Gesture Generation Involving How, When and Where 2025-11-08T14:00:48Z The accompanying actions and gestures in dialogue are often closely linked to interactions with the environment, such as looking toward the interlocutor or using gestures to point to the described target at appropriate moments. Speech and semantics guide the production of gestures by determining their timing (WHEN) and style (HOW), while the spatial locations of interactive objects dictate their directional execution (WHERE). Existing approaches either rely solely on descriptive language to generate motions or utilize audio to produce non-interactive gestures, thereby lacking the characterization of interactive timing and spatial intent. This significantly limits the applicability of conversational gesture generation, whether in robotics or in the fields of game and animation production. To address this gap, we present a full-stack solution. We first established a unique data collection method to simultaneously capture high-precision human motion and spatial intent. We then developed a generation model driven by audio, language, and spatial data, alongside dedicated metrics for evaluating interaction timing and spatial accuracy. Finally, we deployed the solution on a humanoid robot, enabling rich, context-aware physical interactions. 
2025-09-28T12:43:09Z Yiheng Huang Junran Peng Silei Shen Jingwei Yang ZeJi Wei ChenCheng Bai Yonghao He Wei Sui Muyi Sun Yan Liu Xu-Cheng Yin Man Zhang Zhaoxiang Zhang Chuanchen Luo http://arxiv.org/abs/2505.08239v3 ACT-R: Adaptive Camera Trajectories for Single View 3D Reconstruction 2025-11-08T09:01:50Z We introduce the simple idea of adaptive view planning to multi-view synthesis, aiming to improve both occlusion revelation and 3D consistency for single-view 3D reconstruction. Instead of producing an unordered set of views independently or simultaneously, we generate a sequence of views, leveraging temporal consistency to enhance 3D coherence. More importantly, our view sequence is not determined by a pre-determined and fixed camera setup. Instead, we compute an adaptive camera trajectory (ACT), forming an orbit, which seeks to maximize the visibility of occluded regions of the 3D object to be reconstructed. Once the best orbit is found, we feed it to a video diffusion model to generate novel views around the orbit, which can then be passed to any multi-view 3D reconstruction model to obtain the final result. Our multi-view synthesis pipeline is quite efficient since it involves no run-time training/optimization, only forward inferences by applying pre-trained models for occlusion analysis and multi-view synthesis. Our method predicts camera trajectories that reveal occlusions effectively and produce consistent novel views, significantly improving 3D reconstruction over SOTA alternatives on the unseen GSO dataset. 
Project Page: https://mingrui-zhao.github.io/ACT-R/ 2025-05-13T05:31:59Z 3DV 2026, Project Page: https://mingrui-zhao.github.io/ACT-R/ Yizhi Wang Mingrui Zhao Hao Zhang http://arxiv.org/abs/2511.05360v1 Neural Image Abstraction Using Long Smoothing B-Splines 2025-11-07T15:50:48Z We integrate smoothing B-splines into a standard differentiable vector graphics (DiffVG) pipeline through linear mapping, and show how this can be used to generate smooth and arbitrarily long paths within image-based deep learning systems. We take advantage of derivative-based smoothing costs for parametric control of fidelity vs. simplicity tradeoffs, while also enabling stylization control in geometric and image spaces. The proposed pipeline is compatible with recent vector graphics generation and vectorization methods. We demonstrate the versatility of our approach with four applications aimed at the generation of stylized vector graphics: stylized space-filling path generation, stroke-based image abstraction, closed-area image abstraction, and stylized text generation. 2025-11-07T15:50:48Z Daniel Berio Michael Stroh Sylvain Calinon Frederic Fol Leymarie Oliver Deussen Ariel Shamir 10.1145/3763345 http://arxiv.org/abs/2305.08186v2 City Street Layout Generation via Conditional Adversarial Learning 2025-11-07T12:07:53Z The demand for high-quality city street layouts has persisted for an extended period presenting notable challenges. Conventional methods are yet to effectively address the integration of both natural and socioeconomic factors in this complex task. In this study, we propose a novel conditional adversarial learning-based method for city street layout generation from natural and socioeconomic conditions. 
Specifically, we design an image synthesis module that leverages an autoencoder to fuse a set of natural and socioeconomic data for a given region of interest into a feature map, and then employs a conditional generative adversarial network trained on real-world data to synthesize street layout images from the feature map. Afterward, a graph extraction module converts each synthesized image to the corresponding high-quality street layout graph. Experiments and evaluations suggest that the proposed method produces diverse city street layouts that closely resemble their real-world counterparts both visually and structurally. This capability can facilitate the creation of high-quality virtual city scenes. 2023-05-14T15:39:38Z in Chinese language Bulletin of Science and Technology, 2025, 41(10):83-90+115 Lehao Yang Cui Zhu Tian Feng 10.13774/j.cnki.kjtb.2025.10.009 http://arxiv.org/abs/2511.05109v1 Efficient representation of 3D spatial data for defense-related applications 2025-11-07T09:50:36Z Geospatial sensor data is essential for modern defense and security, offering indispensable 3D information for situational awareness. This data, gathered from sources like lidar sensors and optical cameras, allows for the creation of detailed models of operational environments. In this paper, we provide a comparative analysis of traditional representation methods, such as point clouds, voxel grids, and triangle meshes, alongside modern neural and implicit techniques like Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS). Our evaluation reveals a fundamental trade-off: traditional models offer robust geometric accuracy ideal for functional tasks like line-of-sight analysis and physics simulations, while modern methods excel at producing high-fidelity, photorealistic visuals but often lack geometric reliability. Based on these findings, we conclude that a hybrid approach is the most promising path forward. 
We propose a system architecture that combines a traditional mesh scaffold for geometric integrity with a neural representation like 3DGS for visual detail, managed within a hierarchical scene structure to ensure scalability and performance. 2025-11-07T09:50:36Z Benjamin Kahl Marcus Hebel Michael Arens 10.1117/12.3069693 http://arxiv.org/abs/2511.05066v1 VEIL: Reading Control Flow Graphs Like Code 2025-11-07T08:21:04Z Control flow graphs (CFGs) are essential tools for understanding program behavior, yet the size of real-world CFGs makes them difficult to interpret. With thousands of nodes and edges, sophisticated graph drawing algorithms are required to present them on screens in ways that make them readable and understandable. However, being designed for general graphs, these algorithms frequently break the natural flow of execution, placing later instructions before earlier ones and obscuring critical program structures. In this paper, we introduce a set of criteria specifically tailored for CFG visualization, focusing on preserving execution order and making complex structures easier to follow. Building on these criteria, we present VEIL, a new layout algorithm that uses dominator analysis to produce clearer, more intuitive CFG layouts. Through a study of CFGs from real-world applications, we show how our method improves readability and provides improved layout performance compared to state of the art graph drawing techniques. 2025-11-07T08:21:04Z Philipp Schaad Tal Ben-Nun Torsten Hoefler http://arxiv.org/abs/2511.05053v1 Accelerating HDC-CNN Hybrid Models Using Custom Instructions on RISC-V GPUs 2025-11-07T07:50:48Z Machine learning based on neural networks has advanced rapidly, but the high energy consumption required for training and inference remains a major challenge. Hyperdimensional Computing (HDC) offers a lightweight, brain-inspired alternative that enables high parallelism but often suffers from lower accuracy on complex visual tasks. 
To overcome this, hybrid accelerators combining HDC and Convolutional Neural Networks (CNNs) have been proposed, though their adoption is limited by poor generalizability and programmability. The rise of open-source RISC-V architectures has created new opportunities for domain-specific GPU design. Unlike traditional proprietary GPUs, emerging RISC-V-based GPUs provide flexible, programmable platforms suitable for custom computation models such as HDC. In this study, we design and implement custom GPU instructions optimized for HDC operations, enabling efficient processing for hybrid HDC-CNN workloads. Experimental results using four types of custom HDC instructions show a performance improvement of up to 56.2 times in microbenchmark tests, demonstrating the potential of RISC-V GPUs for energy-efficient, high-performance computing. 2025-11-07T07:50:48Z Wakuto Matsumi Riaz-Ul-Haque Mian http://arxiv.org/abs/2601.19901v1 Light Field Display Point Rendering 2025-11-06T18:12:19Z Rendering for light field displays (LFDs) requires rendering of dozens or hundreds of views, which must then be combined into a single image on the display, making real-time LFD rendering extremely difficult. We introduce light field display point rendering (LFDPR), which meets these challenges by improving eye-based point rendering [Gavane and Watson 2023] with texture-based splatting, which avoids oversampling of triangles mapped to only a few texels; and with LFD-biased sampling, which adjusts horizontal and vertical triangle sampling to match the sampling of the LFD itself. To improve image quality, we introduce multiview mipmapping, which reduces texture aliasing even though compute shaders do not support hardware mipmapping. We also introduce angular supersampling and reconstruction to combat LFD view aliasing and crosstalk. The resulting LFDPR is 2-8x faster than multiview rendering, with comparable quality.
2025-11-06T18:12:19Z 18 pages, 11 figures, Published in Proceedings of the ACM on Computer Graphics and Interactive Techniques, Vol. 7, Issue. 1 (May 2024) Proceedings of the ACM on Computer Graphics and Interactive Techniques 7.1 (2024) Ajinkya Gavane Benjamin Watson 10.1145/3651300 http://arxiv.org/abs/2508.04825v2 Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off 2025-11-05T18:23:44Z Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization. 2025-08-06T19:10:58Z Accepted to SIGGRAPH Asia 2025, project page: https://nxnai.github.io/Voost/ Seungyong Lee Jeong-gi Kwak http://arxiv.org/abs/2511.03617v1 Visualization Biases MLLM's Decision Making in Network Data Tasks 2025-11-05T16:34:12Z We evaluate how visualizations can influence the judgment of MLLMs about the presence or absence of bridges in a network. 
We show that the inclusion of visualization improves confidence over a structured text-based input that could theoretically be helpful for answering the question. On the other hand, we observe that standard visualization techniques create a strong bias towards accepting or refuting the presence of a bridge -- independently of whether or not a bridge actually exists in the network. While our results indicate that the inclusion of visualization techniques can effectively influence the MLLM's judgment without compromising its self-reported confidence, they also imply that practitioners must be careful of allowing users to include visualizations in generative AI applications so as to avoid undesired hallucinations. 2025-11-05T16:34:12Z This manuscript was presented at VIS x GenAI, a workshop co-located with IEEE VIS 2025 Timo Brand Henry Förster Stephen G. Kobourov Jacob Miller http://arxiv.org/abs/2506.10036v2 Token Perturbation Guidance for Diffusion Models 2025-11-05T16:26:24Z Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. 
Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2$\times$ improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. 2025-06-10T21:25:46Z Accepted at NeurIPS 2025. Project page: https://github.com/TaatiTeam/Token-Perturbation-Guidance Javad Rajabi Soroush Mehraban Seyedmorteza Sadat Babak Taati http://arxiv.org/abs/2508.08826v3 Geometry-Aware Global Feature Aggregation for Real-Time Indirect Illumination 2025-11-05T15:51:33Z Real-time rendering with global illumination is crucial to afford the user a realistic experience in virtual environments. We present a learning-based estimator to predict diffuse indirect illumination in screen space, which is then combined with direct illumination to synthesize globally-illuminated high dynamic range (HDR) results. Our approach tackles the challenges of capturing long-range indirect illumination when employing neural networks and generalizes to complex lighting and scenarios. Treating the neural network as a solver for the rendering equation, we present a novel network architecture to predict indirect illumination. Our network is equipped with a modified attention mechanism that aggregates global information guided by spatial geometry features, as well as a monochromatic design that encodes each color channel individually. We conducted extensive evaluations, and the experimental results show that our approach outperforms previous learning-based techniques. Our approach excels at handling complex lighting such as varying-colored lighting and environment lighting.
It can successfully capture distant indirect illumination and simulate the interreflections between textured surfaces well (i.e., color bleeding effects); it can also effectively handle new scenes that are not present in the training dataset. 2025-08-12T10:36:03Z 10 pages Meng Gai Guoping Wang Sheng Li http://arxiv.org/abs/2511.03147v1 Scheduling the Off-Diagonal Weingarten Loss of Neural SDFs for CAD Models 2025-11-05T03:09:55Z Neural signed distance functions (SDFs) have become a powerful representation for geometric reconstruction from point clouds, yet they often require both gradient- and curvature-based regularization to suppress spurious warp and preserve structural fidelity. FlatCAD introduced the Off-Diagonal Weingarten (ODW) loss as an efficient second-order prior for CAD surfaces, approximating full-Hessian regularization at roughly half the computational cost. However, FlatCAD applies a fixed ODW weight throughout training, which is suboptimal: strong regularization stabilizes early optimization but suppresses detail recovery in later stages. We present scheduling strategies for the ODW loss that assign a high initial weight to stabilize optimization and progressively decay it to permit fine-scale refinement. We investigate constant, linear, quintic, and step interpolation schedules, as well as an increasing warm-up variant. Experiments on the ABC CAD dataset demonstrate that time-varying schedules consistently outperform fixed weights. Our method achieves up to a 35% improvement in Chamfer Distance over the FlatCAD baseline, establishing scheduling as a simple yet effective extension of curvature regularization for robust CAD reconstruction. 2025-11-05T03:09:55Z Lecture Notes in Computer Science (LNCS), 20th International Symposium on Visual Computing 2025, 12 pages, 4 figures, preprint Haotian Yin Przemyslaw Musialski
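The loss-weight scheduling described in the last abstract above (constant, linear, quintic, and step interpolation between a high initial weight and a lower final one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `odw_weight`, the default weights, the midpoint used for the step schedule, and the choice of the standard "smootherstep" polynomial for the quintic case are all assumptions made for illustration, since the abstract does not specify them.

```python
def odw_weight(step, total_steps, w_start=1.0, w_end=0.0, kind="linear"):
    """Hypothetical regularization-weight schedule (illustrative only).

    Interpolates from w_start (at step 0) to w_end (at total_steps).
    The default weights and the step threshold are assumptions.
    """
    # Normalized training progress, clamped to [0, 1].
    t = min(max(step / total_steps, 0.0), 1.0)

    if kind == "constant":
        s = 0.0                                   # weight never changes
    elif kind == "linear":
        s = t                                     # straight-line interpolation
    elif kind == "quintic":
        # Assumed form: smootherstep, 6t^5 - 15t^4 + 10t^3,
        # which has zero slope at both ends.
        s = t ** 3 * (10.0 - 15.0 * t + 6.0 * t * t)
    elif kind == "step":
        s = 0.0 if t < 0.5 else 1.0               # assumed switch at the midpoint
    else:
        raise ValueError(f"unknown schedule kind: {kind}")

    # Blend between the two endpoint weights.
    return w_start + (w_end - w_start) * s


if __name__ == "__main__":
    for kind in ("constant", "linear", "quintic", "step"):
        values = [odw_weight(i, 4, kind=kind) for i in range(5)]
        print(kind, values)
```

Under these assumptions, an increasing warm-up variant is obtained by swapping the endpoints, for example `odw_weight(step, total_steps, w_start=0.0, w_end=1.0)`. In a training loop the returned value would multiply the curvature term before it is added to the total loss.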