http://arxiv.org/abs/2508.09983v1 Story2Board: A Training-Free Approach for Expressive Storyboard Generation 2025-08-13T17:56:26Z

We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.

2025-08-13T17:56:26Z Project page is available at https://daviddinkevich.github.io/Story2Board/ David Dinkevich Matan Levy Omri Avrahami Dvir Samuel Dani Lischinski http://arxiv.org/abs/2508.09830v1 RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians 2025-08-13T14:05:21Z

In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or from 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods, which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called the raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass across unseen datasets in testing.

2025-08-13T14:05:21Z ICCV 2025 Highlight. Shenxing and Jinxi are co-first authors. Code and data are available at: https://github.com/vLAR-group/RayletDF Shenxing Wei Jinxi Li Yafei Yang Siyuan Zhou Bo Yang http://arxiv.org/abs/2508.02368v3 Poncelet triangles: conic loci of the orthocenter and of the isogonal conjugate of a fixed point 2025-08-13T12:17:39Z

We prove that over a Poncelet triangle family interscribed between two nested ellipses $\mathcal{E},\mathcal{E}_c$, (i) the locus of the orthocenter is not only a conic, but it is axis-aligned and homothetic to a $90^\circ$-rotated copy of $\mathcal{E}$, and (ii) the locus of the isogonal conjugate of a fixed point $P$ is also a conic (the expected degree was four); a parabola (resp. line) if $P$ is on the (degree-four) envelope of the circumcircle (resp. on $\mathcal{E}$). We also show that the envelopes of both the circumcircle and the radical axis of the incircle and circumcircle contain a conic component if and only if $\mathcal{E}_c$ is a circle. In the former case, the envelope is the union of two circles!

2025-08-04T12:54:56Z 18 pages, 14 figures, 2 tables Ronaldo A. Garcia Mark Helman Dan Reznik http://arxiv.org/abs/2401.07283v2 FROST-BRDF: A Fast and Robust Optimal Sampling Technique for BRDF Acquisition 2025-08-13T10:07:33Z

Efficient and accurate BRDF acquisition of real-world materials is a challenging research problem that requires sampling millions of incident light and viewing directions. To accelerate the acquisition process, one needs to find a minimal set of sampling directions such that the recovery of the full BRDF is accurate and robust given such samples. In this paper, we formulate BRDF acquisition as a compressed sensing problem, where the sensing operator is one that performs sub-sampling of the BRDF signal according to a set of optimal sample directions. To solve this problem, we propose the Fast and Robust Optimal Sampling Technique (FROST) for designing a provably optimal sub-sampling operator that places light-view samples such that the recovery error is minimized. FROST casts the problem of designing an optimal sub-sampling operator for compressed sensing into a sparse representation formulation under the Multiple Measurement Vector (MMV) signal model. The proposed reformulation is exact, i.e., without any approximations, hence it converts an intractable combinatorial problem into one that can be solved with standard optimization techniques. As a result, FROST is accompanied by strong theoretical guarantees from the field of compressed sensing. We perform a thorough analysis of FROST-BRDF using 10-fold cross-validation with publicly available BRDF datasets and show significant advantages compared to the state of the art with respect to reconstruction quality. Finally, FROST is simple, both conceptually and in terms of implementation; it produces consistent results at each run, and it is at least two orders of magnitude faster than the prior art.

2024-01-14T13:02:55Z Submitted to IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG) Ehsan Miandji Tanaboon Tongbuasirilai Saghi Hajisharif Behnaz Kavoosighafi Jonas Unger http://arxiv.org/abs/2506.15290v2 Human Motion Capture from Loose and Sparse Inertial Sensors with Garment-aware Diffusion Models 2025-08-13T08:45:54Z

Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present Garment Inertial Poser (GaIP), a method for estimating full-body poses from sparse and loosely attached IMU sensors. We first simulate IMU recordings using an existing garment-aware human motion dataset. Our transformer-based diffusion models synthesize loose IMU data and estimate human poses from this challenging loose IMU data. We also demonstrate that incorporating garment-related parameters during training on loose IMU data effectively maintains expressiveness and enhances the ability to capture variations introduced by looser or tighter garments. Our experiments show that our diffusion methods trained on simulated and synthetic data outperform state-of-the-art inertial full-body pose estimators, both quantitatively and qualitatively, opening up a promising direction for future research on motion capture from such realistic sensor placements.

2025-06-18T09:16:36Z Accepted by IJCAI 2025 Andela Ilic Jiaxi Jiang Paul Streli Xintong Liu Christian Holz http://arxiv.org/abs/2508.09610v1 DualPhys-GS: Dual Physically-Guided 3D Gaussian Splatting for Underwater Scene Reconstruction 2025-08-13T08:42:17Z

In 3D reconstruction of underwater scenes, traditional methods based on atmospheric optical models cannot effectively deal with the selective attenuation of light wavelengths and the effect of suspended particle scattering, which are unique to the water medium and lead to color distortion, geometric artifacts, and collapsing phenomena at long distances. We propose the DualPhys-GS framework to achieve high-quality underwater reconstruction through a dual-path optimization mechanism. Our approach further develops a dual feature-guided attenuation-scattering modeling mechanism. The RGB-guided attenuation optimization model combines RGB features and depth information and can handle edge and structural details. In contrast, the multi-scale depth-aware scattering model captures scattering effects at different scales using a feature pyramid network and an attention mechanism. Meanwhile, we design several special loss functions. The attenuation-scattering consistency loss ensures physical consistency. The water-body-type adaptive loss dynamically adjusts the weighting coefficients. The edge-aware scattering loss is used to maintain the sharpness of structural edges. The multi-scale feature loss helps to capture global and local structural information. In addition, we design a scene adaptive mechanism that can automatically identify the water-body-type characteristics (e.g., clear coral reef waters or turbid coastal waters) and dynamically adjust the scattering and attenuation parameters and optimization strategies. Experimental results show that our method outperforms existing methods in several metrics, especially in suspended matter-dense regions and long-distance scenes, and the reconstruction quality is significantly improved.
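
The attenuation and scattering terms mentioned above can be illustrated with a commonly used simplified underwater image formation model. The sketch below is not taken from DualPhys-GS: the function name `underwater_pixel`, the per-channel coefficients `beta_d` and `beta_b`, and the veiling light `B_inf` are assumptions chosen for illustration. It models an observed color as an attenuated direct signal plus depth-dependent backscatter.

```python
import math

def underwater_pixel(J, z, beta_d, beta_b, B_inf):
    """Simplified per-channel underwater image formation (illustrative assumption).

    J      -- clean scene color (r, g, b), each in [0, 1]
    z      -- distance from camera to surface
    beta_d -- per-channel attenuation coefficients of the direct signal
    beta_b -- per-channel backscatter coefficients
    B_inf  -- veiling light, the color seen at infinite distance
    """
    return tuple(
        # direct signal decays with distance; backscatter grows toward B_inf
        j * math.exp(-bd * z) + b * (1.0 - math.exp(-bb * z))
        for j, bd, bb, b in zip(J, beta_d, beta_b, B_inf)
    )

# Hypothetical coefficients: red is attenuated fastest, as is typical in water.
clean = (0.9, 0.6, 0.3)
beta_d = (0.8, 0.2, 0.1)
beta_b = (0.5, 0.3, 0.2)
veil = (0.05, 0.35, 0.45)

near = underwater_pixel(clean, 0.0, beta_d, beta_b, veil)
far = underwater_pixel(clean, 1000.0, beta_d, beta_b, veil)
```

At zero distance the observed color equals the clean color, and at very large distance it converges to the veiling light, which is the color shift toward blue-green that a reconstruction method has to undo.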

2025-08-13T08:42:17Z 12 pages, 4 figures Jiachen Li Guangzhi Han Jin Wan Yuan Gao Delong Han http://arxiv.org/abs/2505.18764v2 Efficient Differentiable Hardware Rasterization for 3D Gaussian Splatting 2025-08-13T03:09:42Z

Recent works demonstrate the advantages of hardware rasterization for 3D Gaussian Splatting (3DGS) in forward-pass rendering through fast GPU-optimized graphics and fixed memory footprint. However, extending these benefits to backward-pass gradient computation remains challenging due to graphics pipeline constraints. We present a differentiable hardware rasterizer for 3DGS that overcomes the memory and performance limitations of tile-based software rasterization. Our solution employs programmable blending for per-pixel gradient computation combined with a hybrid gradient reduction strategy (quad-level + subgroup) in fragment shaders, achieving over 10x faster backward rasterization versus naive atomic operations and 3x speedup over the canonical tile-based rasterizer. Systematic evaluation reveals 16-bit render targets (float16 and unorm16) as the optimal accuracy-efficiency trade-off, achieving higher gradient accuracy among mixed-precision rendering formats with execution speeds second only to unorm8, while float32 texture incurs severe forward pass performance degradation due to suboptimal hardware optimizations. Our method with float16 formats demonstrates 3.07x acceleration in full pipeline execution (forward + backward passes) on RTX4080 GPUs with the MipNeRF 360 dataset, outperforming the baseline tile-based renderer while preserving hardware rasterization's memory efficiency advantages -- incurring merely 2.67% of the memory overhead required for splat sorting operations. This work presents a unified differentiable hardware rasterization method that simultaneously optimizes runtime and memory usage for 3DGS, making it particularly suitable for resource-constrained devices with limited memory capacity.

2025-05-24T16:07:33Z 8 pages, 2 figures Yitian Yuan Qianyue He http://arxiv.org/abs/2508.10934v1 ViPE: Video Pose Engine for 3D Geometric Perception 2025-08-12T18:39:13Z

Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have evaluated ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.

2025-08-12T18:39:13Z Paper website: https://research.nvidia.com/labs/toronto-ai/vipe/ Jiahui Huang Qunjie Zhou Hesam Rabeti Aleksandr Korovko Huan Ling Xuanchi Ren Tianchang Shen Jun Gao Dmitry Slepichev Chen-Hsuan Lin Jiawei Ren Kevin Xie Joydeep Biswas Laura Leal-Taixe Sanja Fidler http://arxiv.org/abs/2508.11695v1 RefAdGen: High-Fidelity Advertising Image Generation 2025-08-12T18:25:31Z

The rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques has unlocked opportunities in generating diverse and compelling advertising images based on referenced product images and textual scene descriptions. This capability substantially reduces human labor and production costs in traditional marketing workflows. However, existing AIGC techniques either demand extensive fine-tuning for each referenced image to achieve high fidelity, or they struggle to maintain fidelity across diverse products, making them impractical for e-commerce and marketing industries. To tackle this limitation, we first construct AdProd-100K, a large-scale advertising image generation dataset. A key innovation in its construction is our dual data augmentation strategy, which fosters robust, 3D-aware representations crucial for realistic and high-fidelity image synthesis. Leveraging this dataset, we propose RefAdGen, a generation framework that achieves high fidelity through a decoupled design. The framework enforces precise spatial control by injecting a product mask at the U-Net input, and employs an efficient Attention Fusion Module (AFM) to integrate product features. This design effectively resolves the fidelity-efficiency dilemma present in existing methods. Extensive experiments demonstrate that RefAdGen achieves state-of-the-art performance, showcasing robust generalization by maintaining high fidelity and remarkable visual results for both unseen products and challenging real-world, in-the-wild images. This offers a scalable and cost-effective alternative to traditional workflows. Code and datasets are publicly available at https://github.com/Anonymous-Name-139/RefAdgen.

2025-08-12T18:25:31Z Yiyun Chen Weikai Yang http://arxiv.org/abs/2508.09062v1 VertexRegen: Mesh Generation with Continuous Level of Detail 2025-08-12T16:25:46Z

We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e. vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.
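
The relationship between edge collapse and vertex split that the abstract builds on can be sketched on a plain triangle list. This is a minimal illustration of the two operations as inverses in the spirit of progressive meshes, not VertexRegen's learned model; the names `CollapseRecord`, `edge_collapse`, and `vertex_split` are hypothetical, and vertex positions are omitted for brevity.

```python
from dataclasses import dataclass

Face = tuple[int, int, int]

@dataclass
class CollapseRecord:
    """Everything needed to undo one edge collapse (illustrative structure)."""
    kept: int
    removed: int
    faces_before: list[Face]   # faces that referenced `removed` before the collapse
    faces_after: list[Face]    # what those faces became after the collapse

def edge_collapse(faces: list[Face], kept: int, removed: int):
    """Merge vertex `removed` into `kept`; drop faces that become degenerate."""
    touched = [f for f in faces if removed in f]
    untouched = [f for f in faces if removed not in f]
    rewritten = []
    for f in touched:
        g = tuple(kept if v == removed else v for v in f)
        if len(set(g)) == 3:          # a face with a repeated vertex has zero area
            rewritten.append(g)
    return untouched + rewritten, CollapseRecord(kept, removed, touched, rewritten)

def vertex_split(faces: list[Face], record: CollapseRecord) -> list[Face]:
    """Reverse an edge collapse: restore the faces around the removed vertex."""
    remaining = list(faces)
    for g in record.faces_after:
        remaining.remove(g)
    return remaining + record.faces_before

# A quad with a center vertex 4, triangulated as a fan of four faces.
fine = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
coarse, rec = edge_collapse(fine, kept=0, removed=4)
restored = vertex_split(coarse, rec)
```

Here `coarse` is a valid two-triangle mesh of the same quad, and applying `vertex_split` recovers the finer mesh. Every intermediate state in a sequence of vertex splits is therefore a complete mesh, which is the property that makes stopping at any step meaningful.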

2025-08-12T16:25:46Z ICCV 2025. Project Page: https://vertexregen.github.io/ Xiang Zhang Yawar Siddiqui Armen Avetisyan Chris Xie Jakob Engel Henry Howard-Jenkins http://arxiv.org/abs/2508.08928v1 DASC: Depth-of-Field Aware Scene Complexity Metric for 3D Visualization on Light Field Display 2025-08-12T13:29:30Z

Light field display is one of the technologies providing 3D immersive visualization. However, a light field display generates only a limited number of light rays, which results in finite angular and spatial resolutions. Therefore, 3D content can be shown with high quality only within a narrow depth range around the display screen, denoted as Depth of Field (DoF). Outside this range, due to the appearance of aliasing artifacts, the quality degrades proportionally to the distance from the screen. One solution to mitigate the artifacts is depth of field rendering, which blurs the content in the distorted regions but can result in the removal of scene details. This research proposes a DoF Aware Scene Complexity (DASC) metric that characterizes 3D content based on geometrical and positional factors considering the light field display's DoF. In this research, we also evaluate the observers' preference across different levels of blurriness caused by DoF rendering, ranging from sharp, aliased scenes to overly smoothed alias-free scenes. We have conducted this study over multiple scenes that we created to account for different types of content. Based on the outcome of subjective studies, we propose a model that takes the value of the DASC metric as input and predicts the preferred level of blurring for the given scene as output.

2025-08-12T13:29:30Z 12 pages, submitted to IEEE Transactions on Multimedia Kamran Akbar Robert Bregovic Federica Battisti http://arxiv.org/abs/2508.09235v1 TFZ: Topology-Preserving Compression of 2D Symmetric and Asymmetric Second-Order Tensor Fields 2025-08-12T11:21:49Z

In this paper, we present a novel compression framework, TFZ, that preserves the topology of 2D symmetric and asymmetric second-order tensor fields defined on flat triangular meshes. A tensor field assigns a tensor - a multi-dimensional array of numbers - to each point in space. Tensor fields, such as the stress and strain tensors, and the Riemann curvature tensor, are essential to both science and engineering. The topology of tensor fields captures the core structure of data, and is useful in various disciplines, such as graphics (for manipulating shapes and textures) and neuroscience (for analyzing brain structures from diffusion MRI). Lossy data compression may distort the topology of tensor fields, thus hindering downstream analysis and visualization tasks. TFZ ensures that certain topological features are preserved during lossy compression. Specifically, TFZ preserves degenerate points essential to the topology of symmetric tensor fields and retains eigenvector and eigenvalue graphs that represent the topology of asymmetric tensor fields. TFZ scans through each cell, preserving the local topology of each cell, and thereby ensuring certain global topological guarantees. We showcase the effectiveness of our framework in enhancing the lossy scientific data compressors SZ3 and SPERR.
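
To make the notion of a degenerate point concrete, the sketch below tests whether a triangular cell of a symmetric 2D tensor field contains one, assuming the tensor components are linearly interpolated inside the cell. This is a standard construction and not TFZ's actual procedure; the function names and the `(a, b, d)` encoding of the tensor `[[a, b], [b, d]]` are assumptions. The two eigenvalues coincide exactly where the deviator `((a - d) / 2, b)` vanishes, and since the deviator is linear in the cell, a zero lies strictly inside when the origin is strictly inside the triangle spanned by the three vertex deviators. Borderline cases on cell edges are ignored here.

```python
def deviator(t):
    """Map a symmetric tensor (a, b, d) = [[a, b], [b, d]] to ((a - d) / 2, b)."""
    a, b, d = t
    return ((a - d) / 2.0, b)

def cross(p, q):
    """2D cross product; its sign gives the orientation of q relative to p."""
    return p[0] * q[1] - p[1] * q[0]

def cell_has_degenerate_point(t0, t1, t2):
    """True if the origin lies strictly inside the triangle of vertex deviators."""
    p0, p1, p2 = deviator(t0), deviator(t1), deviator(t2)
    turns = (cross(p0, p1), cross(p1, p2), cross(p2, p0))
    return all(c > 0 for c in turns) or all(c < 0 for c in turns)

def preserves_cell(original, decompressed):
    """Check that lossy decompression did not create or destroy a degenerate point."""
    return cell_has_degenerate_point(*original) == cell_has_degenerate_point(*decompressed)

# Hypothetical cell: three vertex tensors whose deviators surround the origin.
cell = [(2.0, 0.0, 0.0), (0.0, 1.0, 2.0), (0.0, -1.0, 2.0)]
slightly_off = [(2.1, 0.05, 0.0), (0.0, 1.0, 2.0), (0.0, -1.0, 2.0)]
badly_off = [(-2.0, 0.0, 0.0), (0.0, 1.0, 2.0), (0.0, -1.0, 2.0)]
```

A per-cell check of this kind is the sort of local test that a topology-preserving compressor can run after quantization: small errors such as `slightly_off` leave the cell's classification unchanged, while `badly_off` destroys the degenerate point and would have to be corrected.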

2025-08-12T11:21:49Z 29 pages, 27 figures, to be presented at IEEE Vis 2025 (and published in IEEE TVCG 2026) Nathaniel Gorski Xin Liang Hanqi Guo Bei Wang http://arxiv.org/abs/2508.08831v1 DiffPhysCam: Differentiable Physics-Based Camera Simulation for Inverse Rendering and Embodied AI 2025-08-12T10:38:20Z

We introduce DiffPhysCam, a differentiable camera simulator designed to support robotics and embodied AI applications by enabling gradient-based optimization in visual perception pipelines. Generating synthetic images that closely mimic those from real cameras is essential for training visual models and enabling end-to-end visuomotor learning. Moreover, differentiable rendering allows inverse reconstruction of real-world scenes as digital twins, facilitating simulation-based robotics training. However, existing virtual cameras offer limited control over intrinsic settings, poorly capture optical artifacts, and lack tunable calibration parameters -- hindering sim-to-real transfer. DiffPhysCam addresses these limitations through a multi-stage pipeline that provides fine-grained control over camera settings, models key optical effects such as defocus blur, and supports calibration with real-world data. It enables both forward rendering for image synthesis and inverse rendering for 3D scene reconstruction, including mesh and material texture optimization. We show that DiffPhysCam enhances robotic perception performance in synthetic image tasks. As an illustrative example, we create a digital twin of a real-world scene using inverse rendering, simulate it in a multi-physics environment, and demonstrate navigation of an autonomous ground vehicle using images generated by DiffPhysCam.

2025-08-12T10:38:20Z 19 pages, 17 figures, and 4 tables Bo-Hsun Chen Nevindu M. Batagoda Dan Negrut http://arxiv.org/abs/2508.08754v1 Exploring Palette based Color Guidance in Diffusion Models 2025-08-12T09:02:10Z

With the advent of diffusion models, Text-to-Image (T2I) generation has seen substantial advancements. Current T2I models allow users to specify object colors using linguistic color names, and some methods aim to personalize color-object association through prompt learning. However, existing models struggle to provide comprehensive control over the color schemes of an entire image, especially for background elements and less prominent objects not explicitly mentioned in prompts. This paper proposes a novel approach to enhance color scheme control by integrating color palettes as a separate guidance mechanism alongside prompt instructions. We investigate the effectiveness of palette guidance by exploring various palette representation methods within a diffusion-based image colorization framework. To facilitate this exploration, we construct specialized palette-text-image datasets and conduct extensive quantitative and qualitative analyses. Our results demonstrate that incorporating palette guidance significantly improves the model's ability to generate images with desired color schemes, enabling a more controlled and refined colorization process.

2025-08-12T09:02:10Z Accepted to ACM MM 2025 Qianru Qiu Jiafeng Mao Xueting Wang http://arxiv.org/abs/2506.04664v2 A Fast Unsupervised Scheme for Polygonal Approximation 2025-08-12T05:44:15Z

This paper proposes a fast and unsupervised scheme for the polygonal approximation of a closed digital curve. It is demonstrated that the approximation scheme is faster than state-of-the-art approximation schemes and is competitive in terms of Rosin's measure and aesthetic aspects. The scheme comprises four phases: initial segmentation, iterative vertex insertion, iterative merging, and vertex adjustment. The initial segmentation is used to detect sharp turns, that is, vertices that seemingly have high curvature. It is likely that some of the important vertices with low curvature are missed in the first phase; therefore, iterative vertex insertion is used to add vertices in regions where the curvature changes slowly but steadily. The initial phase may pick up some undesirable vertices, and thus merging is used to eliminate redundant vertices. Finally, vertex adjustment is used to enhance the aesthetic appearance of the approximation. The quality of the approximations is measured using Rosin's measure. The robustness of the proposed scheme with respect to geometric transformations is also examined.
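
The iterative vertex insertion phase can be sketched as follows. The abstract does not give the paper's insertion rule, so this example substitutes a well-known criterion: on each polygon side, insert the curve point with the largest perpendicular deviation from the chord, as long as it exceeds a tolerance (the splitting rule of Ramer-Douglas-Peucker-style methods). The names `deviation` and `insert_vertices` and the tolerance `tol` are assumptions.

```python
import math

Point = tuple[float, float]

def deviation(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:                      # chord collapsed to a single point
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (py - ay) - (by - ay) * (px - ax)) / length

def insert_vertices(curve: list[Point], vertices: list[int], tol: float) -> list[int]:
    """Repeatedly add the worst-fitting curve point of each side until all fit."""
    n = len(curve)
    verts = sorted(vertices)
    changed = True
    while changed:
        changed = False
        new = []
        for k, i in enumerate(verts):
            j = verts[(k + 1) % len(verts)]
            new.append(i)
            span = (j - i) % n or n        # number of steps from i to j, wrapping
            best, best_d = None, tol
            for s in range(1, span):
                m = (i + s) % n
                d = deviation(curve[m], curve[i], curve[j])
                if d > best_d:
                    best, best_d = m, d
            if best is not None:
                new.append(best)
                changed = True
        verts = sorted(set(new))
    return verts

# Boundary of a 4x4 square sampled at unit steps, starting from two opposite corners.
square = ([(x, 0) for x in range(4)] + [(4, y) for y in range(4)]
          + [(x, 4) for x in range(4, 0, -1)] + [(0, y) for y in range(4, 0, -1)])
result = insert_vertices(square, [0, 8], tol=0.5)
```

Starting from two opposite corners, the loop adds the remaining two corners and then stops, because every curve point already lies on a side of the polygon. A merging phase would run the opposite test, removing a vertex when the deviation of the merged side stays below the tolerance.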

2025-06-05T06:18:48Z Bimal Kumar Ray