http://arxiv.org/abs/2605.19551v1 AnchorFlow: Editable SVG Reconstruction via Sparse Anchor Point Fields 2026-05-19T08:49:17Z

Image-to-SVG reconstruction aims to produce vector graphics that are faithful to raster inputs and easy to edit. Existing methods face a structural trade-off in how vector structure is parameterized, including how many paths represent an image and how many anchor points define each path. High-fidelity methods often rely on many paths or densely parameterized curves, whereas overly compact SVG generation may deviate from the input geometry. This issue becomes more pronounced when local raster evidence is imperfect, where boundary-following reconstruction can introduce redundant anchors and fragmented structures. We argue that this trade-off should be addressed at the level of anchor placement, since anchors on Bezier curves define local path structure and strongly affect both accuracy and editability. We propose AnchorFlow, an editable SVG reconstruction framework that models path-level anchor placement with sparse anchor point fields. Given path-like foreground components extracted from a raster image, AnchorFlow predicts an image-conditioned sparse anchor field for each component and resolves it into an ordered Bezier path. Rendering-guided feedback then corrects local structural errors before re-resolution. The recovered paths are then assembled and optimized into the final SVG. Experiments on isolated paths and full images show that AnchorFlow achieves a favorable fidelity-editability trade-off, substantially reducing editable complexity while preserving competitive raster fidelity.

2026-05-19T08:49:17Z Mengnan Jiang Christian Franke Michele Franco Adesso Antonio Haas Grace Li Zhang http://arxiv.org/abs/2511.18209v3 MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning 2026-05-19T08:37:01Z

3D human motion generation is pivotal across film, animation, gaming, and embodied intelligence. Traditional 3D motion synthesis relies on costly motion capture, while recent work shows that 2D videos provide rich, temporally coherent observations of human behavior. Existing approaches, however, either map high-level text descriptions to motion or rely solely on video conditioning, leaving a gap between generated dynamics and real-world motion statistics. We introduce MotionDuet, a multimodal framework that aligns motion generation with the distribution of video-derived representations. In this dual-conditioning paradigm, video cues extracted from a pretrained model (e.g., VideoMAE) ground low-level motion dynamics, while textual prompts provide semantic intent. To bridge the distribution gap across modalities, we propose Dual-stream Unified Encoding and Transformation (DUET) and a Distribution-Aware Structural Harmonization (DASH) loss. DUET fuses video-informed cues into the motion latent space via unified encoding and dynamic attention, while DASH aligns motion trajectories with both distributional and structural statistics of video features. An auto-guidance mechanism further balances textual and visual signals by leveraging a weakened copy of the model, enhancing controllability without sacrificing diversity. Extensive experiments demonstrate that MotionDuet generates realistic and controllable human motions, surpassing strong state-of-the-art baselines.
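The abstract does not state the guidance rule itself. As a minimal sketch, assuming the commonly used auto-guidance form in which the main model's prediction is extrapolated away from a weaker copy's prediction, the combination could look like the following. The function name `auto_guide`, the weight `w`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
def auto_guide(main_pred, weak_pred, w):
    """Extrapolate from the weak prediction through the main one.

    Assumed rule (not confirmed by the abstract):
        guided = weak + w * (main - weak)
    w = 1 returns the main prediction unchanged; w > 1 pushes the
    result further away from the weak model's output.
    """
    return [wk + w * (mn - wk) for mn, wk in zip(main_pred, weak_pred)]


# Toy per-step predictions standing in for real model outputs.
main_pred = [1.0, 2.0, 3.0]
weak_pred = [0.5, 2.5, 3.0]

same = auto_guide(main_pred, weak_pred, 1.0)    # equals main_pred
pushed = auto_guide(main_pred, weak_pred, 2.0)  # moves past main_pred
```

Where the two models agree (the last component), guidance has no effect. Where they disagree, the weight controls how strongly the weak model's error direction is suppressed.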

2025-11-22T22:57:40Z Yi-Yang Zhang Tengjiao Sun Pengcheng Fang Deng-Bao Wang Xiaohao Cai Min-Ling Zhang Hansung Kim http://arxiv.org/abs/2504.03758v4 Improved visual-information-driven model for crowd simulation and its modular application 2026-05-19T08:34:59Z

Crowd movement simulation is crucial for pedestrian safety management and facility design. Data-driven models offer the potential to improve realism and predictive accuracy, but most are developed for a single scenario, limiting their flexibility. We propose a data-driven crowd simulation model that incorporates refined visual-information extraction and explicit exit cues, aiming to improve flexibility across multiple scenarios by more effectively capturing core navigational features. The model is tested on four fundamental modules (bottleneck, corridor, corner, and T-junction) and further evaluated in a composite scenario using a modular approach. Results show that our model performs well across these scenarios, aligning with pedestrian movement in real-world experiments, and outperforms the classical knowledge-driven model in these scenarios. The research outcomes can provide inspiration for the development of data-driven crowd simulation models and advance the application of data-driven approaches.

2025-04-02T07:53:33Z Xuanwen Liang, Jiayu Chen, Eric Wai Ming Lee, & Wei Xie (2026). Improved visual-information-driven model for crowd simulation and its modular application. Chaos, Solitons & Fractals, 209, 118481 Xuanwen Liang Jiayu Chen Eric Wai Ming Lee Wei Xie 10.1016/j.chaos.2026.118481 http://arxiv.org/abs/2605.20290v1 TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction 2026-05-19T08:16:44Z

Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scene-level multi-object interactions and introduces richer, complex control types for advanced mechanics-based manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at https://github.com/xinzhang007/TelePhysics.

2026-05-19T08:16:44Z Xin Zhang Yabo Chen Yijie Fang Wanying Qu Haibin Huang Chi Zhang Feng Xu Xuelong Li http://arxiv.org/abs/2605.19484v1 CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing 2026-05-19T07:35:22Z

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

2026-05-19T07:35:22Z Haobo Hu Xiangwu Guo Zhiheng Chen Difei Gao Haotian Liu Libiao Jin Qi Mao http://arxiv.org/abs/2512.04556v3 DISK: Differentiable Sparse Kernel Complex for Efficient Spatially-Variant Convolution 2026-05-19T06:54:30Z

Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining and additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.

2025-12-04T08:20:07Z Accepted as a conference paper at ICLR 2026. OpenReview: https://openreview.net/forum?id=bbuxDoRD2D Zhizhen Wu Zhe Cao Yuchi Huo http://arxiv.org/abs/2605.19411v1 BrepForge: Factorized B-rep Synthesis via Wireframe Composition and Boundary-Conditioned Surface Instantiation 2026-05-19T06:04:26Z

Boundary representation (B-rep) is the de facto standard for modern CAD, yet learning-based B-rep synthesis remains challenging due to the tight coupling between discrete topology and continuous geometry. We observe a fundamental asymmetry in B-reps: while wireframe composition involves high-entropy structural decisions, the interior surface geometry is largely constrained by its boundary loops. Motivated by this observation, we propose BrepForge, a generative framework that factorizes B-rep synthesis into two stages: wireframe composition and boundary-conditioned surface instantiation. In the first stage, a face-aware autoregressive model serializes the wireframe into structured sequences that explicitly encode hierarchical Vertex-Edge-Face (V-E-F) connectivity, yielding a topologically complete scaffold. In the second stage, precise surface geometries are instantiated by incorporating learning-free geometric priors derived from boundaries, transforming the complex synthesis task into a structured refinement process. This factorized approach ensures both topological integrity and geometric precision, effectively addressing the inherent complexities of B-rep modeling. Extensive experiments demonstrate that BrepForge outperforms existing baselines with superior geometric complexity and topological validity.

2026-05-19T06:04:26Z Jing Li Yihang Fu Falai Chen http://arxiv.org/abs/2605.19355v1 Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance 2026-05-19T04:41:11Z

Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self-contact and near-body proximity, remains a challenging problem. While recent geometry-aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry-aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer-based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose-dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction-aware retargeting. Conditioned on these anchors, a graph-based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task-aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state-of-the-art approaches in preserving interaction fidelity across diverse character geometries.

2026-05-19T04:41:11Z SIGGRAPH 2026 / ACM TOG. Project page available at https://suzyn.github.io/space_page/ Soojin Choi Seokhyeon Hong Chaelin Kim Junghyun Nam Junhyuk Jeon Junyong Noh http://arxiv.org/abs/2605.19350v1 CompoSE: Compositional Synthesis and Editing of 3D Shapes via Part-Aware Control 2026-05-19T04:39:48Z

Creating and editing high-quality 3D content remains a central challenge in computer graphics. We address this challenge by introducing CompoSE, a novel method for Compositional Synthesis and Editing of 3D shapes via part-aware control. Our method takes as input a set of coarse geometric primitives (e.g., bounding boxes) that represent distinct object parts arranged in a particular spatial configuration, and synthesizes as output part-separated 3D objects that support localized granular (i.e., compositional) editing of individual parts. The key insight that enables our method is our use of a diffusion transformer architecture that alternates between processing each part locally and aggregating contextual information across parts globally, and features a novel conditioning technique that ensures strong adherence to the user's input. Importantly, our method learns to infer part semantics and symmetries directly from the user's coarse layout guidance, and does not require part-level text prompts. We demonstrate that our method enables powerful part-level editing capabilities, including context-aware substitution, addition, deletion, and style-preserving resizing operations. We show through extensive experiments that our method significantly outperforms existing approaches on guided synthesis, as measured by objective metrics and LLM-based evaluations.

2026-05-19T04:39:48Z Habib Slim Shariq Farooq Bhat Mohamed Elhoseiny Yifan Wang Mike Roberts http://arxiv.org/abs/2601.18993v2 FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction 2026-05-19T03:39:50Z

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

2026-01-26T22:03:46Z 12 pages, 10 figures. Accepted to SIGGRAPH Conference Papers 2026 Wei Cao Hao Zhang Fengrui Tian Yulun Wu Yingying Li Shenlong Wang Ning Yu Yaoyao Liu 10.1145/3799902.3811122 http://arxiv.org/abs/2605.19305v1 Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes 2026-05-19T03:33:52Z

This paper tackles the task of learning to generate signals over triangle meshes in a triangulation-agnostic manner, meaning the trained model can be applied to different meshes and triangulations effectively. Practically, the paper adapts the flow matching (FM) paradigm to a mesh-based, triangulation-agnostic setting. Theoretically, it proposes a specific noise distribution which is triangulation agnostic, to be used inside the FM model's denoising process. While noise distributions are usually trivial to devise for, e.g., images, devising a triangulation-agnostic distribution proves to be a much more difficult task. We formulate a mathematical definition of triangulation agnosticism of distributions, via their spectrum. We then show that a discretization of a specific Gaussian random field called a Matérn process holds these desired properties, and provides a simple and efficient sampling algorithm. We use it as our noise model, and adapt FM to the triangulation-agnostic setting by using a state-of-the-art approach for learning signals on meshes in the gradient domain -- PoissonNet -- as the denoiser. We conduct experiments on elaborate tasks such as sampling elastic rest states, and generating poses of humanoids. Our method is shown to be capable of producing highly realistic results for meshes of over one million triangles, significantly exceeding the state-of-the-art in quality and diversity.

2026-05-19T03:33:52Z In ACM Transactions on Graphics (SIGGRAPH 2026). Project page: https://matern-fm.github.io/ Tianshu Kuai Arman Maesumi Daniel Ritchie Noam Aigerman http://arxiv.org/abs/2605.19304v1 MMGS: 10$\times$ Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking 2026-05-19T03:33:02Z

While 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it suffers from significant overhead due to massive redundant primitives. Existing compression methods typically rely on local sampling or fixed pruning thresholds, which often struggle to balance redundancy reduction with high-fidelity rendering. To address this, we propose a novel framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our approach integrates three components: (1) we introduce a multi-view 3D Gaussian contribution ranking mechanism that filters primitives using geometric consistency instead of local heuristics; (2) we propose a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) we design an OT-based densification operator that maintains the Gaussian's distributional properties for stable optimization. Our approach achieves state-of-the-art rendering quality with only 10% of the primitives and 10× accelerated training speed compared to vanilla 3DGS.

2026-05-19T03:33:02Z 19 pages Beizhen Zhao Sicheng Yu Ziran Yin Dongxu Shen Hao Wang http://arxiv.org/abs/2605.20274v1 PolycubeNet: A Dual-latent Diffusion Model for Polycube-Based Hexahedral Mesh Generation 2026-05-19T03:27:54Z

Hexahedral meshes are widely used in simulation pipelines, yet automatic generation remains challenging for complex CAD geometries. Polycube-based hexahedral meshing is a representative approach due to its regular, parameterization-friendly structure, but existing polycube construction methods often rely on intricate surface segmentation and local heuristics, which can produce artifacts or fail on difficult shapes. In this paper, we propose an end-to-end framework for polycube generation based on conditional diffusion models. Given an input geometry represented as a point cloud, our method directly produces a corresponding polycube point cloud, eliminating the need for explicit surface segmentation or predefined polycube templates. At the core of our approach is a dual-latent conditional diffusion architecture that confines computationally expensive self-attention operations to a fixed-capacity, low-dimensional latent space. This design effectively decouples computational complexity from the resolution of both the input geometry and the output polycube, thereby avoiding the quadratic cost typical of point cloud self-attention mechanisms while supporting flexible input and output resolutions. To obtain a hexahedral mesh, the generated polycube is aligned to the input shape via rigid and non-rigid point cloud registration to establish surface correspondence, followed by a polycube-to-hex pipeline. We additionally create and release a paired dataset of CAD meshes and their corresponding polycube meshes, together with the core implementation of our model. Experiments show that PolycubeNet generalizes to complex CAD models with arbitrary genus and produces high-quality polycube structures within seconds, improving robustness and efficiency over prior learning-based approaches.

2026-05-19T03:27:54Z Lu He Qitao Deng Junjiang Deng Liangbin Deng Yanjun Liang Wenting Yang Guoqiang Wang Na Lei http://arxiv.org/abs/2605.19200v1 Spatially Accelerated Winding Numbers for Curved Geometry 2026-05-18T23:58:02Z

The generalized winding number (GWN) is a scalar field that supports robust containment queries on curved geometry, including non-watertight, overlapping, and nested boundary representations. While queries can be easily parallelized over samples, direct evaluation on parametric curves and surfaces remains costly for large and complex models. Fast, state-of-the-art GWN approaches leverage a spatial index to approximate the GWN, typically coupled with a Taylor expansion which approximates the GWN contribution for far clusters of geometric primitives. However, such methods operate only on discrete inputs such as triangle meshes and point clouds, and would introduce containment errors near boundaries if applied to curved input. We extend support for fast GWN evaluation over arbitrary collections of NURBS curves in 2D and trimmed NURBS patches in 3D via a Bounding Volume Hierarchy that stores efficiently precomputed moment data in the hierarchy nodes. When querying the hierarchy, approximations for far clusters are used alongside direct evaluation for nearby NURBS primitives, achieving sub-linear complexity while preserving the geometric features in the vicinity of the query point. Central to our performance improvements is an adaptive subdivision strategy for NURBS primitives during a preprocessing phase, creating better spatial partitions while retaining the same accuracy for containment decisions as a direct evaluation. We demonstrate the performance and accuracy of our approach across a large collection of 2D and 3D datasets.
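To make the containment query concrete, here is a minimal sketch of the generalized winding number for a 2D polyline, computed by direct summation of the signed angle each segment subtends at the query point. This is the discrete, brute-force case only: it does not handle NURBS input and uses no hierarchy, moments, or far-field approximation. The function name and the test shapes are illustrative assumptions.

```python
import math


def winding_number(query, polyline):
    """Direct generalized winding number of a 2D polyline at `query`.

    Sums the signed angle subtended by each segment and divides by
    2*pi. The polyline may be open (non-watertight); in that case the
    result is fractional rather than an integer.
    """
    qx, qy = query
    total = 0.0
    for (ax, ay), (bx, by) in zip(polyline, polyline[1:]):
        ux, uy = ax - qx, ay - qy          # vector to segment start
        vx, vy = bx - qx, by - qy          # vector to segment end
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        total += math.atan2(cross, dot)    # signed angle in (-pi, pi]
    return total / (2.0 * math.pi)


# Counter-clockwise unit square, closed by repeating the first vertex.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
inside = winding_number((0.5, 0.5), square)
outside = winding_number((2.0, 0.5), square)

# Drop the closing edge to imitate a non-watertight boundary.
open_square = square[:-1]
leaky = winding_number((0.5, 0.5), open_square)
```

For the closed square the value is 1 inside and 0 outside. With one edge missing, the centre still evaluates to 0.75, so thresholding at 0.5 continues to classify it as inside, which is the robustness property the abstract refers to. The cost of this direct sum grows linearly with the number of segments per query, which is the expense that spatial acceleration targets.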

2026-05-18T23:58:02Z 15 pages, 17 figures Jacob Spainhour Brad Whitlock Kenneth Weiss http://arxiv.org/abs/2605.18735v1 PIXLRelight: Controllable Relighting via Intrinsic Conditioning 2026-05-18T17:55:03Z

We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse and forward rendering, or require costly per-image optimization. Our key idea is to bridge physically based rendering (PBR) and learned image synthesis through a shared intrinsic conditioning that can be obtained from either real photographs or PBR renders. At training time, paired multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals, which condition the model. At inference time, the same conditioning is computed from a path-traced render of a coarse 3D reconstruction of the input under user-specified PBR lights. A transformer-based neural renderer then applies the target illumination to the source photograph, preserving fine image detail through a per-pixel affine modulation. PIXLRelight enables arbitrary PBR-style lighting control, achieves state-of-the-art relighting quality, and runs in under a tenth of a second per image. Code and models are available at https://mlfarinha.github.io/pixl-relight/.

2026-05-18T17:55:03Z Project page: https://mlfarinha.github.io/pixl-relight/. Under review Miguel Farinha Ronald Clark