http://arxiv.org/abs/2604.22984v1 BrickNet: Graph-Backed Generative Brick Assembly 2026-04-24T19:55:51Z

We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet
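As a rough illustration of parametrizing structure through connectivity rather than absolute pose, the sketch below derives grid positions from a list of relative connections. It is not the BrickNet representation: the edge format, the integer stud-grid offsets, the part names, and the omission of rotation are all assumptions made for this example.

```python
from collections import deque

def resolve_positions(root, edges):
    """Derive absolute grid positions from relative connections.

    edges: list of (parent, child, (dx, dy, dz)), meaning the child sits at
    the parent's position plus the offset. Hypothetical format; no rotation.
    """
    children = {}
    for parent, child, offset in edges:
        children.setdefault(parent, []).append((child, offset))

    positions = {root: (0, 0, 0)}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        px, py, pz = positions[node]
        for child, (dx, dy, dz) in children.get(node, []):
            if child in positions:
                continue  # already placed through another connection
            positions[child] = (px + dx, py + dy, pz + dz)
            queue.append(child)
    return positions

# Two connections: brick_a on top of base, brick_b on top of brick_a, shifted by one stud.
edges = [("base", "brick_a", (0, 0, 1)), ("brick_a", "brick_b", (1, 0, 1))]
positions = resolve_positions("base", edges)
```

Because every position is computed from a connection, a part can never float free of the structure, which is one intuition for why a connectivity-first encoding may stay valid over longer sequences than direct pose prediction.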

2026-04-24T19:55:51Z CVPR 2026; project page: https://kulits.github.io/BrickNet Peter Kulits Cordelia Schmid http://arxiv.org/abs/2604.21749v2 CuRast: Cuda-Based Software Rasterization for Billions of Triangles 2026-04-24T12:39:18Z

Previous work shows that small triangles can be rasterized efficiently with compute shaders. Building on this insight, we explore how far this can be pushed for massive triangle datasets without the need to construct acceleration structures in advance. Method: A 3-stage rasterization pipeline first rasterizes small triangles directly in stage 1, using atomicMin to store the closest fragments. Larger triangles are forwarded to stages 2 and 3. Results: CuRast can render models with hundreds of millions of triangles up to 2-5x (unique) or up to 12x (instanced) faster than Vulkan. Vulkan remains an order of magnitude faster for low-poly meshes. Limitations: We currently focus on dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction. Blending/Transparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently. Future Work: To make it suitable for games and a wider range of use cases, future work will need to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, and (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast. Source Code: https://github.com/m-schuetz/CuRast
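The idea of keeping the closest fragment with a single min operation can be sketched on the CPU. The example below packs depth into the high bits and a triangle id into the low bits of one 64-bit key, so that taking the minimum selects the nearest fragment. The packing layout, the helper names, and the use of Python's `min` as a stand-in for a GPU atomicMin are assumptions for illustration, not CuRast's confirmed data layout.

```python
import struct

EMPTY = (1 << 64) - 1  # sentinel: no fragment written yet

def pack_fragment(depth, triangle_id):
    """Pack a non-negative float32 depth and a 32-bit id into one 64-bit key.

    For non-negative IEEE 754 floats, the raw bit pattern orders the same way
    as the numeric value, so comparing keys compares depth first.
    """
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (depth_bits << 32) | (triangle_id & 0xFFFFFFFF)

def write_fragment(framebuffer, pixel, depth, triangle_id):
    # Stand-in for an atomicMin on the pixel's 64-bit slot.
    framebuffer[pixel] = min(framebuffer[pixel], pack_fragment(depth, triangle_id))

def unpack_fragment(key):
    depth = struct.unpack("<f", struct.pack("<I", key >> 32))[0]
    return depth, key & 0xFFFFFFFF

framebuffer = [EMPTY] * 4
write_fragment(framebuffer, 2, 0.75, 10)
write_fragment(framebuffer, 2, 0.25, 11)  # closest fragment for pixel 2
write_fragment(framebuffer, 2, 0.5, 12)
closest = unpack_fragment(framebuffer[2])
```

The order in which fragments arrive does not matter, which is what makes this pattern attractive when many threads write to the same pixel concurrently.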

2026-04-23T14:57:24Z Markus Schütz Lukas Lipp Elias Kristmann Michael Wimmer http://arxiv.org/abs/2512.22502v3 Topology-Preserving Scalar Field Optimization for Boundary-Conforming Spiral Toolpaths on Multiply Connected Freeform Surfaces 2026-04-24T11:22:10Z

Multiply connected freeform surface features are widely encountered in industrial components, where toolpath generation often suffers from discontinuities, sharp turns, non-uniform scallop heights, and incomplete boundary coverage. This paper proposes a scalar-field variational optimization method for milling that produces continuous, boundary-conforming, and non-self-intersecting toolpaths with smoother transitions, more uniform spacing, and reduced redundant path length. A feasible singularity-free initial scalar field with boundary-conforming iso-level sets is first constructed via conformal slit mapping. The optimization is then reformulated as a topology-preserving mesh deformation process governed by boundary-synchronous updates, whereby the continuity, boundary-conformity, and non-self-intersection requirements of the toolpath are converted into mesh-shape constraints maintained throughout the iterative optimization. As a result, the proposed method achieves globally optimized path spacing and improved scallop-height uniformity while preserving trajectory smoothness. Milling experiments show that, compared with a state-of-the-art conformal slit mapping-based method, the proposed approach improves machining efficiency by 14.24%, enhances scallop-height uniformity by 5.70%, and reduces milling impact-induced vibrations by over 10%. The proposed strategy provides an effective solution for high-performance machining of complex multiply connected freeform components.

2025-12-27T07:05:51Z Reorganized the manuscript and added more detailed explanations of the workflow and multiple case studies Shen Changqing Xu Bingzhou Qi Bosong Zhang Xiaojian Yan Sijie Ding Han http://arxiv.org/abs/2603.19500v2 Teaching an Agent to Sketch One Part at a Time 2026-04-23T23:00:01Z

We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning procedure following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.

2026-03-19T22:08:53Z Xiaodan Du Ruize Xu David Yunis Yael Vinker Greg Shakhnarovich http://arxiv.org/abs/2604.21931v1 Seeing Fast and Slow: Learning the Flow of Time in Videos 2026-04-23T17:59:57Z

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

2026-04-23T17:59:57Z Project page: https://seeing-fast-and-slow.github.io/ Yen-Siang Wu Rundong Luo Jingsen Zhu Tao Tu Ali Farhadi Matthew Wallingford Yu-Chiang Frank Wang Steve Marschner Wei-Chiu Ma http://arxiv.org/abs/2604.21810v1 Multiscale Super Resolution without Image Priors 2026-04-23T16:05:29Z

We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens). We show that images acquired with pairwise coprime pixel sizes lead to a system with a stable inverse, and furthermore, that super-resolution images can be reconstructed efficiently using Fourier domain techniques or iterative least squares methods. Our mathematical analysis provides an expression for the expected error of the least squares reconstruction for large signals assuming i.i.d. noise that elucidates the noise-resolution tradeoff. These results are validated through both one- and two-dimensional experiments that leverage charge-coupled device (CCD) hardware binning to explore reconstructions over a large range of effective pixel sizes. Finally, two-dimensional reconstructions for a series of targets are used to demonstrate the advantages of multiscale super-resolution, and implications of these results for common imaging systems are discussed.
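One way to see why coprime pixel sizes help is a simplified 1D model, assumed here purely for illustration: a periodic signal of length n, with each pixel size modeled as a box filter observed at every integer shift. A box of width p then loses exactly the frequencies where its DFT vanishes, and the combined system is invertible only if the two boxes share no such frequency. The function names and the circular-shift model are assumptions of this sketch, not the paper's formulation.

```python
import cmath

def box_nulls(n, width, tol=1e-9):
    """Frequencies k in 1..n-1 where a width-`width` box filter's DFT vanishes."""
    nulls = []
    for k in range(1, n):
        response = sum(cmath.exp(-2j * cmath.pi * k * m / n) for m in range(width))
        if abs(response) < tol:
            nulls.append(k)
    return nulls

def common_nulls(n, p, q):
    """Frequencies lost by both pixel sizes; empty means nothing is lost jointly."""
    return sorted(set(box_nulls(n, p)) & set(box_nulls(n, q)))

coprime_case = common_nulls(12, 2, 3)      # pixel sizes 2 and 3 share no null
non_coprime_case = common_nulls(12, 2, 4)  # pixel sizes 2 and 4 both lose k = 6
```

In this model a frequency k is lost by width p exactly when p·k is a nonzero multiple of n; if p and q are coprime, no k in 1..n-1 can satisfy that condition for both widths at once.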

2026-04-23T16:05:29Z Daniel Fu Gabby Litterio Pedro Felzenszwalb Rashid Zia http://arxiv.org/abs/2604.21689v1 StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition 2026-04-23T13:55:22Z

Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist-drawn portraits. All of our datasets, code, and pretrained models are publicly available at https://kwanyun.github.io/StyleID_page/

2026-04-23T13:55:22Z SIGGRAPH 2026 / ACM TOG. Project page at https://kwanyun.github.io/StyleID_page/ Kwan Yun Changmin Lee Ayeong Jeong Youngseo Kim Seungmi Lee Junyong Noh http://arxiv.org/abs/2604.21575v1 OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction 2026-04-23T11:55:19Z

Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.

2026-04-23T11:55:19Z Project Page: https://zcai0612.github.io/OmniFit/ Zeyu Cai Yuliang Xiu Renke Wang Zhijing Shao Xiaoben Li Siyuan Yu Chao Xu Yang Liu Baigui Sun Jian Yang Zhenyu Zhang http://arxiv.org/abs/2512.06834v2 COIVis: Eye-tracking-based Visual Exploration of Concept Learning in MOOC Videos 2026-04-23T08:15:52Z

Massive Open Online Courses (MOOCs) make high-quality instruction accessible. However, the lack of face-to-face interaction makes it difficult for instructors to obtain feedback on learners' performance and provide more effective instructional guidance. Traditional analytical approaches, such as clickstream logs or quiz scores, capture only coarse-grained learning outcomes and offer limited insight into learners' moment-to-moment cognitive states. In this study, we propose COIVis, an eye-tracking-based visual analytics system that supports concept-level exploration of learning processes in MOOC videos. COIVis first extracts course concepts from multimodal video content and aligns them with the temporal structure and screen space of the lecture, defining Concepts of Interest (COIs), which anchor abstract concepts to specific spatiotemporal regions. Learners' gaze trajectories are transformed into COI sequences, and five interpretable learner-state features -- Attention, Cognitive Load, Interest, Preference, and Synchronicity -- are computed at the COI level based on eye-tracking metrics. Building on these representations, COIVis provides a narrative, multi-view visualization enabling instructors to move from cohort-level overviews to individual learning paths, quickly locate problematic concepts, and compare diverse learning strategies. We evaluate COIVis through two case studies and in-depth user-feedback interviews. The results demonstrate that COIVis effectively provides instructors with valuable insights into the consistency and anomalies of learning patterns, thereby supporting timely and personalized interventions for learners and optimizing instructional design.

2025-12-07T13:14:49Z 17 pages, 8 figures Zhiguang Zhou Ruiqi Yu Yuming Ma Hao Ni Guojun Li Li Ye Xiaoying Wang Yize Li Yigang Wang Yong Wang http://arxiv.org/abs/2601.05127v2 LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization 2026-04-23T07:24:19Z

Recent diffusion-based image editing methods commonly rely on text or high-level instructions to guide the generation process, offering intuitive but coarse control. In contrast, we focus on explicit, prompt-free editing, where the user directly specifies the modification by cropping and pasting an object or sub-object into a chosen location within an image. This operation affords precise spatial and visual control, yet it introduces a fundamental challenge: preserving the identity of the pasted object while harmonizing it with its new context. We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE, a saliency-guided modulation of rotational positional encoding (RoPE) that loosens the positional constraints to continuously control the attention field of view. By relaxing RoPE in this manner, our method smoothly steers the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object, enabling a balanced trade-off between identity retention and contextual blending. Our approach provides a flexible and intuitive framework for image editing, achieving seamless compositional results without textual descriptions or complex user input.

2026-01-08T17:17:47Z Accepted to SIGGRAPH 2026. Project Page: https://snap-research.github.io/LooseRoPE/ Etai Sella Yoav Baron Hadar Averbuch-Elor Daniel Cohen-Or Or Patashnik http://arxiv.org/abs/2604.19192v2 Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images 2026-04-23T06:35:35Z

We present an approach for enhancing non-playable characters (NPCs) in games by combining large language models (LLMs) with computer vision to provide contextual awareness of their surroundings. Conventional NPCs typically rely on pre-scripted dialogue and lack spatial understanding, which limits their responsiveness to player actions and reduces overall immersion. Our method addresses these limitations by capturing panoramic images of an NPC's environment and applying semantic segmentation to identify objects and their spatial positions. The extracted information is used to generate a structured JSON representation of the environment, combining object locations derived from segmentation with additional scene graph data within the NPC's bounding sphere, encoded as directional vectors. This representation is provided as input to the LLM, enabling NPCs to incorporate spatial knowledge into player interactions. As a result, NPCs can dynamically reference nearby objects, landmarks, and environmental features, leading to more believable and engaging gameplay. We describe the technical implementation of the system and evaluate it in two stages. First, an expert interview was conducted to gather feedback and identify areas for improvement. After integrating these refinements, a user study was performed, showing that participants preferred the context-aware NPCs over a non-context-aware baseline, confirming the effectiveness of the proposed approach.
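A structured environment description of this kind might look like the sketch below, which encodes each nearby object as a label, a unit direction vector from the NPC, and a distance, then serializes the result to JSON for use in an LLM prompt. The field names, the radius filter, and the rounding are assumptions for illustration; the paper's actual schema is not given in the abstract.

```python
import json
import math

def direction_to(npc_pos, obj_pos):
    """Unit vector pointing from the NPC to the object, rounded for compact JSON."""
    delta = [o - n for n, o in zip(npc_pos, obj_pos)]
    length = math.sqrt(sum(d * d for d in delta))
    if length == 0.0:
        return [0.0, 0.0, 0.0]
    return [round(d / length, 3) for d in delta]

def build_context(npc_pos, objects, radius):
    """Serialize the objects inside the NPC's bounding sphere (hypothetical schema)."""
    entries = []
    for label, pos in objects.items():
        distance = math.dist(npc_pos, pos)
        if distance <= radius:
            entries.append({
                "label": label,
                "direction": direction_to(npc_pos, pos),
                "distance": round(distance, 2),
            })
    return json.dumps({"objects": entries}, indent=2)

scene = {"well": (3.0, 0.0, 4.0), "tower": (100.0, 0.0, 0.0)}
context = build_context((0.0, 0.0, 0.0), scene, radius=10.0)
```

Here only the well falls inside the 10-unit sphere, so the tower is omitted and the prompt stays limited to objects the NPC could plausibly mention.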

2026-04-21T07:59:36Z Grega Radež Ciril Bohak http://arxiv.org/abs/2605.13862v1 Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation 2026-04-22T17:50:03Z

We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0, with substantial improvements across generation fidelity, simulation-ready capabilities, and application coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression and more efficient decoding. For texture and material generation, we replace the cascaded pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic conditioning for improved material precision and visual fidelity. Beyond single-object generation, Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware decomposition, and training-free articulation generation, enabling coherent scene construction and part-level physical interaction across physics and graphics engines. A large-scale human preference study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates of 69.0% to 89.9% in textured 3D asset generation. Seed3D 2.0 is available on https://exp.volcengine.com/ark/vision?_vtm_=0.0.c70961.d701978.0&mode=vision&modelId=doubao-seed3d-2-0-260328&tab=Gen3D

2026-04-22T17:50:03Z Seed3D 2.0 Technical Report; Official Page on https://seed.bytedance.com/seed3d_2_0 Diandian Gu Jing Lin Gaohong Liu Jiahang Liu Su Ma Guang Shi Jun Wang Qinlong Wang Qianyi Wu Zhongcong Xu Xuanyu Yi Zihao Yu Jianfeng Zhang Zhuolin Zheng Yifan Zhu Rui Chen Hengkai Guo Xiaoyang Guo Mingcong Han Xu Han Xiu Li Yixun Liang Weiqiang Lou Junzhe Lu Guan Luo Minghan Qin Shuguang Wang Yuang Wang http://arxiv.org/abs/2604.20759v1 Autark: A Serverless Toolkit for Prototyping Urban Visual Analytics Systems 2026-04-22T16:48:24Z

The development of visual analytics (VA) systems has traditionally been a labor-intensive process, balancing design methodologies with complex software engineering practices. In domain-specific fields like urban VA, this challenge is amplified by heterogeneous data streams and a reliance on complex, multi-service architectures that hinder fast development, deployment, and reproducibility. Despite the richness of the urban VA literature, the field lacks a consolidated toolkit that encapsulates the core components of these systems, such as spatial data management, analytical processing, and visualization, into a unified, lightweight framework. In this paper, we introduce Autark, a serverless toolkit designed for the rapid prototyping of urban VA systems. Autark provides domain-aware abstractions through a self-contained architecture, enabling researchers to transition from design intention to deployed, shareable systems within hours. Furthermore, Autark's structured, tightly scoped interfaces make it well-suited for AI-assisted coding workflows, where LLMs produce more reliable code when composing from well-defined abstractions rather than generating complex solutions from scratch. Our contributions are: (1) the Autark toolkit, a serverless architecture for rapid prototyping of urban VA; (2) a comparative study of LLM coding effectiveness with and without Autark; and (3) a series of usage scenarios demonstrating its capability to streamline the creation of robust, shareable urban VA prototypes. Autark is available at https://autarkjs.org/.

2026-04-22T16:48:24Z Autark is available at https://autarkjs.org/ Lucas Alexandre João Rulff Talisson Souza Gustavo Moreira Daniel de Oliveira Claudio Silva Fabio Miranda Marcos Lage http://arxiv.org/abs/2603.24725v2 Confidence-Based Mesh Extraction from 3D Gaussians 2026-04-22T13:37:13Z

Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.

2026-03-25T18:52:04Z Project Page: https://r4dl.github.io/CoMe/ Lukas Radl Felix Windisch Andreas Kurz Thomas Köhler Michael Steiner Markus Steinberger http://arxiv.org/abs/2604.20539v1 Animator-Centric Skeleton Generation on Objects with Fine-Grained Details 2026-04-22T13:18:22Z

Skeleton generation (SG) is essential for animating 3D assets, but current deep learning methods remain limited: they cannot handle the growing structural complexity of modern models and offer minimal controllability, creating a major bottleneck for real-world animation workflows. To address this, we propose an animator-centric SG framework that achieves high-quality skeleton prediction on complex inputs while providing intuitive control handles. Our contributions are threefold. First, we curate a large-scale dataset of 82,633 rigged meshes with diverse and complicated structures. Second, we introduce a novel semantic-aware tokenization scheme for auto-regressive modeling. This scheme effectively complements purely geometric prior methods by subdividing bones into semantically meaningful groups, thereby enhancing robustness to structural complexity and enabling a key control mechanism. Third, we design a learnable density interval module that allows animators to exert soft, direct control over bone density. Extensive experiments demonstrate that our framework not only generates high-quality skeletons for challenging inputs but also successfully fulfills two critical requirements from professional animators.

2026-04-22T13:18:22Z Accepted by CVPR 2026 Mingze Sun Cheng Zeng Jiansong Pei Junhao Chen Chaoyue Song Shaohui Wang Tianyuan Chang Bin Huang Zijiao Zeng Ruqi Huang