http://arxiv.org/abs/2510.13794v4 MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control 2026-01-18T17:46:05Z

MimicKit is an open-source framework for training motion controllers using motion imitation and reinforcement learning. The codebase provides implementations of commonly-used motion-imitation techniques and RL algorithms. This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures. The codebase is designed to be modular and easily configurable, enabling convenient modification and extension to new characters and tasks. The open-source codebase is available at: https://github.com/xbpeng/MimicKit.

2025-10-15T17:51:42Z Xue Bin Peng http://arxiv.org/abs/2601.12481v1 NeuralFur: Animal Fur Reconstruction From Multi-View Images 2026-01-18T16:46:38Z

Reconstructing realistic animal fur geometry from images is a challenging task due to the fine-scale details, self-occlusion, and view-dependent appearance of fur. In contrast to human hairstyle reconstruction, there are also no datasets that can be leveraged to learn a fur prior for different animals. In this work, we present a first multi-view-based method for high-fidelity 3D fur modeling of animals using a strand-based representation, leveraging the general knowledge of a vision language model. Given multi-view RGB images, we first reconstruct a coarse surface geometry using traditional multi-view stereo techniques. We then use a vision language model (VLM) system to retrieve information about the realistic length structure of the fur for each part of the body. We use this knowledge to construct the animal's furless geometry and grow strands atop it. The fur reconstruction is supervised with both geometric and photometric losses computed from multi-view images. To mitigate orientation ambiguities stemming from the Gabor filters that are applied to the input images, we additionally utilize the VLM to guide the strands' growth direction and their relation to the gravity vector that we incorporate as a loss. With this new schema of using a VLM to guide 3D reconstruction from multi-view inputs, we show generalization across a variety of animals with different fur types. For additional results and code, please refer to https://neuralfur.is.tue.mpg.de.

2026-01-18T16:46:38Z For additional results and code, please refer to https://neuralfur.is.tue.mpg.de Vanessa Sklyarova Berna Kabadayi Anastasios Yiannakidis Giorgio Becherini Michael J. Black Justus Thies http://arxiv.org/abs/2603.05511v1 An Embodied Companion for Visual Storytelling 2026-01-18T08:15:16Z

As artificial intelligence shifts from a pure tool for delegation toward agentic collaboration, its use in the arts can shift beyond the exploration of machine autonomy toward synergistic co-creation. While our earlier robotic works utilized automation to distance the artist's intent from the final mark, we present Companion: an artistic apparatus that integrates a drawing robot with Large Language Models (LLMs) to re-center human-machine presence. By leveraging in-context learning and real-time tool use, the system engages in bidirectional interaction via speech and sketching. This approach transforms the robot from a passive executor into a playful co-creative partner capable of driving shared visual storytelling into unexpected aesthetic territories. To validate this collaborative shift, we employed the Consensual Assessment Technique (CAT) with a panel of seven art-world experts. Results confirm that the system produces works with a distinct aesthetic identity and professional exhibition merit, demonstrating the potential of AI as a highly capable artistic collaborator.

2026-01-18T08:15:16Z 35 pages, 18 figures Patrick Tresset Markus Wulfmeier http://arxiv.org/abs/2601.12257v1 Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy 2026-01-18T04:40:00Z

Conventional imaging requires a line of sight to create accurate visual representations of a scene. In certain circumstances, however, obtaining a suitable line of sight may be impractical, dangerous, or even impossible. Non-line-of-sight (NLOS) imaging addresses this challenge by reconstructing the scene from indirect measurements. Recently, passive NLOS methods that use an ordinary photograph of the subtle shadow cast onto a visible wall by the hidden scene have gained interest. These methods are currently limited to 1D or low-resolution 2D color imaging or to localizing a hidden object whose shape is approximately known. Here, we generalize this class of methods and demonstrate a 3D reconstruction of a hidden scene from an ordinary NLOS photograph. To achieve this, we propose a novel reformulation of the light transport model that conveniently decomposes the hidden scene into light-occluding and non-light-occluding components to yield a separable non-linear least squares (SNLLS) inverse problem. We develop two solutions: a gradient-based optimization method and a physics-inspired neural network approach, which we call Soft Shadow Diffusion (SSD). Despite the challenging ill-conditioned inverse problem encountered here, our approaches are effective on numerous 3D scenes in real experimental scenarios. Moreover, SSD is trained in simulation but generalizes well to unseen classes in simulation and real-world NLOS scenes. SSD also shows surprising robustness to noise and ambient illumination.

2026-01-18T04:40:00Z European Conference on Computer Vision (ECCV 2024) Fadlullah Raji John Murray-Bruce http://arxiv.org/abs/2601.12234v1 Proc3D: Procedural 3D Generation and Parametric Editing of 3D Shapes with Large Language Models 2026-01-18T03:08:08Z

Generating 3D models has traditionally been a complex task requiring specialized expertise. While recent advances in generative AI have sought to automate this process, existing methods produce non-editable representations, such as meshes or point clouds, limiting their adaptability for iterative design. In this paper, we introduce Proc3D, a system designed to generate editable 3D models while enabling real-time modifications. At its core, Proc3D introduces the procedural compact graph (PCG), a graph representation of 3D models that encodes the algorithmic rules and structures necessary for generating the model. This representation exposes key parameters, allowing intuitive manual adjustments via sliders and checkboxes, as well as real-time, automated modifications through natural language prompts using Large Language Models (LLMs). We demonstrate Proc3D's capabilities using two generative approaches: GPT-4o with in-context learning (ICL) and a fine-tuned LLAMA-3 model. Experimental results show that Proc3D outperforms existing methods in editing efficiency, achieving more than 400x speedup over conventional approaches that require full regeneration for each modification. Additionally, Proc3D improves ULIP scores by 28%, a metric that evaluates the alignment between generated 3D models and text prompts. By enabling text-aligned 3D model generation along with precise, real-time parametric edits, Proc3D facilitates highly accurate text-based image editing applications.

2026-01-18T03:08:08Z Fadlullah Raji Stefano Petrangeli Matheus Gadelha Yu Shen Uttaran Bhattacharya Gang Wu http://arxiv.org/abs/2406.03599v2 Hi5: Synthetic Data for Inclusive, Robust, Hand Pose Estimation 2026-01-16T21:07:53Z

Hand pose estimation plays a vital role in capturing subtle nonverbal cues essential for understanding human affect. However, collecting diverse, expressive real-world data remains challenging due to labor-intensive manual annotation that often underrepresents demographic diversity and natural expressions. To address this issue, we introduce a cost-effective approach to generating synthetic data using high-fidelity 3D hand models and a wide range of affective hand poses. Our method includes varied skin tones, genders, dynamic environments, realistic lighting conditions, and diverse naturally occurring gesture animations. The resulting dataset, Hi5, contains 583,000 pose-annotated images, carefully balanced to reflect natural diversity and emotional expressiveness. Models trained exclusively on Hi5 achieve performance comparable to human-annotated datasets, exhibiting superior robustness to occlusions and consistent accuracy across diverse skin tones -- which is crucial for reliably recognizing expressive gestures in affective computing applications. Our results demonstrate that synthetic data effectively addresses critical limitations of existing datasets, enabling more inclusive, expressive, and reliable gesture recognition systems while achieving competitive performance in pose estimation benchmarks. The Hi5 dataset, data synthesis pipeline, source code, and game engine project are publicly released to support further research in synthetic hand-gesture applications.

2024-06-05T19:45:10Z Masum Hasan Cengiz Ozel Nina Long Alexander Martin Samuel Potter Tariq Adnan Sangwu Lee Ehsan Hoque http://arxiv.org/abs/2601.03869v2 Bayesian Monocular Depth Refinement via Neural Radiance Fields 2026-01-15T01:46:55Z

Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate improvements on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
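The abstract does not state the exact fusion rule, so the following is only a minimal sketch of one standard form of Bayesian fusion. It assumes that each pixel's monocular and NeRF depths are independent Gaussian estimates with known variances, in which case the posterior is the inverse-variance weighted average. The function name `fuse_depths` and its arguments are hypothetical and do not come from the paper.

```python
import numpy as np

def fuse_depths(d_mono, var_mono, d_nerf, var_nerf):
    """Fuse two per-pixel depth maps, each modeled as an independent Gaussian.

    Assumption (not taken from the paper): the product of the two Gaussians
    gives the posterior, so precisions (1/variance) add and the fused depth
    is the precision-weighted mean of the two inputs.
    """
    d_mono = np.asarray(d_mono, dtype=float)
    var_mono = np.asarray(var_mono, dtype=float)
    d_nerf = np.asarray(d_nerf, dtype=float)
    var_nerf = np.asarray(var_nerf, dtype=float)

    precision = 1.0 / var_mono + 1.0 / var_nerf
    fused_var = 1.0 / precision
    fused_depth = fused_var * (d_mono / var_mono + d_nerf / var_nerf)
    return fused_depth, fused_var

if __name__ == "__main__":
    # One pixel: the NeRF estimate is four times more certain than the prior,
    # so the fused depth is pulled toward it.
    depth, var = fuse_depths([2.0], [1.0], [4.0], [0.25])
    print(depth, var)
```

Under this assumption, pixels where the NeRF uncertainty is low take most of their value from the NeRF depth, while high-uncertainty pixels fall back to the monocular prior. This matches the division of labor the abstract describes, with fine detail coming from the NeRF and global structure from the monocular estimate.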

2026-01-07T12:32:39Z IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025) Proc. IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI), pp. 488-492, 2025 Arun Muthukkumar 10.1109/ACAI68217.2025.11406626 http://arxiv.org/abs/2512.23696v2 OpenPBR: Novel Features and Implementation Details 2026-01-14T13:18:48Z

OpenPBR is a physically based, standardized uber-shader developed for interoperable material authoring and rendering across VFX, animation, and design visualization workflows. This document serves as a companion to the official specification, offering deeper insight into the model's development and more detailed implementation guidance, including code examples and mathematical derivations. We begin with a description of the model's formal structure and theoretical foundations - covering slab-based layering, statistical mixing, and microfacet theory - before turning to its physical components. These include metallic, dielectric, subsurface, and glossy-diffuse base substrates, followed by thin-film iridescence, coat, and fuzz layers. A special-case mode for rendering thin-walled objects is also described. Additional sections explore technical topics in greater depth, such as the decoupling of specular reflectivity from transmission, the choice of parameterization for subsurface scattering, and the detailed physics of coat darkening and thin-film interference. We also discuss planned extensions, including hazy specular reflection and retroreflection.

2025-12-29T18:53:00Z Part of Physically Based Shading in Theory and Practice, SIGGRAPH 2025 Course Jamie Portsmouth Peter Kutz Stephen Hill 10.1145/3721241.3733991 http://arxiv.org/abs/2601.09428v1 Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps 2026-01-14T12:17:34Z

We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer's input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition, we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.

2026-01-14T12:17:34Z Siyi Li Joseph G. Lambourne Longfei Zhang Pradeep Kumar Jayaraman Karl. D. D. Willis http://arxiv.org/abs/2601.09417v1 Variable Basis Mapping for Real-Time Volumetric Visualization 2026-01-14T12:11:14Z

Real-time visualization of large-scale volumetric data remains challenging, as direct volume rendering and voxel-based methods suffer from prohibitively high computational cost. We propose Variable Basis Mapping (VBM), a framework that transforms volumetric fields into 3D Gaussian Splatting (3DGS) representations through wavelet-domain analysis. First, we precompute a compact Wavelet-to-Gaussian Transition Bank that provides optimal Gaussian surrogates for canonical wavelet atoms across multiple scales. Second, we perform analytical Gaussian construction that maps discrete wavelet coefficients directly to 3DGS parameters using a closed-form, mathematically principled rule. Finally, a lightweight image-space fine-tuning stage further refines the representation to improve rendering fidelity. Experiments on diverse datasets demonstrate that VBM significantly accelerates convergence and enhances rendering quality, enabling real-time volumetric visualization.

2026-01-14T12:11:14Z 11 pages. Under review Qibiao Li Yuxuan Wang Youcheng Cai Huangsheng Du Ligang Liu http://arxiv.org/abs/2603.29569v1 AdaptDiff: Adaptive Guidance in Diffusion Models for Diverse and Identity-Consistent Face Synthesis (Student Abstract) 2026-01-14T11:03:51Z

Diffusion models conditioned on identity embeddings enable the generation of synthetic face images that consistently preserve identity across multiple samples. Recent work has shown that introducing an additional negative condition through classifier-free guidance during sampling provides a mechanism to suppress undesired attributes, thus improving inter-class separability. Building on this insight, we propose a dynamic weighting scheme for the negative condition that adapts throughout the sampling trajectory. This strategy leverages the complementary strengths of positive and negative conditions at different stages of generation, leading to more diverse yet identity-consistent synthetic data.
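As a rough illustration of a time-varying negative condition, the sketch below combines an unconditional, a positive, and a negative noise prediction with a weight that changes over the sampling steps. Both the combination rule and the linear schedule are assumptions made for illustration only. The abstract does not give the actual formula or schedule, and the names `negative_weight` and `guided_noise` are hypothetical.

```python
import numpy as np

def negative_weight(step, num_steps, w_start=0.0, w_end=1.0):
    """Hypothetical linear schedule for the negative-condition weight.

    step runs from 0 (first sampling step) to num_steps - 1 (last step).
    """
    frac = step / max(num_steps - 1, 1)
    return w_start + frac * (w_end - w_start)

def guided_noise(eps_uncond, eps_pos, eps_neg, w_pos, w_neg):
    """Classifier-free guidance with an extra negative term (assumed form).

    Moves the prediction toward the positive condition and away from the
    negative one, both measured relative to the unconditional prediction.
    With w_neg = 0 this reduces to standard classifier-free guidance.
    """
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_u, eps_p, eps_n = rng.normal(size=(3, 4))  # stand-ins for model outputs
    num_steps = 5
    for step in range(num_steps):
        w_neg = negative_weight(step, num_steps)
        eps = guided_noise(eps_u, eps_p, eps_n, w_pos=2.0, w_neg=w_neg)
        print(step, round(w_neg, 2), eps.shape)
```

In a real sampler the three predictions would come from the denoising network at each step. The only point here is that `w_neg` becomes a function of the step rather than a constant.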

2026-01-14T11:03:51Z Accepted at AAAI 2026 Student Abstract and Poster Program Eduarda Caldeira Tahar Chettaoui Naser Damer Fadi Boutros http://arxiv.org/abs/2508.13990v2 Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models 2026-01-14T09:53:51Z

Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
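A linear projection of a Gaussian mixture is again a Gaussian mixture, with each component's mean and covariance mapped through the projection. The sketch below uses that fact for a single GMM. It computes the mixture's overall mean and covariance, takes the leading eigenvectors as projection axes, and projects every component. This is a simplified illustration, not the paper's formulation: the handling of several distributions and user-defined weights is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def gmm_moments(weights, means, covs):
    """Overall mean and covariance of a Gaussian mixture."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = np.asarray(means, dtype=float)   # shape (K, D)
    S = np.asarray(covs, dtype=float)     # shape (K, D, D)
    mean = w @ mu
    diff = mu - mean
    # Within-component covariance plus spread of the component means.
    cov = np.einsum("k,kij->ij", w, S) + np.einsum("k,ki,kj->ij", w, diff, diff)
    return mean, cov

def project_gmm(weights, means, covs, n_components=2):
    """Project each mixture component onto the top principal axes."""
    mu = np.asarray(means, dtype=float)
    S = np.asarray(covs, dtype=float)
    mean, cov = gmm_moments(weights, mu, S)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest first
    P = eigvecs[:, order]                             # shape (D, d)
    proj_means = (mu - mean) @ P                      # shape (K, d)
    proj_covs = np.einsum("di,kde,ej->kij", P, S, P)  # P^T S_k P per component
    return proj_means, proj_covs, P

if __name__ == "__main__":
    weights = [0.5, 0.5]
    means = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    covs = [0.1 * np.eye(3), 0.1 * np.eye(3)]
    proj_means, proj_covs, P = project_gmm(weights, means, covs, n_components=2)
    print(proj_means.shape, proj_covs.shape, P.shape)
```

Because the components are projected individually rather than collapsed to one Gaussian, a bimodal input stays bimodal in the low-dimensional view, which is the kind of detail the abstract says a single-Gaussian mapping loses.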

2025-08-19T16:31:41Z 10 pages, 6 figures Daniel Klötzl Ozan Tastekin David Hägele Marina Evers Daniel Weiskopf 10.1109/UncertaintyVisualization68947.2025.00010 http://arxiv.org/abs/2506.10035v3 FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training 2026-01-13T18:20:18Z

Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20% of the hierarchy pruned. Our code will be available soon.

2025-06-10T20:48:30Z 14 pages Fuhan Cai Yong Guo Jie Li Wenbo Li Jian Chen Xiangzhong Fang http://arxiv.org/abs/2507.17336v3 Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting 2026-01-13T17:15:56Z

Dynamic 4D Gaussian Splatting (4DGS) effectively extends the high-speed rendering capabilities of 3D Gaussian Splatting (3DGS) to represent volumetric videos. However, the large number of Gaussians, substantial temporal redundancies, and especially the absence of an entropy-aware compression framework result in large storage requirements. Consequently, this poses significant challenges for practical deployment, efficient edge-device processing, and data transmission. In this paper, we introduce a novel end-to-end RD-optimized compression framework tailored for 4DGS, aiming to enable flexible, high-fidelity rendering across varied computational platforms. Leveraging Fully Explicit Dynamic Gaussian Splatting (Ex4DGS), one of the state-of-the-art 4DGS methods, as our baseline, we start from the existing 3DGS compression methods for compatibility while effectively addressing additional challenges introduced by the temporal axis. In particular, instead of storing motion trajectories independently per point, we employ a wavelet transform to reflect the real-world smoothness prior, significantly enhancing storage efficiency. This approach yields significantly improved compression ratios and provides a user-controlled balance between compression efficiency and rendering quality. Extensive experiments demonstrate the effectiveness of our method, achieving up to 91x compression compared to the original Ex4DGS model while maintaining high visual fidelity. These results highlight the applicability of our framework for real-time dynamic scene rendering in diverse scenarios, from resource-constrained edge devices to high-performance environments. The source code is available at https://github.com/HyeongminLEE/RD4DGS.
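To illustrate why a wavelet transform helps with smooth motion, the sketch below applies an orthonormal Haar transform along the time axis of a trajectory and zeroes small detail coefficients. The choice of Haar, the hard threshold, and all function names are assumptions for illustration. The abstract does not name the wavelet family, and the paper's rate-distortion optimization and entropy coding are not modeled here.

```python
import numpy as np

def haar_forward(traj):
    """Full Haar decomposition along axis 0 (length must be a power of two).

    Returns the coarsest approximation and a list of detail bands,
    finest band first.
    """
    approx = np.asarray(traj, dtype=float).copy()
    n = approx.shape[0]
    if n < 1 or n & (n - 1):
        raise ValueError("trajectory length must be a power of two")
    details = []
    while approx.shape[0] > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest band upward."""
    x = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty((2 * x.shape[0],) + x.shape[1:])
        out[0::2] = (x + d) / np.sqrt(2.0)
        out[1::2] = (x - d) / np.sqrt(2.0)
        x = out
    return x

def compress_trajectory(traj, threshold):
    """Zero detail coefficients below threshold; return reconstruction and
    the number of nonzero coefficients that would need to be stored."""
    approx, details = haar_forward(traj)
    kept = [np.where(np.abs(d) >= threshold, d, 0.0) for d in details]
    stored = int(np.count_nonzero(approx))
    stored += sum(int(np.count_nonzero(d)) for d in kept)
    return haar_inverse(approx, kept), stored

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 64)
    # Smooth 3D trajectory of one point over 64 frames.
    traj = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t], axis=1)
    recon, stored = compress_trajectory(traj, threshold=0.05)
    print("stored coefficients:", stored, "of", traj.size)
    print("max abs error:", float(np.max(np.abs(recon - traj))))
```

For a smooth trajectory most fine-scale detail coefficients are small, so far fewer values than the raw per-frame positions carry the motion. The threshold acts as a crude stand-in for the user-controlled trade-off between compression and quality mentioned in the abstract.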

2025-07-23T09:05:13Z 24 pages, 10 figures, NeurIPS 2025 Hyeongmin Lee Kyungjune Baek http://arxiv.org/abs/2601.08429v1 Deep Learning Based Facial Retargeting Using Local Patches 2026-01-13T10:56:15Z

In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character's facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch-based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re-enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame for the creation of a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion.

2026-01-13T10:56:15Z Eurographics 25 Computer Graphics Forum 2024 Yeonsoo Choi Inyup Lee Sihun Cha Seonghyeon Kim Sunjin Jung Junyong Noh 10.1111/cgf.15263