http://arxiv.org/abs/2603.10326v1 FC-4DFS: Frequency-controlled Flexible 4D Facial Expression Synthesizing 2026-03-11T01:51:11Z

4D facial expression synthesis is a critical problem in the fields of computer vision and graphics. Current methods lack flexibility and smoothness when simulating the inter-frame motion of expression sequences. In this paper, we propose a frequency-controlled 4D facial expression synthesizing method, FC-4DFS. Specifically, we introduce a frequency-controlled LSTM network to generate 4D facial expression sequences frame by frame from a given neutral landmark with a given length. Meanwhile, we propose a temporal coherence loss to enhance the perception of temporal sequence motion and improve the accuracy of relative displacements. Furthermore, we design a Multi-level Identity-Aware Displacement Network based on a cross-attention mechanism to reconstruct the 4D facial expression sequences from landmark sequences. Finally, our FC-4DFS achieves flexible, state-of-the-art generation of 4D facial expression sequences with different lengths on the CoMA and Florence4D datasets. The code will be available on GitHub.
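
The abstract does not give the form of the temporal coherence loss. As a rough illustration of what penalizing errors in relative (frame-to-frame) displacements could look like, here is a minimal sketch. It assumes landmark sequences stored as arrays of shape (frames, landmarks, 3) and a mean-squared-error penalty; the function name and both assumptions are hypothetical and not taken from the paper.

```python
import numpy as np

def temporal_displacement_loss(pred_seq, target_seq):
    """Mean squared error between inter-frame displacements.

    Both inputs have shape (frames, landmarks, 3). This is an assumed,
    illustrative form, not the loss defined in the paper.
    """
    pred_seq = np.asarray(pred_seq, dtype=float)
    target_seq = np.asarray(target_seq, dtype=float)
    # Displacement of every landmark between consecutive frames.
    pred_delta = np.diff(pred_seq, axis=0)
    target_delta = np.diff(target_seq, axis=0)
    return float(np.mean((pred_delta - target_delta) ** 2))
```

A loss of this kind ignores a constant offset of the whole sequence and reacts only to differences in motion, which is why it would normally be combined with a per-frame position loss.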

2026-03-11T01:51:11Z Xin Lu Chuanqing Zhuang Zhengda Lu Yiqun Wang Jun Xiao http://arxiv.org/abs/2412.00638v2 Sketch-Guided Stylized Landscape Cinemagraph Synthesis 2026-03-10T23:16:16Z

Designing stylized cinemagraphs is challenging due to the difficulty in customizing complex and expressive flow elements. To achieve intuitive and detailed control of the generated cinemagraphs, sketches provide a feasible solution to convey personalized design requirements beyond text inputs. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial landscape generation and provides sketch controls for both spatial and motion cues. The latent diffusion model first generates target stylized landscape images along with realistic versions. Then, a pre-trained object detection model obtains masks for the flow regions. We propose a latent motion diffusion model to estimate the motion field in fluid regions of the generated landscape images. The input motion sketches serve as the conditions to control the generated motion fields in the masked fluid regions with the prompt. To synthesize cinemagraph frames, the pixels within fluid regions are warped to target locations at each timestep using a U-Net based frame generator. The results verify that Sketch2Cinemagraph can generate aesthetically appealing stylized cinemagraphs with continuous temporal flow from sketch inputs. We showcase the advantages of Sketch2Cinemagraph through qualitative and quantitative comparisons against the state-of-the-art approaches.

2024-12-01T01:32:59Z 16 pages, 18 figures, accepted in Computer and Graphics Computers & Graphics, Volume 135, 2026 Hao Jin Hengyuan Chang Xiaoxuan Xie Zhengyang Wang Xusheng Du Shaojun Hu Haoran Xie 10.1016/j.cag.2026.104547 http://arxiv.org/abs/2603.10256v1 ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA 2026-03-10T22:23:36Z

Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
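
The two mechanisms named above can be sketched in a few lines. The abstract does not state the exact position offsets or the guidance formula, so this sketch assumes the simplest reading: reference tokens receive positions -n_ref, ..., -1 and generated tokens receive 0, ..., n_gen-1, and identity guidance uses the standard classifier-free guidance extrapolation. All function names, the `scale` parameter, and these choices are hypothetical.

```python
import numpy as np

def reference_and_generation_positions(n_ref, n_gen):
    """Assign temporal positions so the two token groups never overlap.

    Reference tokens sit at negative positions in their original order;
    generated tokens start at zero (assumed layout).
    """
    ref_pos = np.arange(-n_ref, 0)
    gen_pos = np.arange(0, n_gen)
    return ref_pos, gen_pos

def identity_guidance(pred_with_ref, pred_without_ref, scale):
    """Extrapolate from the reference-free prediction toward the
    reference-conditioned one (standard CFG form, assumed here)."""
    pred_with_ref = np.asarray(pred_with_ref, dtype=float)
    pred_without_ref = np.asarray(pred_without_ref, dtype=float)
    return pred_without_ref + scale * (pred_with_ref - pred_without_ref)
```

With `scale` equal to 1 the guided prediction is just the reference-conditioned one; larger values amplify whatever the reference signal contributes.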

2026-03-10T22:23:36Z Aviad Dahan Moran Yanuka Noa Kraicer Lior Wolf Raja Giryes http://arxiv.org/abs/2603.09925v1 On the Structural Failure of Chamfer Distance in 3D Shape Optimization 2026-03-10T17:21:23Z

Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5$\times$ improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
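
To make the many-to-one collapse concrete, here is a minimal NumPy sketch of the two directed Chamfer terms. It assumes the squared-Euclidean, mean-reduced convention and takes the "forward" term to be source-to-target; both are assumptions, since conventions vary and the abstract does not fix them. The function name is hypothetical.

```python
import numpy as np

def chamfer_terms(source, target):
    """Return (forward, backward) squared-L2 Chamfer terms.

    forward:  mean over source points of the distance to the nearest target.
    backward: mean over target points of the distance to the nearest source.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pairwise squared distances, shape (num_source, num_target).
    diff = source[:, None, :] - target[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    forward = float(d2.min(axis=1).mean())
    backward = float(d2.min(axis=0).mean())
    return forward, backward
```

If every source point sits on the same target point, the forward term is exactly zero even though most of the target is uncovered; only the backward term registers the failure.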

2026-03-10T17:21:23Z 27 pages, including supplementary material Chang-Yong Song David Hyde http://arxiv.org/abs/2603.09832v1 Prompt-Driven Color Accessibility Evaluation in Diffusion-based Image Generation Models 2026-03-10T15:55:29Z

Generative models are increasingly integrated into creative workflows. While text-to-image generation excels in visual quality and diversity, color accessibility for users with Color Vision Deficiencies (CVD) remains largely unexplored. Our work systematically evaluates color accessibility in images generated by a common pretrained diffusion model, prompted to improve accessibility across diverse categories. We quantify performance using established, off-the-shelf CVD simulation methods and introduce "CVDLoss", a new metric measuring differences in image gradients indicative of structural detail. We validate CVDLoss against a commonly used daltonization method, demonstrating its sensitivity to color accessibility modifications. Applying CVDLoss to model outputs reveals that existing diffusion models struggle to reliably respond to accessibility-focused prompts. Consequently, our study establishes CVDLoss as a valuable evaluation tool for accessibility-aware image generation and post-processing, offering insights into current generative models' limitations in addressing color accessibility.

2026-03-10T15:55:29Z Xinyao Zhuang Jose Echevarria Kaan Akşit http://arxiv.org/abs/2603.09548v1 A comprehensive study of time-of-flight non-line-of-sight imaging 2026-03-10T11:57:23Z

Time-of-Flight non-line-of-sight (ToF NLOS) imaging techniques provide state-of-the-art reconstructions of scenes hidden around corners by inverting the optical path of indirect photons scattered by visible surfaces and measured by picosecond resolution sensors. The emergence of a wide range of ToF NLOS imaging methods with heterogeneous formulae and hardware implementations obscures the assessment of both their theoretical and experimental aspects. We present a comprehensive study of a representative set of ToF NLOS imaging methods by discussing their similarities and differences under common formulation and hardware. We first outline the problem statement under a common general forward model for ToF NLOS measurements, and the typical assumptions that yield tractable inverse models. We discuss the relationship of the resulting simplified forward and inverse models to a family of Radon transforms, and how migrating these to the frequency domain relates to recent phasor-based virtual line-of-sight imaging models for NLOS imaging that obey the constraints of conventional lens-based imaging systems. We then evaluate performance of the selected methods on hidden scenes captured under the same hardware setup and similar photon counts. Our experiments show that existing methods share similar limitations on spatial resolution, visibility, and sensitivity to noise when operating under equal hardware constraints, with particular differences that stem from method-specific parameters. We expect our methodology to become a reference in future research on ToF NLOS imaging to obtain objective comparisons of existing and new methods.

2026-03-10T11:57:23Z Julio Marco Adrian Jarabo Ji Hyun Nam Alberto Tosi Diego Gutierrez Andreas Velten http://arxiv.org/abs/2412.18380v2 ARSGaussian: 3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis 2026-03-10T11:45:12Z

Novel View Synthesis (NVS) can reconstruct scenes from multi-view images and synthesize novel images from new viewpoints, which provides technical support for tasks such as target recognition and environmental perception. Aerial remote sensing can conveniently capture a wealth of multi-view images with just a few flights. However, the challenges brought by large distances and sparse viewing angles during collection can cause the model to easily produce floaters and overgrowth issues due to geometric estimation errors. This results in low visual quality and a lack of precise geometric estimation capabilities. Therefore, this study presents ARSGaussian, an innovative novel view synthesis (NVS) method for aerial remote sensing. The method incorporates LiDAR point cloud as constraints into the 3D Gaussian Splatting approach, adaptively guiding the Gaussians to grow and split along geometric benchmarks, thereby addressing the overgrowth and floaters issues. Additionally, considering the geometric distortions arising from data acquisition, coordinate transformations with distortion parameters are integrated to replace the simple pinhole camera model parameters to achieve pixel-level alignment between LiDAR point cloud and multi-view optical images, facilitating the accurate fusion of heterogeneous data and achieving the high-precision geo-alignment. Moreover, depth, normal and scale consistency losses are introduced into the regularization process to guide Gaussians toward real depth and plane representations, significantly improving geometric estimation accuracy. To address the current lack of dense airborne hybrid datasets, we have established and released AIR-LONGYAN, an open-source dataset containing a dense LiDAR point cloud (8 pts/m) and multi-view optical images captured by airborne scanners and cameras in diverse scenes....

2024-12-24T12:08:50Z This is the author's version of a work that was accepted for publication in [ISPRS]. Changes resulting from the publishing process... may not be reflected in this document ISPRS Journal of Photogrammetry and Remote Sensing, Volume 231, 2026, Pages 288-306, ISSN 0924-2716 Yiling Yao Bing Zhang Wenjuan Zhang Lianru Gao Dailiang Peng Bocheng Li Yaning Wang Bowen Wang 10.1016/j.isprsjprs.2025.10.022 http://arxiv.org/abs/2412.14776v3 Collaborative Problem Solving in Mixed Reality: A Study on Visual Graph Analysis 2026-03-09T23:23:39Z

Problem solving is a composite cognitive process, invoking a number of cognitive mechanisms, such as perception and memory. Individuals may form collectives to solve a given problem together in collaboration, especially when complexity is perceived to be high. To determine if and when collaborative problem solving is desired in the context of visual graph analysis, we compare ad hoc pairs to individuals and nominal pairs, when solving different tasks in mixed reality. We discuss the results of an experiment with 72 participants performed in two countries and three languages. We apply the concept of task instance complexity to quantify the visual demand of tasks used in the experiment. Our results show the importance of using nominal groups as a benchmark for evaluating collaborative virtual environments. We conclude that 3D graph representation is not sufficient to induce better collaborative results compared to the benchmark.

2024-12-19T12:02:47Z Pages: 1 -- 14 in IEEE Transactions on Visualization and Computer Graphics, March 2026 Dimitar Garkov Tommaso Piselli Emilio Di Giacomo Karsten Klein Giuseppe Liotta Fabrizio Montecchiani Falk Schreiber 10.1109/TVCG.2026.3671472 http://arxiv.org/abs/2603.08645v1 Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization 2026-03-09T17:24:11Z

Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.
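
The retrieval step described above can be illustrated with a small NumPy sketch. It assumes expression features are fixed-length vectors, that nearest neighbors are found by Euclidean distance, and that a random fraction of the features is replaced. The function name, the `replace_fraction` parameter, and the distance metric are hypothetical choices, not details given in the abstract.

```python
import numpy as np

def retrieval_augment(features, bank, replace_fraction, rng):
    """Replace a random subset of expression features with their nearest
    neighbors from an expression bank (assumed Euclidean retrieval).

    features: (n, d) subject features; bank: (m, d) bank features.
    Returns the augmented features and the indices that were replaced.
    """
    features = np.asarray(features, dtype=float)
    bank = np.asarray(bank, dtype=float)
    n = features.shape[0]
    n_replace = int(round(replace_fraction * n))
    chosen = rng.choice(n, size=n_replace, replace=False)
    # Squared distance from each chosen feature to every bank entry.
    diff = features[chosen][:, None, :] - bank[None, :, :]
    nearest = np.sum(diff ** 2, axis=-1).argmin(axis=1)
    augmented = features.copy()
    augmented[chosen] = bank[nearest]
    return augmented, chosen
```

In training, the augmented features would condition the deformation field while the reconstruction target remains the subject's original frame, as the abstract describes.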

2026-03-09T17:24:11Z Matan Levy Gavriel Habib Issar Tzachor Dvir Samuel Rami Ben-Ari Nir Darshan Or Litany Dani Lischinski http://arxiv.org/abs/2603.08503v1 Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction 2026-03-09T15:35:56Z

Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at https://github.com/1170632760/Spherical-GOF.
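
As background for sampling rays on the unit sphere, here is a minimal sketch that maps an equirectangular pixel to a unit ray direction. The axis convention (y up, z forward at the image center) and the half-pixel offset are assumptions, and the function name is hypothetical; the paper's own parameterization may differ.

```python
import numpy as np

def equirect_pixel_to_ray(u, v, width, height):
    """Map pixel coordinates of an equirectangular panorama to unit rays.

    Longitude spans [-pi, pi) across the width, latitude spans
    (pi/2, -pi/2) down the height (assumed convention).
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)
```

Under this mapping the solid angle covered by a pixel shrinks with the cosine of its latitude, which is one source of the distortion-varying pixel sampling mentioned above.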

2026-03-09T15:35:56Z The source code and dataset will be released at https://github.com/1170632760/Spherical-GOF Zhe Yang Guoqiang Zhao Sheng Wu Kai Luo Kailun Yang http://arxiv.org/abs/2510.25660v2 mitransient: Transient light transport in Mitsuba 3 2026-03-09T14:00:52Z

mitransient is a light transport simulation tool that extends Mitsuba 3 with support for time-resolved simulations. In essence, mitransient extends conventional rendering by adding a temporal dimension which accounts for the time of flight of light. This allows rapid prototyping of novel transient imaging systems without the need for costly or difficult-to-operate hardware. Our code is trivially easy to install through pip, and consists of Python modules that can run on both CPU and GPU by leveraging the JIT capabilities of Mitsuba 3. It provides physically-based simulations of complex phenomena, including a wide variety of realistic materials and participating media such as fog or smoke. In addition, we extend Mitsuba 3's functionality to support time-resolved polarization tracking of light and transient differentiable rendering. Finally, we also include tools that simplify the use of our simulations for non-line-of-sight imaging, enabling realistic scene setups with capture noise to be simulated in just seconds or minutes. Altogether, we hope that mitransient will support the research community in developing novel algorithms for transient imaging.
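
The idea of adding a temporal dimension to rendering can be illustrated without the library itself. The sketch below is plain NumPy and does not use the mitransient or Mitsuba 3 API: it takes a set of light paths, each with a total optical path length and a radiance contribution, and accumulates them into time bins by time of flight. The function name and parameters are hypothetical.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # metres per second, in vacuum

def transient_histogram(path_lengths_m, radiance, bin_width_s, n_bins):
    """Accumulate per-path radiance into time-of-flight bins.

    Paths arriving later than the last bin are discarded.
    """
    times = np.asarray(path_lengths_m, dtype=float) / SPEED_OF_LIGHT
    radiance = np.asarray(radiance, dtype=float)
    bins = np.floor(times / bin_width_s).astype(int)
    valid = (bins >= 0) & (bins < n_bins)
    hist = np.zeros(n_bins)
    # np.add.at handles several paths landing in the same bin.
    np.add.at(hist, bins[valid], radiance[valid])
    return hist
```

Summing the histogram over time recovers the conventional steady-state value for the paths that fall inside the recorded time window.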

2025-10-29T16:23:08Z 6 pages, 6 figures. For further documentation for mitransient see https://mitransient.readthedocs.io Diego Royo Jorge Garcia-Pueyo Miguel Crespo Guillermo Enguita Óscar Pueyo-Ciutad Diego Bielsa http://arxiv.org/abs/2510.22712v2 Step2Motion: Locomotion Reconstruction from Pressure Sensing Insoles 2026-03-09T10:16:26Z

Human motion is fundamentally driven by continuous physical interaction with the environment. Whether walking, running, or simply standing, the forces exchanged between our feet and the ground provide crucial insights for understanding and reconstructing human movement. Recent advances in wearable insole devices offer a compelling solution for capturing these forces in diverse, real-world scenarios. Sensor insoles pose no constraint on the users' motion (unlike mocap suits) and are unaffected by line-of-sight limitations (in contrast to optical systems). These qualities make sensor insoles an ideal choice for robust, unconstrained motion capture, particularly in outdoor environments. Surprisingly, leveraging these devices with recent motion reconstruction methods remains largely unexplored. Aiming to fill this gap, we present Step2Motion, the first approach to reconstruct human locomotion from multi-modal insole sensors. Our method utilizes pressure and inertial data (accelerations and angular rates) captured by the insoles to reconstruct human motion. We evaluate the effectiveness of our approach across a range of experiments to show its versatility for diverse locomotion styles, from simple ones like walking or jogging up to moving sideways, on tiptoes, slightly crouching, or dancing.

2025-10-26T15:12:02Z Eurographics 2026 Jose Luis Ponton Eduardo Alvarado Lin Geng Foo Nuria Pelechano Carlos Andujar Marc Habermann 10.1111/cgf.70405 http://arxiv.org/abs/2603.08023v1 Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model 2026-03-09T06:59:03Z

Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose MambaDance, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, replacing the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets for each sequence length show that our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances, compared to previous methods. Additional qualitative results and demo videos are available at https://vision3d-lab.github.io/mambadance.
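
The abstract does not spell out the form of the Gaussian-based beat representation. One plausible reading, sketched here under that assumption, replaces a binary per-frame beat flag with a smooth signal that peaks at each beat and decays with a Gaussian profile. The function name, the `sigma` parameter, and the max-over-beats reduction are hypothetical.

```python
import numpy as np

def gaussian_beat_signal(beat_frames, n_frames, sigma):
    """Per-frame beat signal: 1.0 on a beat, Gaussian falloff around it.

    beat_frames: frame indices of detected beats.
    sigma: falloff width in frames (assumed hyperparameter).
    """
    centers = np.asarray(beat_frames, dtype=float)
    if centers.size == 0:
        return np.zeros(n_frames)
    t = np.arange(n_frames, dtype=float)[:, None]
    bumps = np.exp(-0.5 * ((t - centers[None, :]) / sigma) ** 2)
    # Each frame takes the value of its strongest nearby beat.
    return bumps.max(axis=1)
```

Compared with a one-hot beat indicator, a signal like this also tells the model how close each frame is to the nearest beat.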

2026-03-09T06:59:03Z Accepted by WACV 2026 Sangjune Park Inhyeok Choi Donghyeon Soon Youngwoo Jeon Kyungdon Joo http://arxiv.org/abs/2603.07988v1 TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size 2026-03-09T05:52:13Z

Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.

2026-03-09T05:52:13Z CVPR 2026. Project page: https://splionar.github.io/TeamHOI/ Code: https://github.com/sail-sg/TeamHOI Stefan Lionar Gim Hee Lee http://arxiv.org/abs/2603.07776v1 Parameterized Brushstroke Style Transfer 2026-03-08T19:33:36Z

Computer vision-based style transfer techniques have been used for many years to represent artistic style. However, most contemporary methods are restricted to the pixel domain; in other words, they modify image pixels to incorporate artistic style. Real artistic work, by contrast, is made of brush strokes with different colors on a canvas, so pixel-based approaches are unnatural for representing these images. Hence, this paper discusses a style transfer method that represents the image in the brush stroke domain instead of the RGB domain, which yields better visual results than pixel-based methods.

2026-03-08T19:33:36Z Uma Meleti Siyu Huang