https://arxiv.org/api/1f+T5JToLcgWyMkxAAKC2kLFB+g 2026-06-25T22:11:13Z 9383 1365 15 http://arxiv.org/abs/2511.01894v1 LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency 2025-10-29T08:12:32Z

Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves 3x -- 5x speedup for lightweight editing and 2x for universal editing, requiring only 40% -- 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.

2025-10-29T08:12:32Z Fangbing Liu Pengfei Duan Wen Li Yi He http://arxiv.org/abs/2510.25234v1 Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation 2025-10-29T07:29:21Z

Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.

2025-10-29T07:29:21Z 18 pages, 6 figures, accepted to ICXR 2025 conference Yuxiang Mao Zhijie Zhang Zhiheng Zhang Jiawei Liu Chen Zeng Shihong Xia http://arxiv.org/abs/2510.25159v1 Fast and Robust Point Containment Queries on Trimmed Surface 2025-10-29T04:28:24Z

Point containment queries on trimmed surfaces are fundamental to CAD modeling, solid geometry processing, and surface tessellation. Existing approaches such as ray casting and generalized winding numbers often face limitations in robustness and computational efficiency. We propose a fast and numerically stable method for performing containment queries on trimmed surfaces, including those with periodic parameterizations. Our approach introduces a recursive winding number computation scheme that replaces costly curve subdivision with an ellipse-based bound for Bezier segments, enabling linear-time evaluation. For periodic surfaces, we lift trimming curves to the universal covering space, allowing accurate and consistent winding number computation even for non-contractible or discontinuous loops in parameter domain. Experiments show that our method achieves substantial speedups over existing winding-number algorithms while maintaining high robustness in the presence of geometric noise, open boundaries, and periodic topologies. We further demonstrate its effectiveness in processing real B-Rep models and in robust tessellation of trimmed surfaces.

2025-10-29T04:28:24Z Anchang Bao Enya Shen Jianmin Wang http://arxiv.org/abs/2510.25152v1 Off-Centered WoS-Type Solvers with Statistical Weighting 2025-10-29T04:09:50Z

Stochastic PDE solvers have emerged as a powerful alternative to traditional discretization-based methods for solving partial differential equations (PDEs), especially in geometry processing and graphics. While off-centered estimators enhance sample reuse in WoS-type Monte Carlo solvers, they introduce correlation artifacts and bias when Green's functions are approximated. In this paper, we propose a statistically weighted off-centered WoS-type estimator that leverages local similarity filtering to selectively combine samples across neighboring evaluation points. Our method balances bias and variance through a principled weighting strategy that suppresses unreliable estimators. We demonstrate our approach's effectiveness on various PDEs,including screened Poisson equations and boundary conditions, achieving consistent improvements over existing solvers such as vanilla Walk on Spheres, mean value caching, and boundary value caching. Our method also naturally extends to gradient field estimation and mixed boundary problems.

2025-10-29T04:09:50Z SIGGRAPH Asia 2025 conference paper Anchang Bao Jie Xu Enya Shen Jianmin Wang http://arxiv.org/abs/2510.24486v1 Fast and accurate neural reflectance transformation imaging through knowledge distillation 2025-10-28T15:00:07Z

Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...

2025-10-28T15:00:07Z 18 pages Tinsae G. Dulecha Leonardo Righetto Ruggero Pintus Enrico Gobbetti Andrea Giachetti http://arxiv.org/abs/2505.10755v3 Procedural Generation of Articulated Simulation-Ready Assets 2025-10-28T11:05:00Z

We introduce Infinigen-Articulated, a toolkit for generating realistic, procedurally generated articulated assets for robotics simulation. We include procedural generators for 18 common articulated object categories along with high-level utilities for use creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets along with their physical properties into common robotics simulators. Experiments demonstrate that assets sampled from these generators are effective for movable object segmentation, training generalizable reinforcement learning policies, and sim-to-real transfer of imitation learning policies.

2025-05-15T23:47:58Z Updated to include information on newly implemented assets, new experimental results (both simulation and real world), and additional features including material and dynamics parameters Abhishek Joshi Beining Han Jack Nugent Max Gonzalez Saez-Diez Yiming Zuo Jonathan Liu Hongyu Wen Stamatis Alexandropoulos Karhan Kayan Anna Calveri Tao Sun Gaowen Liu Yi Shao Alexander Raistrick Jia Deng http://arxiv.org/abs/2504.03099v3 Capturing Non-Linear Human Perspective in Line Drawings 2025-10-27T23:53:18Z

Artist-drawn sketches only loosely conform to analytical models of perspective projection; the deviation of human-drawn perspective from analytical perspective models is persistent and well documented, but has yet to be algorithmically replicated. We encode this deviation between human and analytic perspectives as a continuous function in 3D space and develop a method to learn it. We seek deviation functions that (i)mimic artist deviation on our training data; (ii)generalize to other shapes; (iii)are consistent across different views of the same shape; and (iv)produce outputs that appear human-drawn. The natural data for learning this deviation is pairs of artist sketches of 3D shapes and best-matching analytical camera views of the same shapes. However, a core challenge in learning perspective deviation is the heterogeneity of human drawing choices, combined with relative data paucity (the datasets we rely on have only a few dozen training pairs). We sidestep this challenge by learning perspective deviation from an individual pair of an artist sketch of a 3D shape and the contours of the same shape rendered from a best-matching analytical camera view. We first match contours of the depicted shape to artist strokes, then learn a spatially continuous local perspective deviation function that modifies the camera perspective projecting the contours to their corresponding strokes. This function retains key geometric properties that artists strive to preserve when depicting 3D content, thus satisfying (i) and (iv) above. We generalize our method to alternative shapes and views (ii, iii) via a self-augmentation approach that algorithmically generates training data for nearby views, and enforces spatial smoothness and consistency across all views. We compare our results to potential alternatives, demonstrating the superiority of the proposed approach.

2025-04-04T00:57:48Z Jinfan Yang Leo Foord-Kelcey Suzuran Takikawa Nicholas Vining Niloy Mitra Alla Sheffer http://arxiv.org/abs/2510.23605v1 Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling 2025-10-27T17:59:51Z

Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that align with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.

2025-10-27T17:59:51Z NeurIPS 2025, 38 pages, 22 figures Shuhong Zheng Ashkan Mirzaei Igor Gilitschenski http://arxiv.org/abs/2510.14081v3 Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images 2025-10-27T13:30:00Z

We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

2025-10-15T20:36:28Z This work received the Best Paper Honorable Mention at the AMFG Workshop, ICCV 2025 Emanuel Garbin Guy Adam Oded Krams Zohar Barzelay Eran Guendelman Michael Schwarz Matteo Presutto Moran Vatelmacher Yigal Shenkman Eli Peker Itai Druker Uri Patish Yoav Blum Max Bluvstein Junxuan Li Rawal Khirodkar Shunsuke Saito http://arxiv.org/abs/2510.23122v1 FlowCapX: Physics-Grounded Flow Capture with Long-Term Consistency 2025-10-27T08:55:50Z

We present FlowCapX, a physics-enhanced framework for flow reconstruction from sparse video inputs, addressing the challenge of jointly optimizing complex physical constraints and sparse observational data over long time horizons. Existing methods often struggle to capture turbulent motion while maintaining physical consistency, limiting reconstruction quality and downstream tasks. Focusing on velocity inference, our approach introduces a hybrid framework that strategically separates representation and supervision across spatial scales. At the coarse level, we resolve sparse-view ambiguities via a novel optimization strategy that aligns long-term observation with physics-grounded velocity fields. By emphasizing vorticity-based physical constraints, our method enhances physical fidelity and improves optimization stability. At the fine level, we prioritize observational fidelity to preserve critical turbulent structures. Extensive experiments demonstrate state-of-the-art velocity reconstruction, enabling velocity-aware downstream tasks, e.g., accurate flow analysis, scene augmentation with tracer visualization and re-simulation.

2025-10-27T08:55:50Z Ningxiao Tao Liru Zhang Xingyu Ni Mengyu Chu Baoquan Chen http://arxiv.org/abs/2501.13918v2 Improving Video Generation with Human Feedback 2025-10-27T08:22:57Z

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.

2025-01-23T18:55:41Z https://github.com/KwaiVGI/VideoAlign Jie Liu Gongye Liu Jiajun Liang Ziyang Yuan Xiaokun Liu Mingwu Zheng Xiele Wu Qiulin Wang Menghan Xia Xintao Wang Xiaohong Liu Fei Yang Pengfei Wan Di Zhang Kun Gai Yujiu Yang Wanli Ouyang http://arxiv.org/abs/2510.22632v1 Environment-aware Motion Matching 2025-10-26T11:28:50Z

Interactive applications demand believable characters that respond naturally to dynamic environments. Traditional character animation techniques often struggle to handle arbitrary situations, leading to a growing trend of dynamically selecting motion-captured animations based on predefined features. While Motion Matching has proven effective for locomotion by aligning to target trajectories, animating environment interactions and crowd behaviors remains challenging due to the need to consider surrounding elements. Existing approaches often involve manual setup or lack the naturalism of motion capture. Furthermore, in crowd animation, body animation is frequently treated as a separate process from trajectory planning, leading to inconsistencies between body pose and root motion. To address these limitations, we present Environment-aware Motion Matching, a novel real-time system for full-body character animation that dynamically adapts to obstacles and other agents, emphasizing the bidirectional relationship between pose and trajectory. In a preprocessing step, we extract shape, pose, and trajectory features from a motion capture database. At runtime, we perform an efficient search that matches user input and current pose while penalizing collisions with a dynamic environment. Our method allows characters to naturally adjust their pose and trajectory to navigate crowded scenes.

2025-10-26T11:28:50Z Published in ACM TOG and presented in SIGGRAPH ASIA 2025. Project webpage: https://upc-virvig.github.io/Environment-aware-Motion-Matching/ Jose Luis Ponton Sheldon Andrews Carlos Andujar Nuria Pelechano 10.1145/3763334 http://arxiv.org/abs/2509.20414v2 SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent 2025-10-26T04:10:24Z

Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.

2025-09-24T09:06:41Z Accepted by NeurIPS 2025, 26 pages Yandan Yang Baoxiong Jia Shujie Zhang Siyuan Huang http://arxiv.org/abs/2410.15068v4 A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Reconstruction 2025-10-26T03:39:20Z

Reconstruction of High Dynamic Range (HDR) from Low Dynamic Range (LDR) images is an important computer vision task. There is a significant amount of research utilizing both conventional non-learning methods and modern data-driven approaches, focusing on using both single-exposed and multi-exposed LDR for HDR image reconstruction. However, most current state-of-the-art methods require high-quality paired {LDR;HDR} datasets with limited literature use of unpaired datasets, that is, methods that learn the LDR-HDR mapping between domains. This paper proposes CycleHDR, a method that integrates self-supervision into a modified semantic- and cycle-consistent adversarial architecture that utilizes unpaired LDR and HDR datasets for training. Our method introduces novel artifact- and exposure-aware generators to address visual artifact removal. It also puts forward an encoder and loss to address semantic consistency, another under-explored topic. CycleHDR is the first to use semantic and contextual awareness for the LDR-HDR reconstruction task in a self-supervised setup. The method achieves state-of-the-art performance across several benchmark datasets and reconstructs high-quality HDR images. The official website of this work is available at: https://github.com/HrishavBakulBarua/Cycle-HDR

2024-10-19T11:11:58Z Hrishav Bakul Barua Kalin Stefanov Lemuel Lai En Che Abhinav Dhall KokSheik Wong Ganesh Krishnasamy http://arxiv.org/abs/2510.22199v1 MOGRAS: Human Motion with Grasping in 3D Scenes 2025-10-25T07:39:02Z

Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.

2025-10-25T07:39:02Z British Machine Vision Conference Workshop - From Scene Understanding to Human Modeling Kunal Bhosikar Siddharth Katageri Vivek Madhavaram Kai Han Charu Sharma