http://arxiv.org/abs/2511.01894v1
Title: LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency
Updated: 2025-10-29T08:12:32Z
Abstract: Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves 3x to 5x speedup for lightweight editing and 2x for universal editing, requiring only 40% to 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.
Published: 2025-10-29T08:12:32Z
Authors: Fangbing Liu, Pengfei Duan, Wen Li, Yi He

http://arxiv.org/abs/2510.25234v1
Title: Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Updated: 2025-10-29T07:29:21Z
Abstract: Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial.
Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
Published: 2025-10-29T07:29:21Z
Comment: 18 pages, 6 figures, accepted to ICXR 2025 conference
Authors: Yuxiang Mao, Zhijie Zhang, Zhiheng Zhang, Jiawei Liu, Chen Zeng, Shihong Xia

http://arxiv.org/abs/2510.25159v1
Title: Fast and Robust Point Containment Queries on Trimmed Surface
Updated: 2025-10-29T04:28:24Z
Abstract: Point containment queries on trimmed surfaces are fundamental to CAD modeling, solid geometry processing, and surface tessellation. Existing approaches such as ray casting and generalized winding numbers often face limitations in robustness and computational efficiency.
We propose a fast and numerically stable method for performing containment queries on trimmed surfaces, including those with periodic parameterizations. Our approach introduces a recursive winding number computation scheme that replaces costly curve subdivision with an ellipse-based bound for Bezier segments, enabling linear-time evaluation. For periodic surfaces, we lift trimming curves to the universal covering space, allowing accurate and consistent winding number computation even for non-contractible or discontinuous loops in the parameter domain.
Experiments show that our method achieves substantial speedups over existing winding-number algorithms while maintaining high robustness in the presence of geometric noise, open boundaries, and periodic topologies. We further demonstrate its effectiveness in processing real B-Rep models and in robust tessellation of trimmed surfaces.
Published: 2025-10-29T04:28:24Z
Authors: Anchang Bao, Enya Shen, Jianmin Wang

http://arxiv.org/abs/2510.25152v1
Title: Off-Centered WoS-Type Solvers with Statistical Weighting
Updated: 2025-10-29T04:09:50Z
Abstract: Stochastic PDE solvers have emerged as a powerful alternative to traditional discretization-based methods for solving partial differential equations (PDEs), especially in geometry processing and graphics. While off-centered estimators enhance sample reuse in WoS-type Monte Carlo solvers, they introduce correlation artifacts and bias when Green's functions are approximated. In this paper, we propose a statistically weighted off-centered WoS-type estimator that leverages local similarity filtering to selectively combine samples across neighboring evaluation points. Our method balances bias and variance through a principled weighting strategy that suppresses unreliable estimators. We demonstrate our approach's effectiveness on various PDEs, including screened Poisson equations and boundary conditions, achieving consistent improvements over existing solvers such as vanilla Walk on Spheres, mean value caching, and boundary value caching.
Our method also naturally extends to gradient field estimation and mixed boundary problems.
Published: 2025-10-29T04:09:50Z
Comment: SIGGRAPH Asia 2025 conference paper
Authors: Anchang Bao, Jie Xu, Enya Shen, Jianmin Wang

http://arxiv.org/abs/2510.24486v1
Title: Fast and accurate neural reflectance transformation imaging through knowledge distillation
Updated: 2025-10-28T15:00:07Z
Abstract: Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...
Published: 2025-10-28T15:00:07Z
Comment: 18 pages
Authors: Tinsae G. Dulecha, Leonardo Righetto, Ruggero Pintus, Enrico Gobbetti, Andrea Giachetti

http://arxiv.org/abs/2505.10755v3
Title: Procedural Generation of Articulated Simulation-Ready Assets
Updated: 2025-10-28T11:05:00Z
Abstract: We introduce Infinigen-Articulated, a toolkit for generating realistic, procedurally generated articulated assets for robotics simulation.
We include procedural generators for 18 common articulated object categories along with high-level utilities for creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets along with their physical properties into common robotics simulators. Experiments demonstrate that assets sampled from these generators are effective for movable object segmentation, training generalizable reinforcement learning policies, and sim-to-real transfer of imitation learning policies.
Published: 2025-05-15T23:47:58Z
Comment: Updated to include information on newly implemented assets, new experimental results (both simulation and real world), and additional features including material and dynamics parameters
Authors: Abhishek Joshi, Beining Han, Jack Nugent, Max Gonzalez Saez-Diez, Yiming Zuo, Jonathan Liu, Hongyu Wen, Stamatis Alexandropoulos, Karhan Kayan, Anna Calveri, Tao Sun, Gaowen Liu, Yi Shao, Alexander Raistrick, Jia Deng

http://arxiv.org/abs/2504.03099v3
Title: Capturing Non-Linear Human Perspective in Line Drawings
Updated: 2025-10-27T23:53:18Z
Abstract: Artist-drawn sketches only loosely conform to analytical models of perspective projection; the deviation of human-drawn perspective from analytical perspective models is persistent and well documented, but has yet to be algorithmically replicated. We encode this deviation between human and analytic perspectives as a continuous function in 3D space and develop a method to learn it. We seek deviation functions that (i) mimic artist deviation on our training data; (ii) generalize to other shapes; (iii) are consistent across different views of the same shape; and (iv) produce outputs that appear human-drawn. The natural data for learning this deviation is pairs of artist sketches of 3D shapes and best-matching analytical camera views of the same shapes. However, a core challenge in learning perspective deviation is the heterogeneity of human drawing choices, combined with relative data paucity (the datasets we rely on have only a few dozen training pairs).
We sidestep this challenge by learning perspective deviation from an individual pair of an artist sketch of a 3D shape and the contours of the same shape rendered from a best-matching analytical camera view. We first match contours of the depicted shape to artist strokes, then learn a spatially continuous local perspective deviation function that modifies the camera perspective projecting the contours to their corresponding strokes. This function retains key geometric properties that artists strive to preserve when depicting 3D content, thus satisfying (i) and (iv) above. We generalize our method to alternative shapes and views (ii, iii) via a self-augmentation approach that algorithmically generates training data for nearby views, and enforces spatial smoothness and consistency across all views. We compare our results to potential alternatives, demonstrating the superiority of the proposed approach.
Published: 2025-04-04T00:57:48Z
Authors: Jinfan Yang, Leo Foord-Kelcey, Suzuran Takikawa, Nicholas Vining, Niloy Mitra, Alla Sheffer

http://arxiv.org/abs/2510.23605v1
Title: Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
Updated: 2025-10-27T17:59:51Z
Abstract: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified.
Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
Published: 2025-10-27T17:59:51Z
Comment: NeurIPS 2025, 38 pages, 22 figures
Authors: Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski

http://arxiv.org/abs/2510.14081v3
Title: Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
Updated: 2025-10-27T13:30:00Z
Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people.
This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
Published: 2025-10-15T20:36:28Z
Comment: This work received the Best Paper Honorable Mention at the AMFG Workshop, ICCV 2025
Authors: Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Matteo Presutto, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito

http://arxiv.org/abs/2510.23122v1
Title: FlowCapX: Physics-Grounded Flow Capture with Long-Term Consistency
Updated: 2025-10-27T08:55:50Z
Abstract: We present FlowCapX, a physics-enhanced framework for flow reconstruction from sparse video inputs, addressing the challenge of jointly optimizing complex physical constraints and sparse observational data over long time horizons. Existing methods often struggle to capture turbulent motion while maintaining physical consistency, limiting reconstruction quality and downstream tasks. Focusing on velocity inference, our approach introduces a hybrid framework that strategically separates representation and supervision across spatial scales. At the coarse level, we resolve sparse-view ambiguities via a novel optimization strategy that aligns long-term observation with physics-grounded velocity fields. By emphasizing vorticity-based physical constraints, our method enhances physical fidelity and improves optimization stability. At the fine level, we prioritize observational fidelity to preserve critical turbulent structures.
Extensive experiments demonstrate state-of-the-art velocity reconstruction, enabling velocity-aware downstream tasks, e.g., accurate flow analysis, scene augmentation with tracer visualization and re-simulation.
Published: 2025-10-27T08:55:50Z
Authors: Ningxiao Tao, Liru Zhang, Xingyu Ni, Mengyu Chu, Baoquan Chen

http://arxiv.org/abs/2501.13918v2
Title: Improving Video Generation with Human Feedback
Updated: 2025-10-27T08:22:57Z
Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods.
Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
Published: 2025-01-23T18:55:41Z
Comment: https://github.com/KwaiVGI/VideoAlign
Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

http://arxiv.org/abs/2510.22632v1
Title: Environment-aware Motion Matching
Updated: 2025-10-26T11:28:50Z
Abstract: Interactive applications demand believable characters that respond naturally to dynamic environments. Traditional character animation techniques often struggle to handle arbitrary situations, leading to a growing trend of dynamically selecting motion-captured animations based on predefined features. While Motion Matching has proven effective for locomotion by aligning to target trajectories, animating environment interactions and crowd behaviors remains challenging due to the need to consider surrounding elements. Existing approaches often involve manual setup or lack the naturalism of motion capture. Furthermore, in crowd animation, body animation is frequently treated as a separate process from trajectory planning, leading to inconsistencies between body pose and root motion. To address these limitations, we present Environment-aware Motion Matching, a novel real-time system for full-body character animation that dynamically adapts to obstacles and other agents, emphasizing the bidirectional relationship between pose and trajectory. In a preprocessing step, we extract shape, pose, and trajectory features from a motion capture database. At runtime, we perform an efficient search that matches user input and current pose while penalizing collisions with a dynamic environment. Our method allows characters to naturally adjust their pose and trajectory to navigate crowded scenes.
Published: 2025-10-26T11:28:50Z
Comment: Published in ACM TOG and presented in SIGGRAPH ASIA 2025.
Project webpage: https://upc-virvig.github.io/Environment-aware-Motion-Matching/
Authors: Jose Luis Ponton, Sheldon Andrews, Carlos Andujar, Nuria Pelechano
DOI: 10.1145/3763334

http://arxiv.org/abs/2509.20414v2
Title: SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
Updated: 2025-10-26T04:10:24Z
Abstract: Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
Project website: https://scene-weaver.github.io/.
Published: 2025-09-24T09:06:41Z
Comment: Accepted by NeurIPS 2025, 26 pages
Authors: Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang

http://arxiv.org/abs/2410.15068v4
Title: A Cycle Ride to HDR: Semantics Aware Self-Supervised Framework for Unpaired LDR-to-HDR Image Reconstruction
Updated: 2025-10-26T03:39:20Z
Abstract: Reconstruction of High Dynamic Range (HDR) from Low Dynamic Range (LDR) images is an important computer vision task. There is a significant amount of research utilizing both conventional non-learning methods and modern data-driven approaches, focusing on using both single-exposed and multi-exposed LDR for HDR image reconstruction. However, most current state-of-the-art methods require high-quality paired {LDR;HDR} datasets with limited literature use of unpaired datasets, that is, methods that learn the LDR-HDR mapping between domains. This paper proposes CycleHDR, a method that integrates self-supervision into a modified semantic- and cycle-consistent adversarial architecture that utilizes unpaired LDR and HDR datasets for training. Our method introduces novel artifact- and exposure-aware generators to address visual artifact removal. It also puts forward an encoder and loss to address semantic consistency, another under-explored topic. CycleHDR is the first to use semantic and contextual awareness for the LDR-HDR reconstruction task in a self-supervised setup. The method achieves state-of-the-art performance across several benchmark datasets and reconstructs high-quality HDR images. The official website of this work is available at: https://github.com/HrishavBakulBarua/Cycle-HDR
Published: 2024-10-19T11:11:58Z
Authors: Hrishav Bakul Barua, Kalin Stefanov, Lemuel Lai En Che, Abhinav Dhall, KokSheik Wong, Ganesh Krishnasamy

http://arxiv.org/abs/2510.22199v1
Title: MOGRAS: Human Motion with Grasping in 3D Scenes
Updated: 2025-10-25T07:39:02Z
Abstract: Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction.
While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.
Published: 2025-10-25T07:39:02Z
Comment: British Machine Vision Conference Workshop - From Scene Understanding to Human Modeling
Authors: Kunal Bhosikar, Siddharth Katageri, Vivek Madhavaram, Kai Han, Charu Sharma