http://arxiv.org/abs/2510.14976v1 Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation 2025-10-16T17:59:56Z

Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.

2025-10-16T17:59:56Z Accepted to ICCV 2025. Project page: https://stevenlsw.github.io/ponimator/ Shaowei Liu Chuan Guo Bing Zhou Jian Wang

http://arxiv.org/abs/2412.10426v2 CAP: Evaluation of Persuasive and Creative Image Generation 2025-10-16T02:33:33Z

We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text consists of implicit messages. We further introduce a simple yet effective approach to enhance T2I models' capabilities in producing images that are better aligned, more creative, and more persuasive.

2024-12-10T19:54:59Z Aysan Aghazadeh Adriana Kovashka

http://arxiv.org/abs/2510.14146v1 PoissonNet: A Local-Global Approach for Learning on Surfaces 2025-10-15T22:25:44Z

Many network architectures exist for learning on meshes, yet their constructions entail delicate trade-offs between difficulty learning high-frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local-global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local-global learning scheme, which uses Poisson's equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally. Our local-global learning framework preserves the features' full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability -- both in the size of our datasets, and the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state-of-the-art performance on semantic segmentation and parameterizing highly-detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state-of-the-art architectures that learn on surfaces.
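
The local-global block described in this abstract (edit features in the gradient domain, then solve a Poisson system) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1D path graph in place of a triangle mesh, a fixed per-edge `scale` in place of a learned transformation, and the hypothetical helper names `solve` and `poisson_propagate`.

```python
def solve(A, b):
    """Dense Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def poisson_propagate(u, scale):
    """u: per-vertex scalars on a path graph; scale: one multiplier per edge."""
    n = len(u)
    # Local step: take differences along edges and transform them per edge.
    grad = [u[i + 1] - u[i] for i in range(n - 1)]
    new_grad = [s * g for s, g in zip(scale, grad)]
    # Right-hand side b = D^T g', where D is the edge-difference operator.
    b = [0.0] * n
    for e, g in enumerate(new_grad):
        b[e] -= g
        b[e + 1] += g
    # Graph Laplacian L = D^T D.
    L = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):
        L[e][e] += 1.0
        L[e + 1][e + 1] += 1.0
        L[e][e + 1] -= 1.0
        L[e + 1][e] -= 1.0
    # L is singular (constants are in its null space), so pin vertex 0.
    L[0] = [1.0] + [0.0] * (n - 1)
    b[0] = u[0]
    # Global step: one linear solve spreads every local edit to all vertices.
    return solve(L, b)
```

Because the solve is global, a change to a single edge's gradient shifts every vertex downstream of it, which is the sense in which the receptive field is global in this toy setting.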

2025-10-15T22:25:44Z In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 2025, 16 pages Arman Maesumi Tanish Makadia Thibault Groueix Vladimir G. Kim Daniel Ritchie Noam Aigerman 10.1145/3763298

http://arxiv.org/abs/2503.15225v2 A Personalized Data-Driven Generative Model of Human Repetitive Motion 2025-10-15T16:43:12Z

The deployment of autonomous virtual avatars (in extended reality) and robots in human group activities -- such as rehabilitation therapy, sports, and manufacturing -- is expected to increase as these technologies become more pervasive. Designing cognitive architectures and control strategies to drive these agents requires realistic models of human motion. Furthermore, recent research has shown that each person exhibits a unique velocity signature, highlighting how individual motor behaviors are both rich in variability and internally consistent. However, existing models only provide simplified descriptions of human motor behavior, hindering the development of effective cognitive architectures. In this work, we first show that motion amplitude provides a valid and complementary characterization of individual motor signatures. Then, we propose a fully data-driven approach, based on long short-term memory neural networks, to generate original motion that captures the unique features of specific individuals. We validate the architecture using real human data from participants performing spontaneous oscillatory motion. Extensive analyses show that state-of-the-art Kuramoto-like models fail to replicate individual motor signatures, whereas our model accurately reproduces the velocity distribution and amplitude envelopes of the individual it was trained on, while remaining distinct from others.
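
For readers unfamiliar with the Kuramoto-like baselines mentioned above, the following is a minimal sketch of the classic Kuramoto model of coupled phase oscillators, integrated with forward Euler. The abstract does not say which variant the authors compare against, so the all-to-all coupling, the step size, and the function names here are assumptions for illustration only.

```python
import math


def kuramoto_step(theta, omega, K, dt):
    """One Euler step of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(theta)
    return [
        th + dt * (w + (K / n) * sum(math.sin(tj - th) for tj in theta))
        for th, w in zip(theta, omega)
    ]


def order_parameter(theta):
    """Magnitude of the mean phase vector: 1 means full synchrony."""
    n = len(theta)
    re = sum(math.cos(t) for t in theta) / n
    im = sum(math.sin(t) for t in theta) / n
    return math.hypot(re, im)


def simulate(theta0, omega, K, dt=0.01, steps=2000):
    theta = list(theta0)
    for _ in range(steps):
        theta = kuramoto_step(theta, omega, K, dt)
    return theta
```

With identical natural frequencies and positive coupling, for example `simulate([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], K=2.0)`, the phases lock together. Each oscillator is described only by a phase and a natural frequency, which gives some intuition for why such models offer a simplified description of individual motion.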

2025-03-19T14:03:20Z 12 pages, 6 figures Angelo Di Porzio Marco Coraggio

http://arxiv.org/abs/2502.03207v2 MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent 2025-10-15T14:53:56Z

We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
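
The idea of composing object motion and camera motion into a single optical flow can be sketched for one 3D point. This is an illustrative simplification rather than the paper's module: it assumes a pinhole camera with identity rotation (translation-only extrinsics), made-up intrinsics, and the hypothetical function names `project` and `composed_flow`.

```python
def project(point, cam_center, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection for a camera at cam_center looking down +z (no rotation)."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z + cx, focal * y / z + cy)


def composed_flow(point, obj_disp, cam0, cam1):
    """2D flow of one 3D point between two frames.

    obj_disp is the object's 3D displacement; cam0 and cam1 are the camera
    centers in the first and second frame.
    """
    u0, v0 = project(point, cam0)
    moved = tuple(p + d for p, d in zip(point, obj_disp))
    u1, v1 = project(moved, cam1)
    return (u1 - u0, v1 - v0)
```

In this setup a static point seen by a camera that translates right appears to move left, and an object that translates together with the camera produces zero flow, so both sources of motion end up in one 2D field.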

2025-02-05T14:26:07Z Xinyao Liao Xianfang Zeng Liao Wang Gang Yu Guosheng Lin Chi Zhang

http://arxiv.org/abs/2410.05038v3 GARField: Addressing the visual Sim-to-Real gap in garment manipulation with mesh-attached radiance fields 2025-10-15T12:02:49Z

While humans intuitively manipulate garments and other textile items swiftly and accurately, it is a significant challenge for robots. A factor crucial to human performance is the ability to imagine, a priori, the intended result of the manipulation intents and hence develop predictions on the garment pose. That ability allows us to plan from highly obstructed states, adapt our plans as we collect more information and react swiftly to unforeseen circumstances. Conversely, robots struggle to establish such intuitions and form tight links between plans and observations. We can partly attribute this to the high cost of obtaining densely labelled data for textile manipulation, both in quality and quantity. The problem of data collection is a long-standing issue in data-based approaches to garment manipulation. As of today, generating high-quality and labelled garment manipulation data is mainly attempted through advanced data capture procedures that create simplified state estimations from real-world observations. However, this work proposes a novel approach to the problem by generating real-world observations from object states. To achieve this, we present GARField (Garment Attached Radiance Field), the first differentiable rendering architecture, to our knowledge, for data generation from simulated states stored as triangle meshes. Code is available on https://ddonatien.github.io/garfield-website/

2024-10-07T13:50:15Z Project site: https://ddonatien.github.io/garfield-website/ Donatien Delehelle Darwin G. Caldwell Fei Chen 10.1109/ROBIO64047.2024.10907327

http://arxiv.org/abs/2510.13381v1 Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering 2025-10-15T10:21:36Z

Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improves further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and scene composition.

2025-10-15T10:21:36Z Accepted at ICCV-2025, project page: https://dynamic-ugsdf.github.io/ Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025 Siddharth Tourani Jayaram Reddy Akash Kumbar Satyajit Tourani Nishant Goyal Madhava Krishna N. Dinesh Reddy Muhammad Haris Khan 10.1109/ICCV51701.2025.02698

http://arxiv.org/abs/2504.05803v3 PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis 2025-10-15T09:22:15Z

Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lips. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability between corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show PASE significantly improves lip sync accuracy and achieves state-of-the-art performance across both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7% and 14.2%, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve the lip sync accuracy without architectural modifications.
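
The abstract does not specify the loss used by the contrastive alignment module. As a hedged illustration of how contrastive alignment between paired audio and visual embeddings commonly works, here is a minimal InfoNCE-style loss. The choice of InfoNCE, the temperature value, and the function names are all assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio, visual, temperature=0.1):
    """Mean cross-entropy of matching audio[i] to visual[i] among all visual items."""
    losses = []
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each audio embedding is closest to its own visual partner and large when the pairs are mismatched, which is the discriminability property the abstract refers to.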

2025-04-08T08:35:59Z Yihuan Huang Jiajun Liu Yanzhen Ren Jun Xue Wuyang Liu Zongkun Sun

http://arxiv.org/abs/2510.13303v1 Automated document processing system for government agencies using DBNET++ and BART models 2025-10-15T08:48:02Z

An automatic document classification system is presented that detects textual content in images and classifies documents into four predefined categories (Invoice, Report, Letter, and Form). The system supports both offline images (e.g., files on flash drives, HDDs, microSD) and real-time capture via connected cameras, and is designed to mitigate practical challenges such as variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. The pipeline comprises four stages: image capture and preprocessing, text detection [1] using a DBNet++ (Differentiable Binarization Network Plus) detector, and text classification [2] using a BART (Bidirectional and Auto-Regressive Transformers) classifier, all integrated within a user interface implemented in Python with PyQt5. For text detection in images, the system achieved about 92.88% over 10 hours on the Total-Text dataset, which contains high-resolution images that simulate a variety of very difficult challenges. The results indicate the proposed approach is effective for practical, mixed-source document categorization in unconstrained imaging scenarios.

2025-10-15T08:48:02Z 8 pages, 12 figures, article International Journal of Circuit, Computing and Networking 2025; 6(2): 34-41 Aya Kaysan Bahjat 10.33545/27075923.2025.v6.i2a.100

http://arxiv.org/abs/2507.20881v2 Endoscopic Depth Estimation Based on Deep Learning: A Survey 2025-10-15T08:16:08Z

Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications. Firstly, at the data level, we describe the acquisition process of publicly available datasets. Secondly, at the methodological level, we introduce both monocular and stereo deep learning-based approaches for endoscopic depth estimation. Thirdly, at the application level, we identify the specific challenges and corresponding solutions for the clinical implementation of depth estimation technology, situated within concrete clinical scenarios. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and the synergistic fusion of depth information with sensor technologies, thereby providing a valuable starting point for researchers to engage with and advance the field toward clinical translation.

2025-07-28T14:34:45Z Ke Niu Zeyun Liu Xue Feng Heng Li Qika Lin Kaize Shi

http://arxiv.org/abs/2402.18074v2 A Bijective Image Retargeting Algorithm Based on Conformal Energy 2025-10-15T07:05:01Z

Image retargeting, which resizes an image to a prescribed aspect ratio by determining an optimal warping map, has gained substantial interest in imaging science. Despite significant advances, existing methods often fail to ensure bijective warping maps essential for preserving visual information. This paper introduces a novel bijective image retargeting model through conformal energy minimization of the deformation field. The proposed model establishes mathematical rigor by proving the well-posedness of the optimal warping map in both continuous and discrete settings and showing that the discrete solutions converge to their continuous counterpart under mesh refinement. Numerical experiments corroborate the model's efficacy and the convergence of discrete solutions during progressive mesh subdivision processes, validating both theoretical guarantees and practical performance.

2024-02-28T06:06:34Z Chengyang Liu Michael K. Ng

http://arxiv.org/abs/2510.13168v1 MiGumi: Making Tightly Coupled Integral Joints Millable 2025-10-15T05:37:44Z

Traditional integral wood joints, despite their strength, durability, and elegance, remain rare in modern workflows due to the cost and difficulty of manual fabrication. CNC milling offers a scalable alternative, but directly milling traditional joints often fails to produce functional results because milling induces geometric deviations, such as rounded inner corners, that alter the target geometries of the parts. Since joints rely on tightly fitting surfaces, such deviations introduce gaps or overlaps that undermine fit or block assembly. We propose to overcome this problem by (1) designing a language that represents millable geometry, and (2) co-optimizing part geometries to restore coupling. We introduce Millable Extrusion Geometry (MXG), a language for representing geometry as the outcome of milling operations performed with flat-end drill bits. MXG represents each operation as a subtractive extrusion volume defined by a tool direction and drill radius. This parameterization enables the modeling of artifact-free geometry under an idealized zero-radius drill bit, matching traditional joint designs. Increasing the radius then reveals milling-induced deviations, which compromise the integrity of the joint. To restore coupling, we formalize tight coupling in terms of both surface proximity and proximity constraints on the mill-bit paths associated with mating surfaces. We then derive two tractable, differentiable losses that enable efficient optimization of joint geometry. We evaluate our method on 30 traditional joint designs, demonstrating that it produces CNC-compatible, tightly fitting joints that approximate the original geometry. By reinterpreting traditional joints for CNC workflows, we continue the evolution of this heritage craft and help ensure its relevance in future making practices.
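
The rounded inner corners mentioned above follow from elementary geometry, which a short sketch can make concrete. This is not MXG itself, only an illustration under the assumption of a cylindrical tool cutting a 2D inner corner; the function name `corner_residual` is hypothetical. A tool of radius r that touches both walls of a corner with interior angle theta has its center at distance r / sin(theta / 2) from the corner vertex, so the uncut material extends r * (1 / sin(theta / 2) - 1) from the vertex along the bisector.

```python
import math


def corner_residual(radius, angle_deg=90.0):
    """Distance from an inner-corner vertex to the nearest point the tool can reach.

    radius: tool radius; angle_deg: interior angle of the corner in degrees.
    """
    half = math.radians(angle_deg) / 2.0
    return radius * (1.0 / math.sin(half) - 1.0)
```

For a right-angle corner the residual is about 0.414 times the tool radius, and it vanishes as the radius goes to zero, consistent with the abstract's remark that an idealized zero-radius bit yields artifact-free geometry.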

2025-10-15T05:37:44Z SIGGRAPH Asia/TOG 2025; project page: https://bardofcodes.github.io/migumi/ Aditya Ganeshan Kurt Fleischer Wenzel Jakob Ariel Shamir Daniel Ritchie Takeo Igarashi Maria Larsson

http://arxiv.org/abs/2405.16807v3 Extreme Compression of Adaptive Neural Images 2025-10-14T20:42:42Z

Implicit Neural Representations (INRs) and Neural Fields are a novel paradigm for signal representation, from images and audio to 3D scenes and videos. The fundamental idea is to represent a signal as a continuous and differentiable neural network. This new approach poses new theoretical questions and challenges. Considering a neural image as a 2D image represented as a neural network, we aim to explore novel neural image compression. In this work, we present a novel analysis of compressing neural fields, with a focus on images, and introduce Adaptive Neural Images (ANI), an efficient neural representation that enables adaptation to different inference or transmission requirements. Our proposed method allows us to reduce the bits-per-pixel (bpp) of the neural image by 8 times, without losing sensitive details or harming fidelity. Our work offers a new framework for developing compressed neural fields. We achieve a new state-of-the-art in terms of PSNR/bpp trade-off thanks to our successful implementation of 4-bit neural representations.
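
To make the bits-per-pixel quantity concrete, the sketch below shows plain uniform quantization of network weights and the resulting bpp of a neural image. It is an assumption-laden toy, not the ANI method: the quantizer is a generic min-max scheme, storage overhead for the offset and scale is ignored, and all names are hypothetical.

```python
def quantize(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] with a min-max uniform grid."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]


def bits_per_pixel(num_params, bits_per_param, width, height):
    """Bits spent on network weights divided by the number of pixels represented."""
    return num_params * bits_per_param / (width * height)
```

In this toy accounting, storing the same number of parameters at 4 bits instead of 32 bits lowers the bpp by a factor of 8. Whether that is how the paper's reported reduction is obtained is not stated in the abstract.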

2024-05-27T03:54:09Z ICCV 2025 Workshop - Binary and Extreme Quantization for Computer Vision Leo Hoshikawa Marcos V. Conde Takeshi Ohashi Atsushi Irie

http://arxiv.org/abs/2412.16461v2 Optimizing Parameters for Static Equilibrium of Discrete Elastic Rods with Active-Set Cholesky 2025-10-14T20:05:05Z

We propose a parameter optimization method for achieving static equilibrium of discrete elastic rods. Our method simultaneously optimizes material stiffness and rest shape parameters under box constraints to exactly enforce zero net force while avoiding stability issues and violations of physical laws. For efficiency, we split our constrained optimization problem into primal and dual subproblems via the augmented Lagrangian method, while handling the dual subproblem via simple vector updates. To efficiently solve the box-constrained primal subproblem, we propose a new active-set Cholesky preconditioner. Our method surpasses prior work in generality, robustness, and speed.
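
The primal/dual split described above can be sketched on a toy problem. This is not the authors' solver: the rod equilibrium condition is replaced by a single linear constraint (the entries of x must sum to 1), the objective is a simple quadratic, and the box-constrained primal subproblem is solved by projected gradient descent rather than the paper's active-set Cholesky preconditioner. All names are hypothetical.

```python
def clip(v, lo, hi):
    return max(lo, min(hi, v))


def augmented_lagrangian(target, lo, hi, rho=10.0, outer=30, inner=200):
    """Minimize 0.5 * ||x - target||^2 subject to sum(x) = 1 and lo <= x_i <= hi."""
    x = [clip(0.0, lo, hi) for _ in target]
    lam = 0.0  # multiplier for the equality constraint
    # The Hessian of the augmented objective is I + rho * ones * ones^T,
    # whose largest eigenvalue is 1 + rho * n; use its inverse as step size.
    step = 1.0 / (1.0 + rho * len(target))
    for _ in range(outer):
        # Primal subproblem: projected gradient descent inside the box.
        for _ in range(inner):
            c = sum(x) - 1.0
            grad = [xi - ti + lam + rho * c for xi, ti in zip(x, target)]
            x = [clip(xi - step * g, lo, hi) for xi, g in zip(x, grad)]
        # Dual subproblem: a simple multiplier update.
        lam += rho * (sum(x) - 1.0)
    return x, lam
```

For example, `augmented_lagrangian([2.0, 0.0], 0.0, 0.8)` should converge to x close to (0.8, 0.2): the first coordinate is held at its upper bound while the equality constraint is met as the multiplier converges.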

2024-12-21T03:26:16Z Tetsuya Takahashi Christopher Batty

http://arxiv.org/abs/2510.12785v1 MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars 2025-10-14T17:56:14Z

Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.

2025-10-14T17:56:14Z 18 pages, 12 figures Felix Taubner Ruihang Zhang Mathieu Tuli Sherwin Bahmani David B. Lindell