http://arxiv.org/abs/2510.14976v1
Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation
2025-10-16T17:59:56Z
Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.
2025-10-16T17:59:56Z
Accepted to ICCV 2025. Project page: https://stevenlsw.github.io/ponimator/
Shaowei Liu, Chuan Guo, Bing Zhou, Jian Wang

http://arxiv.org/abs/2412.10426v2
CAP: Evaluation of Persuasive and Creative Image Generation
2025-10-16T02:33:33Z
We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images.
Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text consists of implicit messages. We further introduce a simple yet effective approach to enhance T2I models' capabilities in producing images that are better aligned, more creative, and more persuasive.
2024-12-10T19:54:59Z
Aysan Aghazadeh, Adriana Kovashka

http://arxiv.org/abs/2510.14146v1
PoissonNet: A Local-Global Approach for Learning on Surfaces
2025-10-15T22:25:44Z
Many network architectures exist for learning on meshes, yet their constructions entail delicate trade-offs between difficulty learning high-frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local-global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local-global learning scheme, which uses Poisson's equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally.
Our local-global learning framework preserves the features' full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability -- both in the size of our datasets, and the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state-of-the-art performance on semantic segmentation and parameterizing highly-detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state-of-the-art architectures that learn on surfaces.
2025-10-15T22:25:44Z
In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 2025, 16 pages
Arman Maesumi, Tanish Makadia, Thibault Groueix, Vladimir G. Kim, Daniel Ritchie, Noam Aigerman
10.1145/3763298

http://arxiv.org/abs/2503.15225v2
A Personalized Data-Driven Generative Model of Human Repetitive Motion
2025-10-15T16:43:12Z
The deployment of autonomous virtual avatars (in extended reality) and robots in human group activities -- such as rehabilitation therapy, sports, and manufacturing -- is expected to increase as these technologies become more pervasive. Designing cognitive architectures and control strategies to drive these agents requires realistic models of human motion. Furthermore, recent research has shown that each person exhibits a unique velocity signature, highlighting how individual motor behaviors are both rich in variability and internally consistent. However, existing models only provide simplified descriptions of human motor behavior, hindering the development of effective cognitive architectures. In this work, we first show that motion amplitude provides a valid and complementary characterization of individual motor signatures.
Then, we propose a fully data-driven approach, based on long short-term memory neural networks, to generate original motion that captures the unique features of specific individuals. We validate the architecture using real human data from participants performing spontaneous oscillatory motion. Extensive analyses show that state-of-the-art Kuramoto-like models fail to replicate individual motor signatures, whereas our model accurately reproduces the velocity distribution and amplitude envelopes of the individual it was trained on, while remaining distinct from others.
2025-03-19T14:03:20Z
12 pages, 6 figures
Angelo Di Porzio, Marco Coraggio

http://arxiv.org/abs/2502.03207v2
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
2025-10-15T14:53:56Z
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion.
We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
2025-02-05T14:26:07Z
Xinyao Liao, Xianfang Zeng, Liao Wang, Gang Yu, Guosheng Lin, Chi Zhang

http://arxiv.org/abs/2410.05038v3
GARField: Addressing the visual Sim-to-Real gap in garment manipulation with mesh-attached radiance fields
2025-10-15T12:02:49Z
While humans intuitively manipulate garments and other textile items swiftly and accurately, it is a significant challenge for robots. A factor crucial to human performance is the ability to imagine, a priori, the intended result of the manipulation intents and hence develop predictions on the garment pose. That ability allows us to plan from highly obstructed states, adapt our plans as we collect more information and react swiftly to unforeseen circumstances. Conversely, robots struggle to establish such intuitions and form tight links between plans and observations. We can partly attribute this to the high cost of obtaining densely labelled data for textile manipulation, both in quality and quantity. The problem of data collection is a long-standing issue in data-based approaches to garment manipulation. As of today, generating high-quality and labelled garment manipulation data is mainly attempted through advanced data capture procedures that create simplified state estimations from real-world observations. However, this work proposes a novel approach to the problem by generating real-world observations from object states. To achieve this, we present GARField (Garment Attached Radiance Field), the first differentiable rendering architecture, to our knowledge, for data generation from simulated states stored as triangle meshes. Code is available at https://ddonatien.github.io/garfield-website/
2024-10-07T13:50:15Z
Project site: https://ddonatien.github.io/garfield-website/
Donatien Delehelle, Darwin G.
Caldwell, Fei Chen
10.1109/ROBIO64047.2024.10907327

http://arxiv.org/abs/2510.13381v1
Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering
2025-10-15T10:21:36Z
Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improves further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and scene composition.
2025-10-15T10:21:36Z
Accepted at ICCV-2025, project page: https://dynamic-ugsdf.github.io/
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
Siddharth Tourani, Jayaram Reddy, Akash Kumbar, Satyajit Tourani, Nishant Goyal, Madhava Krishna, N.
Dinesh Reddy, Muhammad Haris Khan
10.1109/ICCV51701.2025.02698

http://arxiv.org/abs/2504.05803v3
PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis
2025-10-15T09:22:15Z
Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lips. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability between corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show PASE significantly improves lip sync accuracy and achieves state-of-the-art performance across both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7% and 14.2%, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve the lip sync accuracy without architectural modifications.
2025-04-08T08:35:59Z
Yihuan Huang, Jiajun Liu, Yanzhen Ren, Jun Xue, Wuyang Liu, Zongkun Sun

http://arxiv.org/abs/2510.13303v1
Automated document processing system for government agencies using DBNET++ and BART models
2025-10-15T08:48:02Z
An automatic document classification system is presented that detects textual content in images and classifies documents into four predefined categories (Invoice, Report, Letter, and Form).
The system supports both offline images (e.g., files on flash drives, HDDs, microSD) and real-time capture via connected cameras, and is designed to mitigate practical challenges such as variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. The pipeline comprises four stages: image capture and preprocessing, text detection [1] using a DBNet++ (Differentiable Binarization Network Plus) detector, and text classification [2] using a BART (Bidirectional and Auto-Regressive Transformers) classifier, all integrated within a user interface implemented in Python with PyQt5. Text detection results were good, at about 92.88% over 10 hours on the Total-Text dataset, which contains high-resolution images that simulate a variety of very difficult challenges. The results indicate the proposed approach is effective for practical, mixed-source document categorization in unconstrained imaging scenarios.
2025-10-15T08:48:02Z
8 pages, 12 figures, article
International Journal of Circuit, Computing and Networking 2025; 6(2): 34-41
Aya Kaysan Bahjat
10.33545/27075923.2025.v6.i2a.100

http://arxiv.org/abs/2507.20881v2
Endoscopic Depth Estimation Based on Deep Learning: A Survey
2025-10-15T08:16:08Z
Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications.
Firstly, at the data level, we describe the acquisition process of publicly available datasets. Secondly, at the methodological level, we introduce both monocular and stereo deep learning-based approaches for endoscopic depth estimation. Thirdly, at the application level, we identify the specific challenges and corresponding solutions for the clinical implementation of depth estimation technology, situated within concrete clinical scenarios. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and the synergistic fusion of depth information with sensor technologies, thereby providing a valuable starting point for researchers to engage with and advance the field toward clinical translation.
2025-07-28T14:34:45Z
Ke Niu, Zeyun Liu, Xue Feng, Heng Li, Qika Lin, Kaize Shi

http://arxiv.org/abs/2402.18074v2
A Bijective Image Retargeting Algorithm Based on Conformal Energy
2025-10-15T07:05:01Z
Image retargeting, which resizes an image to a prescribed aspect ratio by determining an optimal warping map, has gained substantial interest in imaging science. Despite significant advances, existing methods often fail to ensure bijective warping maps essential for preserving visual information. This paper introduces a novel bijective image retargeting model through conformal energy minimization of the deformation field. The proposed model establishes mathematical rigor by proving the well-posedness for the optimal warping map in both continuous and discrete settings and showing that the discrete solutions converge to their continuous counterpart under mesh refinement. Numerical experiments corroborate the model's efficacy and the convergence of discrete solutions during progressive mesh subdivision processes, validating both theoretical guarantees and practical performance.
2024-02-28T06:06:34Z
Chengyang Liu, Michael K.
Ng

http://arxiv.org/abs/2510.13168v1
MiGumi: Making Tightly Coupled Integral Joints Millable
2025-10-15T05:37:44Z
Traditional integral wood joints, despite their strength, durability, and elegance, remain rare in modern workflows due to the cost and difficulty of manual fabrication. CNC milling offers a scalable alternative, but directly milling traditional joints often fails to produce functional results because milling induces geometric deviations, such as rounded inner corners, that alter the target geometries of the parts. Since joints rely on tightly fitting surfaces, such deviations introduce gaps or overlaps that undermine fit or block assembly. We propose to overcome this problem by (1) designing a language that represents millable geometry, and (2) co-optimizing part geometries to restore coupling. We introduce Millable Extrusion Geometry (MXG), a language for representing geometry as the outcome of milling operations performed with flat-end drill bits. MXG represents each operation as a subtractive extrusion volume defined by a tool direction and drill radius. This parameterization enables the modeling of artifact-free geometry under an idealized zero-radius drill bit, matching traditional joint designs. Increasing the radius then reveals milling-induced deviations, which compromise the integrity of the joint. To restore coupling, we formalize tight coupling in terms of both surface proximity and proximity constraints on the mill-bit paths associated with mating surfaces. We then derive two tractable, differentiable losses that enable efficient optimization of joint geometry. We evaluate our method on 30 traditional joint designs, demonstrating that it produces CNC-compatible, tightly fitting joints that approximate the original geometry.
By reinterpreting traditional joints for CNC workflows, we continue the evolution of this heritage craft and help ensure its relevance in future making practices.
2025-10-15T05:37:44Z
SIGGRAPH Asia/TOG 2025; project page: https://bardofcodes.github.io/migumi/
Aditya Ganeshan, Kurt Fleischer, Wenzel Jakob, Ariel Shamir, Daniel Ritchie, Takeo Igarashi, Maria Larsson

http://arxiv.org/abs/2405.16807v3
Extreme Compression of Adaptive Neural Images
2025-10-14T20:42:42Z
Implicit Neural Representations (INRs) and Neural Fields are a novel paradigm for signal representation, from images and audio to 3D scenes and videos. The fundamental idea is to represent a signal as a continuous and differentiable neural network. This new approach poses new theoretical questions and challenges. Considering a neural image as a 2D image represented as a neural network, we aim to explore novel neural image compression. In this work, we present a novel analysis of compressing neural fields, with a focus on images, and introduce Adaptive Neural Images (ANI), an efficient neural representation that enables adaptation to different inference or transmission requirements. Our proposed method allows us to reduce the bits-per-pixel (bpp) of the neural image by 8 times, without losing sensitive details or harming fidelity. Our work offers a new framework for developing compressed neural fields. We achieve a new state-of-the-art in terms of PSNR/bpp trade-off thanks to our successful implementation of 4-bit neural representations.
2024-05-27T03:54:09Z
ICCV 2025 Workshop - Binary and Extreme Quantization for Computer Vision
Leo Hoshikawa, Marcos V. Conde, Takeshi Ohashi, Atsushi Irie

http://arxiv.org/abs/2412.16461v2
Optimizing Parameters for Static Equilibrium of Discrete Elastic Rods with Active-Set Cholesky
2025-10-14T20:05:05Z
We propose a parameter optimization method for achieving static equilibrium of discrete elastic rods.
Our method simultaneously optimizes material stiffness and rest shape parameters under box constraints to exactly enforce zero net force while avoiding stability issues and violations of physical laws. For efficiency, we split our constrained optimization problem into primal and dual subproblems via the augmented Lagrangian method, while handling the dual subproblem via simple vector updates. To efficiently solve the box-constrained primal subproblem, we propose a new active-set Cholesky preconditioner. Our method surpasses prior work in generality, robustness, and speed.
2024-12-21T03:26:16Z
Tetsuya Takahashi, Christopher Batty

http://arxiv.org/abs/2510.12785v1
MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars
2025-10-14T17:56:14Z
Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions.
Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.
2025-10-14T17:56:14Z
18 pages, 12 figures
Felix Taubner, Ruihang Zhang, Mathieu Tuli, Sherwin Bahmani, David B. Lindell