http://arxiv.org/abs/2510.06517v1 Visualizing Multimodality in Combinatorial Search Landscapes 2025-10-07T23:29:19Z This work walks through different visualization techniques for combinatorial search landscapes, focusing on multimodality. We discuss different techniques from the landscape analysis literature, and how they can be combined to provide a more comprehensive view of the search landscape. We also include examples and discuss relevant work to show how others have used these techniques in practice, based on the geometric and aesthetic elements of the Grammar of Graphics. We conclude that there is no free lunch in visualization, and provide recommendations for future work as there are several paths to continue the work in this field. 2025-10-07T23:29:19Z 18 pages, 9 figures, Poster presented at the 2025 Symposium of the Norwegian Artificial Intelligence Society (NAIS 2025) on June 18, 2025 Xavier F. C. Sánchez-Díaz Ole Jakob Mengshoel http://arxiv.org/abs/2509.13306v3 Temporally Smooth Mesh Extraction for Procedural Scenes with Long-Range Camera Trajectories using Spacetime Octrees 2025-10-07T20:46:16Z The procedural occupancy function is a flexible and compact representation for creating 3D scenes. For rasterization and other tasks, it is often necessary to extract a mesh that represents the shape. Unbounded scenes with long-range camera trajectories, such as flying through a forest, pose a unique challenge for mesh extraction. A single static mesh representing all the geometric detail necessary for the full camera path can be prohibitively large. Therefore, independent meshes can be extracted for different camera views, but this approach may lead to popping artifacts during transitions. We propose a temporally coherent method for extracting meshes suitable for long-range camera trajectories in unbounded scenes represented by an occupancy function. 
The key idea is to perform 4D mesh extraction using a new spacetime tree structure called a binary-octree. Experiments show that, compared to existing baseline methods, our method offers superior visual consistency at a comparable cost. The code and the supplementary video for this paper are available at https://github.com/princeton-vl/BinocMesher. 2025-09-16T17:57:04Z Accepted as a Conference Paper to Siggraph Asia 2025. Updated related work and references to include Jang et al. (2022) Zeyu Ma Adam Finkelstein Jia Deng http://arxiv.org/abs/2510.07340v1 SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation 2025-10-07T18:01:55Z Personalized image generation aims to faithfully preserve a reference subject's identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples. 
2025-10-07T18:01:55Z Yongzhi Li Saining Zhang Yibing Chen Boying Li Yanxin Zhang Xiaoyu Du http://arxiv.org/abs/2503.14573v3 Submillimeter-Accurate 3D Lumbar Spine Reconstruction from Biplanar X-Ray Images: Incorporating a Multi-Task Network and Landmark-Weighted Loss 2025-10-07T08:53:36Z To meet the clinical demand for accurate 3D lumbar spine assessment in a weight-bearing position, this study presents a novel, fully automatic framework for high-precision 3D reconstruction from biplanar X-ray images, overcoming the limitations of existing methods. The core of this method involves a novel multi-task deep learning network that simultaneously performs lumbar decomposition and landmark detection on the original biplanar radiographs. The decomposition effectively eliminates interference from surrounding tissues, simplifying subsequent image registration, while the landmark detection provides an initial pose estimation for the Statistical Shape Model (SSM), enhancing the efficiency and robustness of the registration process. Building on this, we introduce a landmark-weighted 2D-3D registration strategy. By assigning higher weights to complex posterior structures like the transverse and spinous processes during optimization, this strategy significantly enhances the reconstruction accuracy of the posterior arch. Our method was validated against a gold standard derived from registering CT segmentations to the biplanar X-rays. It sets a new benchmark by achieving sub-millimeter accuracy and completes the full reconstruction and measurement workflow in under 20 seconds, establishing a state-of-the-art combination of precision and speed. This fast and low-dose pipeline provides a powerful automated tool for diagnosing lumbar conditions such as spondylolisthesis and scoliosis in their functional, weight-bearing state. 
2025-03-18T15:00:39Z 27 pages, 16 figures, 9 tables Wanxin Yu Zhemin Zhu Cong Wang Yihang Bao Chunjie Xia Rongshan Cheng Yan Yu Tsung-Yuan Tsai http://arxiv.org/abs/2510.05532v1 Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation 2025-10-07T02:44:57Z Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application specific and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (i.e., teammates). We employ a novel variation of Low-Rank Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore, Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis. 2025-10-07T02:44:57Z Sam Sartor Pieter Peers 10.1145/3757377.3763870 http://arxiv.org/abs/2510.05081v1 SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder 2025-10-06T17:51:04Z Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. 
Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains. 2025-10-06T17:51:04Z Project page at: https://ronen94.github.io/SAEdit/ Ronen Kamenetsky Sara Dorfman Daniel Garibi Roni Paiss Or Patashnik Daniel Cohen-Or http://arxiv.org/abs/2510.04999v1 Bridging Text and Video Generation: A Survey 2025-10-06T16:39:05Z Text-to-video (T2V) generation technology holds potential to transform multiple domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating coherent visual content from natural language prompts. From its inception, the field has advanced from adversarial models to diffusion-based models, yielding higher-fidelity, temporally consistent outputs. Yet challenges persist, such as alignment, long-range coherence, and computational efficiency. 
Addressing this evolving landscape, we present a comprehensive survey of text-to-video generative models, tracing their development from early GANs and VAEs to hybrid Diffusion-Transformer (DiT) architectures, detailing how these models work, what limitations they addressed in their predecessors, and why shifts toward new architectural paradigms were necessary to overcome challenges in quality, coherence, and control. We provide a systematic account of the datasets on which the surveyed text-to-video models were trained and evaluated, and, to support reproducibility and assess the accessibility of training such models, we detail their training configurations, including their hardware specifications, GPU counts, batch sizes, learning rates, optimizers, epochs, and other key hyperparameters. Further, we outline the evaluation metrics commonly used for evaluating such models and present their performance across standard benchmarks, while also discussing the limitations of these metrics and the emerging shift toward more holistic, perception-aligned evaluation strategies. Finally, drawing from our analysis, we outline the current open challenges and propose a few promising future directions, laying out a perspective for future researchers to explore and build upon in advancing T2V research and applications. 2025-10-06T16:39:05Z Nilay Kumar Priyansh Bhandari G. Maragatham http://arxiv.org/abs/2510.04637v1 Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents 2025-10-06T09:41:37Z We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. 
Additionally, we propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals. The output of the agentic system is translated into high-level guidance for the gesture generator, resulting in realistic movement at both the behavioral and motion levels. Furthermore, the agentic system periodically examines the movements of interlocutors and infers their intentions, forming a continuous feedback loop that enables dynamic and responsive interactions between the two participants. User studies and quantitative evaluations show that our model significantly improves the quality of dyadic interactions, producing natural, synchronized nonverbal behaviors. 2025-10-06T09:41:37Z SIGGRAPH ASIA 2025 (Conference Track); Project page: https://pku-mocca.github.io/Social-Agent-Page/ Zeyi Zhang Yanju Zhou Heyuan Yao Tenglong Ao Xiaohang Zhan Libin Liu 10.1145/3757377.3763879 http://arxiv.org/abs/2510.04536v1 3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG 2025-10-06T07:00:06Z This paper proposes "3Dify," a procedural 3D computer graphics (3D-CG) generation framework utilizing Large Language Models (LLMs). The framework enables users to generate 3D-CG content solely through natural language instructions. 3Dify is built upon Dify, an open-source platform for AI application development, and incorporates several state-of-the-art LLM-related technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG). For 3D-CG generation support, 3Dify automates the operation of various Digital Content Creation (DCC) tools via MCP. When DCC tools do not support MCP-based interaction, the framework employs the Computer-Using Agent (CUA) method to automate Graphical User Interface (GUI) operations. Moreover, to enhance image generation quality, 3Dify allows users to provide feedback by selecting preferred images from multiple candidates. 
The LLM then learns variable patterns from these selections and applies them to subsequent generations. Furthermore, 3Dify supports the integration of locally deployed LLMs, enabling users to utilize custom-developed models and to reduce both time and monetary costs associated with external API calls by leveraging their own computational resources. 2025-10-06T07:00:06Z Shun-ichiro Hayashi Daichi Mukunoki Tetsuya Hoshino Satoshi Ohshima Takahiro Katagiri http://arxiv.org/abs/2412.06702v2 CHOICE: Coordinated Human-Object Interaction in Cluttered Environments for Pick-and-Place Actions 2025-10-05T20:18:41Z Animating human-scene interactions such as pick-and-place tasks in cluttered, complex layouts is a challenging task, with objects of a wide variation of geometries and articulation under scenarios with various obstacles. The main difficulty lies in the sparsity of the motion data compared to the wide variation of the objects and environments as well as the poor availability of transition motions between different tasks, increasing the complexity of the generalization to arbitrary conditions. To cope with this issue, we develop a system that tackles the interaction synthesis problem as a hierarchical goal-driven task. Firstly, we develop a bimanual scheduler that plans a set of keyframes for simultaneously controlling the two hands to efficiently achieve the pick-and-place task from an abstract goal signal such as the target object selected by the user. Next, we develop a neural implicit planner that generates guidance hand trajectories under diverse object shape/types and obstacle layouts. 
Finally, we propose a linear dynamic model for our DeepPhase controller that incorporates a Kalman filter to enable smooth transitions in the frequency domain, resulting in a more realistic and effective multi-objective control of the character. Our system can produce a wide range of natural pick-and-place movements with respect to the geometry of objects, the articulation of containers and the layout of the objects in the scene. 2024-12-09T17:49:00Z ACM Transactions on Graphics 2025; 21 pages, 15 figures; Webpage: https://lujintaozju.github.io/publications/CHOICE/ Jintao Lu He Zhang Yuting Ye Takaaki Shiratori Sebastian Starke Taku Komura 10.1145/3770746 http://arxiv.org/abs/2506.18671v4 TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography 2025-10-05T08:08:58Z Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding and abrupt swapping in the generation of long group dance. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to encode temporal and identity information. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address the issue of single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. 
Furthermore, we integrate a Sequence Decoder layer to enhance the model's ability to selectively process long sequences. Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, ensuring high-quality and coherent group dance generation. 2025-06-23T14:15:20Z Yuqin Dai Wanlu Zhu Ronghui Li Xiu Li Zhenyu Zhang Jun Li Jian Yang http://arxiv.org/abs/2502.01045v2 WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction 2025-10-05T04:39:13Z In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at https://wyiguanw.github.io/WonderHuman/. 
2025-02-03T04:43:41Z IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 12, 2025 Zilong Wang Zhiyang Dou Yuan Liu Cheng Lin Xiao Dong Yunhui Guo Chenxu Zhang Xin Li Wenping Wang Xiaohu Guo 10.1109/TVCG.2025.3618268 http://arxiv.org/abs/2510.03964v1 Enhancing Foveated Rendering with Weighted Reservoir Sampling 2025-10-04T22:30:06Z Spatiotemporal sensitivity to high frequency information declines with increased peripheral eccentricity. Foveated rendering exploits this by decreasing the spatial resolution of rendered images in peripheral vision, reducing the rendering cost by omitting high frequency details. As foveation levels increase, the rendering quality is reduced, and traditional foveated rendering systems tend not to preserve samples that were previously rendered at high spatial resolution in previous frames. Additionally, prior research has shown that saccade landing positions are distributed around a target location rather than landing at a single point, and that even during fixations, eyes perform small microsaccades around a fixation point. This creates an opportunity for sampling from temporally neighbouring frames with differing foveal locations to reduce the required rendered size of the foveal region while achieving a higher perceived image quality. We further observe that the temporal presentation of pixels frame-to-frame can be viewed as a data stream, presenting a random sampling problem. Following this intuition, we propose a Weighted Reservoir Sampling technique to efficiently maintain a reservoir of the perceptually relevant high quality pixel samples from previous frames and incorporate them into the computation of the current frame. This allows the renderer to render a smaller region of foveal pixels per frame by temporally reusing pixel samples that are still relevant to reconstruct a higher perceived image quality, while allowing for higher levels of foveation. 
Our method operates on the output of foveated rendering, and runs in under 1 ms at 4K resolution, making it highly efficient and integrable with real-time VR and AR foveated rendering systems. 2025-10-04T22:30:06Z To appear in The 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG '25), December 03-05, 2025, Zurich, Switzerland Ville Cantory Darya Biparva Haoyu Tan Tongyu Nie John Schroeder Ruofei Du Victoria Interrante Piotr Didyk 10.1145/3769047.3769058 http://arxiv.org/abs/2510.03837v1 Joint Neural SDF Reconstruction and Semantic Segmentation for CAD Models 2025-10-04T15:29:36Z We propose a simple, data-efficient pipeline that augments an implicit reconstruction network based on neural SDF-based CAD parts with a part-segmentation head trained under PartField-generated supervision. Unlike methods tied to fixed taxonomies, our model accepts meshes with any number of parts and produces coherent, geometry-aligned labels in a single pass. We evaluate on randomly sampled CAD meshes from the ABC dataset with intentionally varied part cardinalities, including over-segmented shapes, and report strong performance across reconstruction (CDL1/CDL2, F1-micro, NC) and segmentation (mIoU, Accuracy), together with a new Segmentation Consistency metric that captures local label smoothness. We attach a lightweight segmentation head to the Flat-CAD SDF trunk; on a paired evaluation it does not alter reconstruction while providing accurate part labels for meshes with any number of parts. Even under degraded reconstructions on thin or intricate geometries, segmentation remains accurate and label-coherent, often preserving the correct part count. Our approach therefore offers a practical route to semantically structured CAD meshes without requiring curated taxonomies or exact palette matches. We discuss limitations in boundary precision, partly due to per-face supervision, and outline paths toward boundary-aware training and higher resolution labels. 
2025-10-04T15:29:36Z Shen Fan Przemyslaw Musialski http://arxiv.org/abs/2505.04961v2 Physics-Based Motion Imitation with Adversarial Differential Discriminators 2025-10-04T08:33:02Z Multi-objective optimization problems, which require the simultaneous optimization of multiple objectives, are prevalent across numerous applications. Existing multi-objective optimization methods often rely on manually-tuned aggregation functions to formulate a joint optimization objective. The performance of such hand-tuned methods is heavily dependent on careful weight selection, a time-consuming and laborious process. These limitations also arise in the setting of reinforcement-learning-based motion tracking methods for physically simulated characters, where intricately crafted reward functions are typically used to achieve high-fidelity results. Such solutions not only require domain expertise and significant manual tuning, but also limit the applicability of the resulting reward function across diverse skills. To bridge this gap, we present a novel adversarial multi-objective optimization technique that is broadly applicable to a range of multi-objective reinforcement-learning tasks, including motion tracking. Our proposed Adversarial Differential Discriminator (ADD) receives a single positive sample, yet is still effective at guiding the optimization process. We demonstrate that our technique can enable characters to closely replicate a variety of acrobatic and agile behaviors, achieving comparable quality to state-of-the-art motion-tracking methods, without relying on manually-designed reward functions. Code and results are available at https://add-moo.github.io/. 2025-05-08T05:42:33Z SIGGRAPH Asia 2025 Conference Papers Ziyu Zhang Sergey Bashkirov Dun Yang Yi Shi Michael Taylor Xue Bin Peng 10.1145/3757377.3763819