http://arxiv.org/abs/2407.15842v4 DiffArtist: Towards Structure and Appearance Controllable Image Stylization 2025-08-27T10:30:27Z

Artistic styles are defined by both their structural and appearance elements. Existing neural stylization techniques primarily focus on transferring appearance-level features such as color and texture, often neglecting the equally crucial aspect of structural stylization. To address this gap, we introduce \textbf{DiffArtist}, the first 2D stylization method to offer fine-grained, simultaneous control over both structure and appearance style strength. This dual controllability is achieved by representing structure and appearance generation as separate diffusion processes, necessitating no further tuning or additional adapters. To properly evaluate this new capability of dual stylization, we further propose a Multimodal LLM-based stylization evaluator that aligns significantly better with human preferences than existing metrics. Extensive analysis shows that DiffArtist achieves superior style fidelity and dual-controllability compared to state-of-the-art methods. Its text-driven, training-free design and unprecedented dual controllability make it a powerful and interactive tool for various creative applications. Project homepage: https://diffusionartist.github.io.

2024-07-22T17:58:05Z Accepted to ACM MM 2025, Homepage: https://DiffusionArtist.github.io Ruixiang Jiang Changwen Chen 10.1145/3746027.3755010 http://arxiv.org/abs/2508.16439v3 PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark 2025-08-27T08:33:44Z

Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.

2025-08-22T14:50:55Z Adil Bahaj Oumaima Fadi Mohamed Chetouani Mounir Ghogho http://arxiv.org/abs/2508.19518v1 Fast Texture Transfer for XR Avatars via Barycentric UV Conversion 2025-08-27T02:14:18Z

We present a fast and efficient method for transferring facial textures onto SMPL-X-based full-body avatars. Unlike conventional affine-transform methods that are slow and prone to visual artifacts, our method utilizes a barycentric UV conversion technique. Our approach precomputes the entire UV mapping into a single transformation matrix, enabling texture transfer in a single operation. This results in a speedup of over 7000x compared to the baseline, while also significantly improving the final texture quality by eliminating boundary artifacts. Through quantitative and qualitative evaluations, we demonstrate that our method offers a practical solution for personalization in immersive XR applications. The code is available online.
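The barycentric conversion mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two toy triangles, and the list-based lookup (standing in for the single precomputed transformation matrix) are all hypothetical.

```python
# Illustrative sketch only: names and toy triangles are hypothetical and are
# not taken from the paper's code.

def barycentric(p, a, b, c):
    """Return barycentric weights (wa, wb, wc) of 2D point p in triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def convert_uv(p, tri_from, tri_to):
    """Map p from one UV triangle to another by reusing its barycentric weights."""
    wa, wb, wc = barycentric(p, *tri_from)
    (ax, ay), (bx, by), (cx, cy) = tri_to
    return (wa * ax + wb * bx + wc * cx, wa * ay + wb * by + wc * cy)


def precompute_lookup(points, tri_from, tri_to):
    """Convert every sample point once, so later transfers are a plain lookup."""
    return [convert_uv(p, tri_from, tri_to) for p in points]


# Hypothetical UV triangles: one in the face layout, one in the body layout.
face_tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
body_tri = ((0.0, 0.0), (2.0, 0.0), (0.0, 2.0))
lookup = precompute_lookup([(0.25, 0.25), (1.0, 0.0)], face_tri, body_tri)
```

Because the weights depend only on the two UV layouts and not on the texture contents, the conversion can be computed once and reused for every new face texture, which is presumably where the reported speedup comes from.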

2025-08-27T02:14:18Z Hail Song Seokhwan Yang Woontack Woo http://arxiv.org/abs/2508.19204v1 LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding 2025-08-26T17:04:49Z

Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control -- they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, but at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts -- producing realistic and geometrically consistent 3D generations of complex driving scenes.

2025-08-26T17:04:49Z Project webpage: https://light.princeton.edu/LSD-3D Julian Ost Andrea Ramazzina Amogh Joshi Maximilian Bömer Mario Bijelic Felix Heide http://arxiv.org/abs/2508.19140v1 A Bag of Tricks for Efficient Implicit Neural Point Clouds 2025-08-26T15:49:10Z

Implicit Neural Point Cloud (INPC) is a recent hybrid representation that combines the expressiveness of neural fields with the efficiency of point-based rendering, achieving state-of-the-art image quality in novel view synthesis. However, as with other high-quality approaches that query neural networks during rendering, the practical usability of INPC is limited by comparatively slow rendering. In this work, we present a collection of optimizations that significantly improve both the training and inference performance of INPC without sacrificing visual fidelity. The most significant modifications are an improved rasterizer implementation, more effective sampling techniques, and the incorporation of pre-training for the convolutional neural network used for hole-filling. Furthermore, we demonstrate that points can be modeled as small Gaussians during inference to further improve quality in extrapolated, e.g., close-up views of the scene. We design our implementations to be broadly applicable beyond INPC and systematically evaluate each modification in a series of experiments. Our optimized INPC pipeline achieves up to 25% faster training, 2x faster rendering, and 20% reduced VRAM usage paired with slight image quality improvements.

2025-08-26T15:49:10Z Project page: https://fhahlbohm.github.io/inpc_v2/ Florian Hahlbohm Linus Franke Leon Overkämping Paula Wespe Susana Castillo Martin Eisemann Marcus Magnor http://arxiv.org/abs/2505.04831v2 Steerable Scene Generation with Post Training and Inference-Time Search 2025-08-26T14:49:48Z

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/

2025-05-07T22:07:42Z Project website: https://steerable-scene-generation.github.io/ Nicholas Pfaff Hongkai Dai Sergey Zakharov Shun Iwase Russ Tedrake http://arxiv.org/abs/2508.19323v1 A Technical Review on Comparison and Estimation of Steganographic Tools 2025-08-26T14:36:50Z

Steganography is the technique of hiding data within cover media using various steganography tools. Image steganography hides data (text, image, audio, or video) within a cover image. This review paper presents a classification of image steganography and a comparison of various image steganography tools across different image formats. It analyzes numerous tools on the basis of image features in order to identify the best one. Several tools available on the market were selected based on how frequently they are used, and all were tested with the same input. A specific text was embedded in every host image for each of the six steganography tools selected. The experimental results reveal that all six tools performed at roughly the same level, though some were more efficient than others. The comparison was based on image features such as size, dimensions, pixel values, and histogram differences.

2025-08-26T14:36:50Z 20 Ms. Preeti P. Bhatt Rakesh R. Savant http://arxiv.org/abs/2508.18944v1 PanoHair: Detailed Hair Strand Synthesis on Volumetric Heads 2025-08-26T11:36:14Z

Achieving realistic hair strand synthesis is essential for creating lifelike digital humans, but producing high-fidelity hair strand geometry remains a significant challenge. Existing methods require a complex setup for data acquisition, involving multi-view images captured in constrained studio environments. Additionally, these methods have longer hair volume estimation and strand synthesis times, which hinder efficiency. We introduce PanoHair, a model that estimates head geometry as signed distance fields using knowledge distillation from a pre-trained generative teacher model for head synthesis. Our approach enables the prediction of semantic segmentation masks and 3D orientations specifically for the hair region of the estimated geometry. Our method is generative and can generate diverse hairstyles with latent space manipulations. For real images, our approach involves an inversion process to infer latent codes and produces visually appealing hair strands, offering a streamlined alternative to complex multi-view data acquisition setups. Given the latent code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, along with semantic and orientation maps, marking a significant improvement over existing methods, as demonstrated in our experiments.

2025-08-26T11:36:14Z Shashikant Verma Shanmuganathan Raman http://arxiv.org/abs/2411.17513v2 Human Vision Constrained Super-Resolution 2025-08-26T09:27:46Z

Modern deep-learning super-resolution (SR) techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system (HVS) to image details changes depending on the underlying image characteristics, such as spatial frequency, luminance, color, contrast, or motion, as well as on viewing conditions such as ambient lighting and distance to the display. This observation suggests that computational resources spent on up-sampling images/videos may be wasted whenever a viewer cannot resolve the synthesized details, i.e., the resolution of details exceeds the resolving capability of human vision. Motivated by this observation, we propose a human vision inspired and architecture-agnostic approach for controlling SR techniques to deliver visually optimal results while limiting computational complexity. Its core is an explicit Human Visual Processing Framework (HVPF) that dynamically and locally guides SR methods according to human sensitivity to specific image details and viewing conditions. We demonstrate the application of our framework in combination with network branching to improve the computational efficiency of SR methods. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPS by factors of 2$\times$ and greater, without sacrificing perceived quality.
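The dependence on viewing distance can be made concrete with a back-of-the-envelope calculation. The sketch below is not the paper's HVPF; it only computes the angular pixel density of a display and compares it with an assumed acuity limit. The function names and the 60 pixels-per-degree default (a commonly quoted figure for roughly one arcminute per pixel) are assumptions made for illustration.

```python
import math


def pixels_per_degree(pixel_pitch_m, viewing_distance_m):
    """Angular pixel density: how many pixels fit into one degree of visual angle."""
    # Visual angle subtended by a single pixel, in degrees.
    angle_deg = math.degrees(2.0 * math.atan(pixel_pitch_m / (2.0 * viewing_distance_m)))
    return 1.0 / angle_deg


def exceeds_acuity(ppd, acuity_limit_ppd=60.0):
    """True if the display already offers more detail than the assumed acuity limit."""
    # The 60 ppd default is an assumed threshold, not a value from the paper.
    return ppd > acuity_limit_ppd


# Hypothetical display: 0.25 mm pixel pitch viewed from 0.5 m and from 2 m.
near = pixels_per_degree(0.25e-3, 0.5)
far = pixels_per_degree(0.25e-3, 2.0)
```

Moving the viewer farther away raises the pixels-per-degree figure, so extra synthesized detail is more likely to go unnoticed. A full model would also account for contrast, luminance, and motion, as the abstract notes.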

2024-11-26T15:24:45Z Volodymyr Karpenko Taimoor Tariq Jorge Condor Piotr Didyk 10.1109/ICCVW69036.2025.00498 http://arxiv.org/abs/2508.18525v1 Controllable Single-shot Animation Blending with Temporal Conditioning 2025-08-25T21:55:16Z

Training a generative model on a single human skeletal motion sequence without being bound to a specific kinematic tree has drawn significant attention from the animation community. Unlike text-to-motion generation, single-shot models allow animators to controllably generate variations of existing motion patterns without requiring additional data or extensive retraining. However, existing single-shot methods do not explicitly offer a controllable framework for blending two or more motions within a single generative pass. In this paper, we present the first single-shot motion blending framework that enables seamless blending by temporally conditioning the generation process. Our method introduces a skeleton-aware normalization mechanism to guide the transition between motions, allowing smooth, data-driven control over when and how motions blend. We perform extensive quantitative and qualitative evaluations across various animation styles and different kinematic skeletons, demonstrating that our approach produces plausible, smooth, and controllable motion blends in a unified and efficient manner.

2025-08-25T21:55:16Z Accepted to the AI for Visual Arts Workshop at ICCV 2025 Eleni Tselepi Spyridon Thermos Gerasimos Potamianos http://arxiv.org/abs/2508.18481v1 Impact of Target and Tool Visualization on Depth Perception and Usability in Optical See-Through AR 2025-08-25T20:45:00Z

Optical see-through augmented reality (OST-AR) systems like Microsoft HoloLens 2 hold promise for arm's distance guidance (e.g., surgery), but depth perception of the hologram and occlusion of real instruments remain challenging. We present an evaluation of how visualizing the target object with different transparencies and visualizing a tracked tool (virtual proxy vs. real tool vs. no tool tracking) affects depth perception and system usability. Ten participants performed two experiments on HoloLens 2. In Experiment 1, we compared high-transparency vs. low-transparency target rendering in a depth matching task at arm's length. In Experiment 2, participants performed a simulated surgical pinpoint task on a frontal bone target under six visualization conditions ($2 \times 3$: two target transparencies and three tool visualization modes: virtual tool hologram, real tool, or no tool tracking). We collected data on depth matching error, target localization error, system usability, task workload, and qualitative feedback. Results show that a more opaque target yields significantly lower depth estimation error than a highly transparent target at arm's distance. Moreover, showing the real tool (occluding the virtual target) led to the highest accuracy and usability with the lowest workload, while not tracking the tool yielded the worst performance and user ratings. However, making the target highly transparent, while allowing the real tool to remain visible, slightly impaired depth cues and did not improve usability. Our findings underscore that correct occlusion cues, rendering virtual content opaque and occluding it with real tools in real time, are critical for depth perception and precision in OST-AR. Designers of arm-distance AR systems should prioritize robust tool tracking and occlusion handling; if unavailable, cautiously use transparency to balance depth perception and tool visibility.

2025-08-25T20:45:00Z Yue Yang Xue Xie Xinkai Wang Hui Zhang Chiming Yu Xiaoxian Xiong Lifeng Zhu Yuanyi Zheng Jue Cen Bruce Daniel Fred Baik http://arxiv.org/abs/2508.17620v1 Enhancing Reference-based Sketch Colorization via Separating Reference Representations 2025-08-25T02:56:49Z

Reference-based sketch colorization methods have garnered significant attention for their potential application in animation and digital illustration production. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially similar, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in artifacts and significant quality degradation in colorization results. To address this issue, we conduct an in-depth analysis of the reference representations, defined as the intermedium to transfer information from reference to sketch. Building on this analysis, we introduce a novel framework that leverages distinct reference representations to optimize different aspects of the colorization process. Our approach decomposes colorization into modular stages, allowing region-specific reference injection to enhance visual quality and reference similarity while mitigating spatial artifacts. Specifically, we first train a backbone network guided by high-level semantic embeddings. We then introduce a background encoder and a style encoder, trained in separate stages, to enhance low-level feature transfer and improve reference similarity. This design also enables flexible inference modes suited for a variety of use cases. Extensive qualitative and quantitative evaluations, together with a user study, demonstrate the superior performance of our proposed method compared to existing approaches. Code and pre-trained weights will be made publicly available upon paper acceptance.

2025-08-25T02:56:49Z Dingkun Yan Xinrui Wang Zhuoru Li Suguru Saito Yusuke Iwasawa Yutaka Matsuo Jiaxian Guo http://arxiv.org/abs/2506.16627v2 FlatCAD: Fast Curvature Regularization of Neural SDFs for CAD Models 2025-08-24T19:14:03Z

Neural signed-distance fields (SDFs) are a versatile backbone for neural geometry representation, but enforcing CAD-style developability usually requires Gaussian-curvature penalties with full Hessian evaluation and second-order differentiation, which are costly in memory and time. We introduce an off-diagonal Weingarten loss that regularizes only the mixed shape operator term that represents the gap between principal curvatures and flattens the surface. We present two variants: a finite-difference version using six SDF evaluations plus one gradient, and an auto-diff version using a single Hessian-vector product. Both converge to the exact mixed term and preserve the intended geometric properties without assembling the full Hessian. On the ABC benchmarks the losses match or exceed Hessian-based baselines while cutting GPU memory and training time by roughly a factor of two. The method is drop-in and framework-agnostic, enabling scalable curvature-aware SDF learning for engineering-grade shape reconstruction. Our code is available at https://flatcad.github.io/.
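To illustrate what a mixed second-derivative term looks like without assembling a Hessian, the sketch below estimates u^T H v for an analytic cylinder SDF with a textbook four-point central difference. It is not the paper's six-evaluation stencil or its loss; the function names, the cylinder, and the step size are assumptions chosen for illustration.

```python
import math


def cylinder_sdf(p, radius=2.0):
    """Signed distance to an infinite cylinder around the z-axis (a developable surface)."""
    x, y, _ = p
    return math.hypot(x, y) - radius


def mixed_second_derivative(f, x, u, v, h=1e-3):
    """Estimate u^T H v, with H the Hessian of f at x, from four evaluations of f."""
    def shifted(su, sv):
        # Point x + h * (su * u + sv * v).
        return tuple(xi + h * (su * ui + sv * vi) for xi, ui, vi in zip(x, u, v))

    return (f(shifted(1, 1)) - f(shifted(1, -1))
            - f(shifted(-1, 1)) + f(shifted(-1, -1))) / (4.0 * h * h)


surface_point = (2.0, 0.0, 0.0)  # lies on the cylinder of radius 2

# Tangent frame aligned with the principal directions (around and along the axis).
aligned = mixed_second_derivative(cylinder_sdf, surface_point,
                                  (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# The same frame rotated by 45 degrees within the tangent plane.
s = 1.0 / math.sqrt(2.0)
rotated = mixed_second_derivative(cylinder_sdf, surface_point,
                                  (0.0, s, s), (0.0, s, -s))
```

In the aligned frame the mixed term vanishes. In the rotated frame it comes out close to half the difference of the principal curvatures, (1/2 - 0)/2 = 0.25 for this radius, which is consistent with the abstract's description of the term as representing the gap between principal curvatures.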

2025-06-19T21:54:08Z Computer Graphics Forum, Proceedings of Pacific Graphics 2025, 12 pages, 10 figures, preprint Computer Graphics Forum, Volume 44 (2025), Number 7 Haotian Yin Aleksander Plocharski Michal Jan Wlodarczyk Mikolaj Kida Przemyslaw Musialski 10.1111/cgf.70249 http://arxiv.org/abs/2508.17386v1 Wave Tracing: Generalizing The Path Integral To Wave Optics 2025-08-24T14:46:44Z

Modeling the wave nature of light and the propagation and diffraction of electromagnetic fields is crucial for the accurate simulation of many phenomena, yet wave simulations are significantly more computationally complex than classical ray-based models. In this work, we start by analyzing the classical path integral formulation of light transport and rigorously study which wave-optical phenomena can be reproduced by it. We then introduce a bilinear path integral generalization for wave-optical light transport that models the wave interference between paths. This formulation subsumes many existing methods that rely on shooting-bouncing rays or UTD-based diffractions, and serves to give insight into the challenges of such approaches and the difficulty of sampling good paths in a bilinear setting. With this foundation, we develop a weakly-local path integral based on region-to-region transport using elliptical cones that allows sampling individual paths that still model wave effects accurately. As with the classic path integral form of the light transport equation, our path integral makes it possible to derive a variety of practical transport algorithms. We present a complete system for wave tracing with elliptical cones, with applications in light transport for rendering and efficient simulation of long-wavelength radiation propagation and diffraction in complex environments.
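A toy example shows why interference makes the measured intensity bilinear in the paths. The sketch below is a generic scalar-wave illustration, not the paper's formulation: each path carries a complex amplitude with phase 2π·length/wavelength, and all function names and path lengths are hypothetical.

```python
import cmath
import math


def path_amplitude(length, wavelength, magnitude=1.0):
    """Complex amplitude of a path: the phase advances by 2*pi per wavelength travelled."""
    return magnitude * cmath.exp(2j * math.pi * length / wavelength)


def coherent_intensity(amplitudes):
    """Wave-optical intensity: add the amplitudes first, then square the magnitude."""
    return abs(sum(amplitudes)) ** 2


def bilinear_intensity(amplitudes):
    """The same quantity written as a double sum over pairs of paths."""
    return sum((a * b.conjugate()).real for a in amplitudes for b in amplitudes)


def incoherent_intensity(amplitudes):
    """Ray-style intensity: each path contributes on its own, with no cross terms."""
    return sum(abs(a) ** 2 for a in amplitudes)


wavelength = 1.0
in_phase = [path_amplitude(1.0, wavelength), path_amplitude(2.0, wavelength)]
out_of_phase = [path_amplitude(1.0, wavelength), path_amplitude(1.5, wavelength)]
```

For two unit-magnitude paths, the incoherent sum is 2 in both cases, while the coherent result is about 4 when the paths are in phase and about 0 when they differ by half a wavelength. The cross terms between distinct paths in `bilinear_intensity` are what a purely per-path estimator misses.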

2025-08-24T14:46:44Z Shlomi Steinberg Matt Pharr http://arxiv.org/abs/2508.17342v1 DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions 2025-08-24T12:53:09Z

Generating coherent and diverse human dances from music signals has made tremendous progress in animating virtual avatars. While existing methods support direct dance synthesis, they fail to recognize that enabling users to edit dance movements is far more practical in real-world choreography scenarios. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits progress on this challenge. To address it, we first construct DanceRemix, a large-scale multi-turn editable dance dataset comprising the prompt featuring over 25.3M dance frames and 84.5K pairs. In addition, we propose a novel framework for iterative and editable dance generation coherently aligned with given music signals, namely DanceEditor. Considering that dance motion should both follow the musical rhythm and support iterative editing through user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the authority of generated results by directly modeling dance movements from tailored, aligned music. Moreover, at the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to draw the editable results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. Thereby, the results display music harmonics while preserving fine-grained semantic alignment with text descriptions. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected DanceRemix dataset. Code is available at https://lzvsdy.github.io/DanceEditor/.

2025-08-24T12:53:09Z ICCV 2025 Hengyuan Zhang Zhe Li Xingqun Qi Mengze Li Muyi Sun Man Zhang Sirui Han