Beyond Blur: A Fluid Perspective on Generative Diffusion Models
http://arxiv.org/abs/2506.16827v2
Updated: 2025-08-23T14:37:03Z

We propose a novel PDE-driven corruption process for generative image synthesis based on advection-diffusion processes, which generalizes existing PDE-based approaches. Our forward pass formulates image corruption via a physically motivated PDE that couples directional advection with isotropic diffusion and Gaussian noise, controlled by dimensionless numbers (Péclet, Fourier). We implement this PDE numerically through a GPU-accelerated custom Lattice Boltzmann solver for fast evaluation. To induce realistic turbulence, we generate stochastic velocity fields that introduce coherent motion and capture multi-scale mixing. In the generative process, a neural network learns to reverse the advection-diffusion operator, thus constituting a novel generative model. We discuss how previous methods emerge as specific cases of our operator, demonstrating that our framework generalizes prior PDE-based corruption techniques. We illustrate how advection improves the diversity and quality of the generated images while keeping the overall color palette unaffected.
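The paragraph above describes the corruption operator only in words. What follows is a minimal sketch of the idea, not the authors' implementation. It assumes the nondimensional form du/dt + Pe * (v . grad u) = laplacian(u) on a unit periodic square, integrated up to dimensionless time t = Fo, followed by additive Gaussian noise. It swaps in a plain explicit finite-difference scheme for the paper's GPU Lattice Boltzmann solver and a single steady cellular flow for its stochastic multi-scale velocity fields. The function names velocity_field, advect_diffuse and corrupt are hypothetical.

```python
import math
import random


def velocity_field(n):
    """Hypothetical stand-in for the paper's stochastic velocity fields.

    A steady, divergence-free cellular flow on the unit periodic square, built
    from the stream function psi = sin(2*pi*x) * sin(2*pi*y) / (2*pi) with
    vx = d(psi)/dy and vy = -d(psi)/dx, so |vx| + |vy| never exceeds 1.
    """
    vx = [[0.0] * n for _ in range(n)]
    vy = [[0.0] * n for _ in range(n)]
    for i in range(n):
        x = 2 * math.pi * (i + 0.5) / n
        for j in range(n):
            y = 2 * math.pi * (j + 0.5) / n
            vx[i][j] = math.sin(x) * math.cos(y)
            vy[i][j] = -math.cos(x) * math.sin(y)
    return vx, vy


def advect_diffuse(u, vx, vy, pe, fo, safety=0.9):
    """Integrate du/dt + pe * (v . grad u) = laplacian(u) from t = 0 to t = fo.

    Assumed scheme (not the paper's Lattice Boltzmann solver): explicit Euler,
    periodic boundaries, central-difference diffusion, first-order upwind
    advection. The step size keeps every update a convex combination of
    neighbouring values, so the result stays within [min(u), max(u)].
    """
    n = len(u)
    dx = 1.0 / n
    inv_dx2 = 1.0 / dx**2
    vmax = max(abs(a) + abs(b) for ra, rb in zip(vx, vy) for a, b in zip(ra, rb))
    # dt <= safety / rate keeps the weight on the centre cell positive.
    rate = 4.0 * inv_dx2 + pe * vmax / dx
    steps = max(1, math.ceil(fo * rate / safety))
    dt = fo / steps
    u = [row[:] for row in u]
    for _ in range(steps):
        new = [[0.0] * n for _ in range(n)]
        for i in range(n):
            ip, im = (i + 1) % n, (i - 1) % n
            for j in range(n):
                jp, jm = (j + 1) % n, (j - 1) % n
                c = u[i][j]
                lap = (u[ip][j] + u[im][j] + u[i][jp] + u[i][jm] - 4.0 * c) * inv_dx2
                ax, ay = pe * vx[i][j], pe * vy[i][j]
                # Upwind differences: use the neighbour the flow is coming from.
                ddx = (c - u[im][j]) / dx if ax > 0 else (u[ip][j] - c) / dx
                ddy = (c - u[i][jm]) / dx if ay > 0 else (u[i][jp] - c) / dx
                new[i][j] = c + dt * (lap - ax * ddx - ay * ddy)
        u = new
    return u


def corrupt(image, pe, fo, sigma, seed=0):
    """Hypothetical forward corruption: advection-diffusion, then Gaussian noise."""
    vx, vy = velocity_field(len(image))
    smoothed = advect_diffuse(image, vx, vy, pe, fo)
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in smoothed]
```

For example, corrupt(img, pe=10.0, fo=0.01, sigma=0.05) on a 16x16 grayscale grid advects, smooths, and then perturbs the image. With pe=0.0 the operator reduces to pure isotropic diffusion (a heat-equation blur), while larger pe values let the velocity field transport intensity before it is smoothed. The step-size rule keeps results within the input's value range, at the cost of many small steps when Pe or the grid resolution is large.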
This work bridges fluid dynamics, dimensionless PDE theory, and deep generative modeling, offering a fresh perspective on physically informed image corruption processes for diffusion-based synthesis.

Published: 2025-06-20T08:31:30Z
Comment: ICCV 2025 main conference, 8 pages paper, 20 pages appendix, 24 figures, supplementary pseudocode in appendix, https://iccv.thecvf.com/virtual/2025/poster/1176
Authors: Grzegorz Gruszczynski, Jakub Meixner, Michal Jan Wlodarczyk, Przemyslaw Musialski
DOI: 10.1109/ICCV51701.2025.01655

Curve-based slicer for multi-axis DLP 3D printing
http://arxiv.org/abs/2509.00040v1
Updated: 2025-08-23T13:06:29Z

This paper introduces a novel curve-based slicing method for generating planar layers with dynamically varying orientations in digital light processing (DLP) 3D printing. Our approach effectively addresses key challenges in DLP printing, such as regions with large overhangs and staircase artifacts, while preserving its intrinsic advantages of high resolution and fast printing speeds. We formulate the slicing problem as an optimization task, in which parametric curves are computed to define both the slicing layers and the model partitioning through their tangent planes. These curves inherently define motion trajectories for the build platform and can be optimized to meet critical manufacturing objectives, including collision-free motion and floating-free deposition. We validate our method through physical experiments on a robotic multi-axis DLP printing setup, demonstrating that the optimized curves can robustly guide smooth, high-quality fabrication of complex geometries.

Published: 2025-08-23T13:06:29Z
Authors: Chengkai Dai, Tao Liu, Dezhao Guo, Binzhi Sun, Guoxin Fang, Yeung Yam, Charlie C. L. Wang

A Survey of Deep Learning-based Point Cloud Denoising
http://arxiv.org/abs/2508.17011v1
Updated: 2025-08-23T12:53:24Z

Accurate 3D geometry acquisition is essential for a wide range of applications, such as computer graphics, autonomous driving, robotics, and augmented reality.
However, raw point clouds acquired in real-world environments are often corrupted by noise arising from factors such as the sensor, lighting, materials, and environment, which reduces geometric fidelity and degrades downstream performance. Point cloud denoising is a fundamental problem, aiming to recover clean point sets while preserving underlying structures. Classical optimization-based methods, guided by hand-crafted filters or geometric priors, have been extensively studied but struggle to handle diverse and complex noise patterns. Recent deep learning approaches leverage neural network architectures to learn distinctive representations and demonstrate strong outcomes, particularly on complex and large-scale point clouds. Given these significant advances, this survey provides a comprehensive and up-to-date review of deep learning-based point cloud denoising methods up to August 2025. We organize the literature from two perspectives: (1) supervision level (supervised vs. unsupervised), and (2) modeling perspective, proposing a functional taxonomy that unifies diverse approaches by their denoising principles. We further analyze architectural trends both structurally and chronologically, establish a unified benchmark with consistent training settings, and evaluate methods in terms of denoising quality, surface fidelity, point distribution, and computational efficiency. Finally, we discuss open challenges and outline directions for future research in this rapidly evolving field.

Published: 2025-08-23T12:53:24Z
Authors: Jinxi Wang, Ben Fei, Dasith de Silva Edirimuni, Zheng Liu, Ying He, Xuequan Lu

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
http://arxiv.org/abs/2508.16911v1
Updated: 2025-08-23T05:56:37Z

We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation.
Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and annotated with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where, given music and a textual prompt, both the leader's and the follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where, given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.

Published: 2025-08-23T05:56:37Z
Comment: Accepted at ICCV 2025. Project page: https://gprerit96.github.io/mdd-page
Authors: Prerit Gupta, Jason Alexander Fotso-Puepi, Zhengyuan Li, Jay Mehta, Aniket Bera (all Purdue University, West Lafayette, IN, USA)

LayoutRectifier: An Optimization-based Post-processing for Graphic Design Layout Generation
http://arxiv.org/abs/2508.11177v2
Updated: 2025-08-23T04:28:44Z

Recent deep learning methods can generate diverse graphic design layouts efficiently. However, these methods often create layouts with flaws, such as misalignment, unwanted overlaps, and unsatisfied containment. To tackle this issue, we propose an optimization-based method called LayoutRectifier, which gracefully rectifies auto-generated graphic design layouts to reduce these flaws while minimizing deviation from the generated layout. The core of our method is a two-stage optimization. First, we utilize grid systems, which professional designers commonly use to organize elements, to mitigate misalignments through discrete search.
Second, we introduce a novel box containment function designed to adjust the positions and sizes of the layout elements, preventing unwanted overlapping and promoting desired containment. We evaluate our method on content-agnostic and content-aware layout generation tasks and achieve better-quality layouts that are more suitable for downstream graphic design tasks. Our method complements learning-based layout generation methods and does not require additional training.

Published: 2025-08-15T03:06:56Z
Comment: 11 pages, Pacific Graphics 2025, https://jdily.github.io/layoutrectifier.github.io/
Authors: I-Chao Shen, Ariel Shamir, Takeo Igarashi

A Workflow for Map Creation in Autonomous Vehicle Simulations
http://arxiv.org/abs/2508.16856v1
Updated: 2025-08-23T00:58:09Z

The fast development of technology and artificial intelligence has significantly advanced Autonomous Vehicle (AV) research, emphasizing the need for extensive simulation testing. Accurate and adaptable maps are critical in AV development, serving as the foundation for localization, path planning, and scenario testing. However, creating simulation-ready maps is often difficult and resource-intensive, especially with simulators like CARLA (CAR Learning to Act). Many existing workflows require significant computational resources or rely on specific simulators, limiting flexibility for developers. This paper presents a custom workflow to streamline map creation for AV development, demonstrated through the generation of a 3D map of a parking lot at Ontario Tech University. Future work will focus on incorporating SLAM technologies, optimizing the workflow for broader simulator compatibility, and exploring more flexible handling of latitude and longitude values to enhance map generation accuracy.

Published: 2025-08-23T00:58:09Z
Comment: 6 pages, 12 figures.
Published in the Proceedings of GEOProcessing 2025: The Seventeenth International Conference on Advanced Geographic Information Systems, Applications, and Services (IARIA)
Journal reference: GEOProcessing 2025 (2025) 56-61
Authors: Zubair Islam, Ahmaad Ansari, George Daoud, Mohamed El-Darieby

Real-time 3D Light-field Viewing with Eye-tracking on Conventional Displays
http://arxiv.org/abs/2508.16535v1
Updated: 2025-08-22T16:56:47Z

Creating immersive 3D visual experiences typically requires expensive and specialized hardware such as VR headsets, autostereoscopic displays, or active shutter glasses. These constraints limit the accessibility and everyday use of 3D visualization technologies in resource-constrained settings. To address this, we propose a low-cost system that enables real-time 3D light-field viewing using only a standard 2D monitor, a conventional RGB webcam, and red-cyan anaglyph glasses. The system combines real-time eye-tracking, which dynamically adapts the displayed light-field image to the user's head position, with a lightweight rendering pipeline that selects and composites stereoscopic views from pre-captured light-field data. The resulting anaglyph image is updated in real time, creating a more immersive and responsive 3D experience. The system operates entirely on the CPU and maintains a stable frame rate of 30 FPS, confirming its feasibility on typical consumer-grade hardware. Together, these results highlight the potential of our approach as an accessible platform for interactive 3D applications in education, digital media, and beyond.

Published: 2025-08-22T16:56:47Z
Authors: Trung Hieu Pham, Chanh Minh Tran, Eiji Kamioka, Xuan Tan Phan

MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
http://arxiv.org/abs/2508.14879v2
Updated: 2025-08-22T16:12:04Z

Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing.
However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding. The project homepage is available at https://daibingquan.github.io/MeshCoder.

Published: 2025-08-20T17:50:15Z
Authors: Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian, Hao Xu, Minghan Qin, Xudong Xu, Bo Dai, Haoqian Wang, Zhaoyang Lyu, Jiangmiao Pang

Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars
http://arxiv.org/abs/2508.16401v1
Updated: 2025-08-22T14:02:24Z

Audio-driven facial animation presents an effective solution for animating digital avatars. In this paper, we detail the technical aspects of NVIDIA Audio2Face-3D, including data acquisition, network architecture, retargeting methodology, evaluation metrics, and use cases. The Audio2Face-3D system enables real-time interaction between human users and interactive avatars, facilitating facial animation authoring for game characters.
To assist digital avatar creators and game developers in generating realistic facial animations, we have open-sourced the Audio2Face-3D networks, SDK, training framework, and example dataset.

Published: 2025-08-22T14:02:24Z
Authors: NVIDIA: Chaeyeon Chung, Ilya Fedorov, Michael Huang, Aleksey Karmanov, Dmitry Korobchenko, Roger Ribera, Yeongho Seol

AutoSketch: VLM-assisted Style-Aware Vector Sketch Completion
http://arxiv.org/abs/2502.06860v3
Updated: 2025-08-22T06:58:44Z

The ability to automatically complete a partial sketch that depicts a complex scene, e.g., "a woman chatting with a man in the park", is very useful. However, existing sketch generation methods create sketches from scratch; they do not complete a partial sketch in the style of the original. To address this challenge, we introduce AutoSketch, a style-aware vector sketch completion method that accommodates diverse sketch styles. Our key observation is that the style descriptions of a sketch in natural language preserve the style during automatic sketch completion. Thus, we use a pretrained vision-language model (VLM) to describe the styles of the partial sketches in natural language and replicate these styles using newly generated strokes. We initially optimize the strokes to match an input prompt augmented by style descriptions extracted from the VLM. Such descriptions allow the method to establish a diffusion prior in close alignment with that of the partial sketch. Next, we utilize the VLM to generate executable style-adjustment code that adjusts the strokes to conform to the desired style.
We compare our method with existing methods across various sketch styles and prompts, perform extensive ablation studies and qualitative and quantitative evaluations, and demonstrate that AutoSketch can support various sketch scenarios.

Published: 2025-02-07T23:57:22Z
Comment: 11 pages, Hsiao-Yuan Chin and I-Chao Shen contributed equally to the paper
Authors: Hsiao-Yuan Chin, I-Chao Shen, Yi-Ting Chiu, Ariel Shamir, Bing-Yu Chen

DecoMind: A Generative AI System for Personalized Interior Design Layouts
http://arxiv.org/abs/2508.16696v1
Updated: 2025-08-22T00:01:48Z

This paper introduces a system for generating interior design layouts based on user inputs, such as room type, style, and furniture preferences. CLIP retrieves relevant furniture from a dataset, and a layout containing that furniture is fed, together with a prompt, to Stable Diffusion with ControlNet to generate a design that incorporates the selected furniture. The design is then evaluated by classifiers to ensure alignment with the user's inputs, offering an automated solution for realistic interior design.

Published: 2025-08-22T00:01:48Z
Comment: ~7 pages; ~32 figures; compiled with pdfLaTeX. Primary category: cs.CV. (Secondary: cs.AI)
Authors: Reema Alshehri, Rawan Alotaibi, Leen Almasri, Rawan Altaweel

Scaling Group Inference for Diverse and High-Quality Generation
http://arxiv.org/abs/2508.15773v1
Updated: 2025-08-21T17:59:57Z

Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples.
We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.

Published: 2025-08-21T17:59:57Z
Comment: Project website: https://www.cs.cmu.edu/~group-inference, GitHub: https://github.com/GaParmar/group-inference
Authors: Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, Jun-Yan Zhu

Neural Robot Dynamics
http://arxiv.org/abs/2508.15755v1
Updated: 2025-08-21T17:54:41Z

Accurate and efficient simulation of modern robots remains challenging due to their high degrees of freedom and intricate mechanisms. Neural simulators have emerged as a promising alternative to traditional analytical simulators, capable of efficiently predicting complex dynamics and adapting to real-world data; however, existing neural simulators typically require application-specific training and fail to generalize to novel tasks and/or environments, primarily due to inadequate representations of the global state. In this work, we address the problem of learning generalizable neural simulators for robots that are structured as articulated rigid bodies. We propose NeRD (Neural Robot Dynamics), learned robot-specific dynamics models for predicting future states for articulated rigid bodies under contact constraints.
NeRD uniquely replaces the low-level dynamics and contact solvers in an analytical simulator and employs a robot-centric and spatially-invariant simulation state representation. We integrate the learned NeRD models as an interchangeable backend solver within a state-of-the-art robotics simulator. We conduct extensive experiments to show that the NeRD simulators are stable and accurate over a thousand simulation steps; generalize across tasks and environment configurations; enable policy learning exclusively in a neural engine; and, unlike most classical simulators, can be fine-tuned from real-world data to bridge the gap between simulation and reality.

Published: 2025-08-21T17:54:41Z
Authors: Jie Xu, Eric Heiden, Iretiayo Akinola, Dieter Fox, Miles Macklin, Yashraj Narang

LV-Net: Anatomy-aware lateral ventricle shape modeling with a case study on Alzheimer's disease
http://arxiv.org/abs/2508.06055v2
Updated: 2025-08-21T02:16:33Z

Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets.
Finally, we apply LV-Net to Alzheimer's disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The code for LV shape modeling is available at https://github.com/PWonjung/LV_Shape_Modeling.

Published: 2025-08-08T06:25:18Z
Authors: Wonjung Park, Suhyun Ahn, Jinah Park (each listed for the Alzheimer's Disease Neuroimaging Initiative and the Australian Imaging Biomarkers and Lifestyle flagship study of ageing)

Emergent Crowds Dynamics from Language-Driven Multi-Agent Interactions
http://arxiv.org/abs/2508.15047v1
Updated: 2025-08-20T20:15:14Z

Animating and simulating crowds using an agent-based approach is a well-established area where every agent in the crowd is individually controlled such that global human-like behaviour emerges. We observe that human navigation and movement in crowds are often influenced by complex social and environmental interactions, driven mainly by language and dialogue. However, most existing work does not consider these dimensions and leads to animations where agent-agent and agent-environment interactions are largely limited to steering and fixed higher-level goal extrapolation.
We propose a novel method that exploits large language models (LLMs) to control agents' movement. Our method has two main components: a dialogue system and language-driven navigation. We periodically query agent-centric LLMs conditioned on character personalities, roles, desires, and relationships to control the generation of inter-agent dialogue when necessitated by the spatial and social relationships with neighbouring agents. We then use the conversation and each agent's personality, emotional state, vision, and physical state to control the navigation and steering of each agent. Our model thus enables agents to make motion decisions based on both their perceptual inputs and the ongoing dialogue.
We validate our method in two complex scenarios that exemplify the interplay between social interactions, steering, and crowding. In these scenarios, we observe that grouping and ungrouping of agents automatically occur. Additionally, our experiments show that our method serves as an information-passing mechanism within the crowd. As a result, our framework produces more realistic crowd simulations, with emergent group behaviours arising naturally from any environmental setting.

Published: 2025-08-20T20:15:14Z
Authors: Yibo Liu, Liam Shatzel, Brandon Haworth, Teseo Schneider