https://arxiv.org/api/PoSomZVUruZVaeOSj/t+gab1OXc2026-06-28T07:36:26Z9390180015http://arxiv.org/abs/2507.19830v2Taking Language Embedded 3D Gaussian Splatting into the Wild2025-08-05T01:40:57ZRecent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been given to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for understanding the 3D structure of architectural components? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps-transient and appearance uncertainty-derived from the multi-appearance features to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.2025-07-26T07:00:32ZVisit our project page at https://yuzewang1998.github.io/takinglangsplatw/Yuze WangYue Qihttp://arxiv.org/abs/2503.11801v3Diffuse-CLoC: Guided Diffusion for Physics-based Character Look-ahead Control2025-08-05T00:46:24ZWe present Diffuse-CLoC, a guided diffusion framework for physics-based look-ahead control that enables intuitive, steerable, and physically realistic motion generation. While existing kinematics motion generation with diffusion models offer intuitive steering capabilities with inference-time conditioning, they often fail to produce physically viable motions. In contrast, recent diffusion-based control policies have shown promise in generating physically realizable motion sequences, but the lack of kinematics prediction limits their steerability. Diffuse-CLoC addresses these challenges through a key insight: modeling the joint distribution of states and actions within a single diffusion model makes action generation steerable by conditioning it on the predicted states. This approach allows us to leverage established conditioning techniques from kinematic motion generation while producing physically realistic motions. As a result, we achieve planning capabilities without the need for a high-level planner. Our method handles a diverse set of unseen long-horizon downstream tasks through a single pre-trained model, including static and dynamic obstacle avoidance, motion in-betweening, and task-space control. Experimental results show that our method significantly outperforms the traditional hierarchical framework of high-level motion diffusion and low-level tracking.2025-03-14T18:42:29ZXiaoyu HuangTakara TruongYunbo ZhangFangzhou YuJean Pierre SleimanJessica HodginsKoushil SreenathFarbod Farshidian10.1145/3731206http://arxiv.org/abs/2503.10624v3ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness2025-08-04T19:17:31ZFitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.2025-03-13T17:59:14ZPage: https://boqian-li.github.io/ETCH/, Code: https://github.com/boqian-li/ETCHBoqian LiHaiwen FengZeyu CaiMichael J. BlackYuliang Xiuhttp://arxiv.org/abs/2412.17804v3GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects2025-08-04T13:59:08ZWe introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that represents continuous piece of matter, accounting for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that further organizes kernels into CMSs with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GausSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations. To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. Experimental results demonstrate that GausSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model are available at our project page: https://www.mmlab-ntu.com/project/gausim/index.html .2024-12-23T18:58:17ZAccepted by ICCV2025. Project page: https://www.mmlab-ntu.com/project/gausim/index.htmlYidi ShaoMu HuangChen Change LoyBo Daihttp://arxiv.org/abs/2508.02304v1ASDR: Exploiting Adaptive Sampling and Data Reuse for CIM-based Instant Neural Rendering2025-08-04T11:17:21ZNeural Radiance Fields (NeRF) offer significant promise for generating photorealistic images and videos. However, existing mainstream neural rendering models often fall short in meeting the demands for immediacy and power efficiency in practical applications. Specifically, these models frequently exhibit irregular access patterns and substantial computational overhead, leading to undesirable inference latency and high power consumption. Computing-in-memory (CIM), an emerging computational paradigm, has the potential to address these access bottlenecks and reduce the power consumption associated with model execution.
To bridge the gap between model performance and real-world scene requirements, we propose an algorithm-architecture co-design approach, abbreviated as ASDR, a CIM-based accelerator supporting efficient neural rendering. At the algorithmic level, we propose two rendering optimization schemes: (1) Dynamic sampling by online sensing of the rendering difficulty of different pixels, thus reducing access memory and computational overhead. (2) Reducing MLP overhead by decoupling and approximating the volume rendering of color and density. At the architecture level, we design an efficient ReRAM-based CIM architecture with efficient data mapping and reuse microarchitecture. Experiments demonstrate that our design can achieve up to $9.55\times$ and $69.75\times$ speedup over state-of-the-art NeRF accelerators and Xavier NX GPU in graphics rendering tasks with only $0.1$ PSNR loss.2025-08-04T11:17:21ZAccepted by the 2025 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2025). The paper will be presented at ASPLOS 2026Fangxin LiuHaomin LiBowen ZhuZongwu WangZhuoran SongHabing GuanLi Jianghttp://arxiv.org/abs/2412.19853v2Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation2025-08-03T18:03:12ZBalancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.2024-12-25T17:51:55ZConference paper at CVPR 2025. Project page: https://nadavc220.github.io/conditional-balance.github.io/Nadav Z. CohenOron NirAriel Shamirhttp://arxiv.org/abs/2501.18898v3GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling2025-08-03T07:25:09ZGenerating full-body human gestures based on speech signals remains challenges on quality and speed. Existing approaches model different body regions such as body, legs and hands separately, which fail to capture the spatial interactions between them and result in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. Our method i) explicitly model the interaction of tokenized body regions through spatial and temporal attention, for generating coherent full-body gestures. ii) introduce the flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of flow matching baseline, we propose latent shortcut learning and beta distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining the spatial-temporal modeling and improved flow matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project Page: https://andypinxinliu.github.io/GestureLSM2025-01-31T05:34:59ZAccepted to ICCV 2025. Project Page: https://andypinxinliu.github.io/GestureLSMPinxin LiuLuchuan SongJunhua HuangHaiyang LiuChenliang Xuhttp://arxiv.org/abs/2411.15472v3KinMo: Kinematic-aware Human Motion Understanding and Generation2025-08-03T07:16:03ZCurrent human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation capabilities. A single coarse description, such as run, fails to capture details such as variations in speed, limb positioning, and kinematic dynamics, leading to ambiguities between text and motion modalities. To address this challenge, we introduce KinMo, a unified framework built on a hierarchical describable motion representation that extends beyond global actions by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset and offering a scalable and cost-efficient solution for dataset enrichment. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment that progressively integrates additional motion details, thereby improving semantic motion understanding. Furthermore, we introduce a coarse-to-fine motion generation procedure to leverage enhanced spatial understanding to improve motion synthesis. Experimental results show that KinMo significantly improves motion understanding, demonstrated by enhanced text-motion retrieval performance and enabling more fine-grained motion generation and editing capabilities. Project Page: https://andypinxinliu.github.io/KinMo2024-11-23T06:50:11ZAccepted to ICCV 2025; Project page: https://andypinxinliu.github.io/KinMoPengfei ZhangPinxin LiuPablo GarridoHyeongwoo KimBindita Chaudhurihttp://arxiv.org/abs/2508.01590v1A Plug-and-Play Multi-Criteria Guidance for Diverse In-Betweening Human Motion Generation2025-08-03T05:06:37ZIn-betweening human motion generation aims to synthesize intermediate motions that transition between user-specified keyframes. In addition to maintaining smooth transitions, a crucial requirement of this task is to generate diverse motion sequences. It is still challenging to maintain diversity, particularly when it is necessary for the motions within a generated batch sampling to differ meaningfully from one another due to complex motion dynamics. In this paper, we propose a novel method, termed the Multi-Criteria Guidance with In-Betweening Motion Model (MCG-IMM), for in-betweening human motion generation. A key strength of MCG-IMM lies in its plug-and-play nature: it enhances the diversity of motions generated by pretrained models without introducing additional parameters This is achieved by providing a sampling process of pretrained generative models with multi-criteria guidance. Specifically, MCG-IMM reformulates the sampling process of pretrained generative model as a multi-criteria optimization problem, and introduces an optimization process to explore motion sequences that satisfy multiple criteria, e.g., diversity and smoothness. Moreover, our proposed plug-and-play multi-criteria guidance is compatible with different families of generative models, including denoised diffusion probabilistic models, variational autoencoders, and generative adversarial networks. Experiments on four popular human motion datasets demonstrate that MCG-IMM consistently state-of-the-art methods in in-betweening motion generation task.2025-08-03T05:06:37ZIEEE Transactions on Multimedia 2025Hua YuJiao LiuXu GuiMelvin WongYaqing HouYew-Soon Onghttp://arxiv.org/abs/2508.01537v1FluidFormer: Transformer with Continuous Convolution for Particle-based Fluid Simulation2025-08-03T01:44:17ZLearning-based fluid simulation networks have been proven as viable alternatives to traditional numerical solvers for the Navier-Stokes equations. Existing neural methods follow Smoothed Particle Hydrodynamics (SPH) frameworks, which inherently rely only on local inter-particle interactions. However, we emphasize that global context integration is also essential for learning-based methods to stabilize complex fluid simulations. We propose the first Fluid Attention Block (FAB) with a local-global hierarchy, where continuous convolutions extract local features while self-attention captures global dependencies. This fusion suppresses the error accumulation and models long-range physical phenomena. Furthermore, we pioneer the first Transformer architecture specifically designed for continuous fluid simulation, seamlessly integrated within a dual-pipeline architecture. Our method establishes a new paradigm for neural fluid simulation by unifying convolution-based local features with attention-based global context modeling. FluidFormer demonstrates state-of-the-art performance, with stronger stability in complex fluid scenarios.2025-08-03T01:44:17ZNianyi WangYu ChenShuai Zhenghttp://arxiv.org/abs/2507.19718v2GSCache: Real-Time Radiance Caching for Volume Path Tracing using 3D Gaussian Splatting2025-08-02T23:59:11ZReal-time path tracing is rapidly becoming the standard for rendering in entertainment and professional applications. In scientific visualization, volume rendering plays a crucial role in helping researchers analyze and interpret complex 3D data. Recently, photorealistic rendering techniques have gained popularity in scientific visualization, yet they face significant challenges. One of the most prominent issues is slow rendering performance and high pixel variance caused by Monte Carlo integration. In this work, we introduce a novel radiance caching approach for path-traced volume rendering. Our method leverages advances in volumetric scene representation and adapts 3D Gaussian splatting to function as a multi-level, path-space radiance cache. This cache is designed to be trainable on the fly, dynamically adapting to changes in scene parameters such as lighting configurations and transfer functions. By incorporating our cache, we achieve less noisy, higher-quality images without increasing rendering costs. To evaluate our approach, we compare it against a baseline path tracer that supports uniform sampling and next-event estimation and the state-of-the-art for neural radiance caching. Through both quantitative and qualitative analyses, we demonstrate that our path-space radiance cache is a robust solution that is easy to integrate and significantly enhances the rendering quality of volumetric visualization applications while maintaining comparable computational efficiency.2025-07-25T23:55:54ZDavid BauerQi WuHamid GadirovKwan-Liu Mahttp://arxiv.org/abs/2508.01381v1ReMu: Reconstructing Multi-layer 3D Clothed Human from Image Layers2025-08-02T14:24:47ZThe reconstruction of multi-layer 3D garments typically requires expensive multi-view capture setups and specialized 3D editing efforts. To support the creation of life-like clothed human avatars, we introduce ReMu for reconstructing multi-layer clothed humans in a new setup, Image Layers, which captures a subject wearing different layers of clothing with a single RGB camera. To reconstruct physically plausible multi-layer 3D garments, a unified 3D representation is necessary to model these garments in a layered manner. Thus, we first reconstruct and align each garment layer in a shared coordinate system defined by the canonical body pose. Afterwards, we introduce a collision-aware optimization process to address interpenetration and further refine the garment boundaries leveraging implicit neural fields. It is worth noting that our method is template-free and category-agnostic, which enables the reconstruction of 3D garments in diverse clothing styles. Through our experiments, we show that our method reconstructs nearly penetration-free 3D clothed humans and achieves competitive performance compared to category-specific methods. Project page: https://eth-ait.github.io/ReMu/2025-08-02T14:24:47ZBMVC 2025 paper, 17 pages, 10 figuresOnat VuranHsuan-I Hohttp://arxiv.org/abs/2407.00552v2Efficient Resource Management in Multicast Short Video Streaming Systems2025-08-01T19:26:24ZThe surge in popularity of short-form video content, particularly through platforms like TikTok and Instagram, has led to an exponential increase in data traffic, presenting significant challenges in network resource management. Traditional unicast streaming methods, while straightforward, are inefficient in scenarios where videos need to be delivered to a large number of users simultaneously. Multicast streaming, which sends a single stream to multiple users, can drastically reduce the required bandwidth, yet it introduces complexities in resource allocation, especially in wireless environments where bandwidth is limited and user demands are heterogeneous. This paper introduces a novel multicast resource management framework tailored for the efficient distribution of short-form video content. The proposed framework dynamically optimizes resource allocation to enhance Quality of Service (QoS) and Quality of Experience (QoE) for multiple users, balancing the trade-offs between cost, efficiency, and user satisfaction. We implement a series of optimization algorithms that account for diverse network conditions and user requirements, ensuring optimal service delivery across varying network topologies. Experimental results demonstrate that our framework can effectively reduce bandwidth usage and decrease video startup delay compared to traditional multicast approaches, significantly improving overall user satisfaction. This study not only advances the understanding of multicast streaming dynamics but also provides practical insights into scalable and efficient video distribution strategies in congested network environments.2024-06-30T01:01:09ZarXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliationBetty SearcyZurh FarusBronny BushKevin MuhammadZubair Clintonhttp://arxiv.org/abs/2407.00513v2Dynamic Optimization of Video Streaming Quality Using Network Digital Twin Technology2025-08-01T19:26:02ZThis paper introduces a novel dynamic optimization framework for video streaming that leverages Network Digital Twin (NDT) technology to address the challenges posed by fluctuating wireless network conditions. Traditional adaptive streaming methods often struggle with rapid changes in network bandwidth, latency, and packet loss, leading to suboptimal user experiences characterized by frequent buffering and reduced video quality. Our proposed framework integrates a sophisticated NDT that models the wireless network in real-time and employs predictive analytics to forecast near-future network states. Utilizing machine learning techniques, specifically Random Forest and Neural Networks, the NDT predicts bandwidth availability, latency trends, and potential packet losses before they impact video transmission. Based on these predictions, our adaptive streaming algorithm dynamically adjusts video bitrates, resolution, and buffering strategies, thus ensuring an uninterrupted and high-quality viewing experience. Experimental validations demonstrate that our approach significantly enhances the Quality of Experience (QoE) by reducing buffering times by up to 50\% and improving resolution in varied network conditions compared to conventional streaming methods. This paper underscores the potential of integrating digital twin technology into multimedia transmission, paving the way for more resilient and user-centric video streaming solutions.2024-06-29T19:04:41ZarXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliationZurh FarusBetty SearcyTina NassisidKevin Muhammadhttp://arxiv.org/abs/2407.11020v2Enhancing Vehicular Networks with Generative AI: Opportunities and Challenges2025-08-01T19:20:31ZIn the burgeoning field of intelligent transportation systems, the integration of Generative Artificial Intelligence (AI) into vehicular networks presents a transformative potential for the automotive industry. This paper explores the innovative applications of generative AI in enhancing communication protocols, optimizing traffic management, and bolstering security frameworks within vehicular networks. By examining current technologies and recent advancements, we identify key challenges such as scalability, real-time data processing, and security vulnerabilities that come with AI integration. Additionally, we propose novel applications and methodologies that leverage generative AI to simulate complex network scenarios, generate adaptive communication schemes, and enhance predictive capabilities for traffic conditions. This study not only reviews the state of the art but also highlights significant opportunities where generative AI can lead to groundbreaking improvements in vehicular network efficiency and safety. Through this comprehensive exploration, our findings aim to guide future research directions and foster a deeper understanding of generative AI's role in the next generation of vehicular technologies.2024-07-01T02:04:27ZarXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliationTeef DavidKassi MuhammadKevin NassisidBronny Farus