http://arxiv.org/abs/2503.20999v2 Text-Driven Voice Conversion via Latent State-Space Modeling 2025-07-30T17:02:18Z

Text-driven voice conversion allows customization of speaker characteristics and prosodic elements using textual descriptions. However, most existing methods rely heavily on direct text-to-speech training, limiting their flexibility in controlling nuanced style elements or timbral features. In this paper, we propose a novel \textbf{Latent State-Space} approach for text-driven voice conversion (\textbf{LSS-VC}). Our method treats each utterance as an evolving dynamical system in a continuous latent space. Drawing inspiration from mamba, which introduced a state-space model for efficient text-driven \emph{image} style transfer, we adapt a loosely related methodology for \emph{voice} style transformation. Specifically, we learn a voice latent manifold where style and content can be manipulated independently by textual style prompts. We propose an adaptive cross-modal fusion mechanism to inject style information into the voice latent representation, enabling interpretable and fine-grained control over speaker identity, speaking rate, and emphasis. Extensive experiments show that our approach significantly outperforms recent baselines in both subjective and objective quality metrics, while offering smoother transitions between styles, reduced artifacts, and more precise text-based style control.

2025-03-26T21:30:29Z arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliation Wen Li Sofia Martinez Priyanka Shah http://arxiv.org/abs/2503.20992v2 ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer 2025-07-30T17:02:04Z

Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present \emph{ReverBERT}, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu~\cite{wang2024stylemamba}. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel \emph{Transformer-based SSM} layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that \emph{ReverBERT} significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.

2025-03-26T21:11:17Z arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliation Michael Brown Sofia Martinez Priya Singh http://arxiv.org/abs/2503.20988v2 Cross-Modal State-Space Graph Reasoning for Structured Summarization 2025-07-30T17:01:45Z

The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a \textit{Cross-Modal State-Space Graph Reasoning} (\textbf{CSS-GR}) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.

2025-03-26T21:06:56Z arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliation Hannah Kim Sofia Martinez Jason Lee http://arxiv.org/abs/2503.16133v2 Multi-Prompt Style Interpolation for Fine-Grained Artistic Control 2025-07-29T20:13:57Z

Text-driven image style transfer has seen remarkable progress with methods leveraging cross-modal embeddings for fast, high-quality stylization. However, most existing pipelines assume a \emph{single} textual style prompt, limiting the range of artistic control and expressiveness. In this paper, we propose a novel \emph{multi-prompt style interpolation} framework that extends the recently introduced \textbf{StyleMamba} approach. Our method supports blending or interpolating among multiple textual prompts (e.g., ``cubism,'' ``impressionism,'' and ``cartoon''), allowing the creation of nuanced or hybrid artistic styles within a \emph{single} image. We introduce a \textit{Multi-Prompt Embedding Mixer} combined with \textit{Adaptive Blending Weights} to enable fine-grained control over the spatial and semantic influence of each style. Further, we propose a \emph{Hierarchical Masked Directional Loss} to refine region-specific style consistency. Experiments and user studies confirm our approach outperforms single-prompt baselines and naive linear combinations of styles, achieving superior style fidelity, text-image alignment, and artistic flexibility, all while maintaining the computational efficiency offered by the state-space formulation.
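
The abstract above contrasts its method with "naive linear combinations of styles". As a point of reference, that naive baseline can be sketched as a convex combination of prompt embeddings. This is only an illustration of the baseline idea, not the paper's Multi-Prompt Embedding Mixer; the function name `blend_prompt_embeddings`, the softmax weighting, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math


def blend_prompt_embeddings(embeddings, logits):
    """Naive baseline (assumed form): convex combination of prompt embeddings.

    embeddings: list of equal-length lists of floats, one per text prompt.
    logits: one unnormalized weight per prompt; softmax turns them into
            non-negative weights that sum to 1.
    Returns (blended_embedding, weights).
    """
    if not embeddings or len(embeddings) != len(logits):
        raise ValueError("need one logit per embedding and at least one prompt")
    # Numerically stable softmax over the per-prompt logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    blended = [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(dim)
    ]
    return blended, weights


# Hypothetical toy embeddings for two prompts, blended with equal weight.
blended, weights = blend_prompt_embeddings([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A single global weight per prompt, as here, cannot vary the influence of a style across image regions, which is the limitation the abstract's adaptive, spatially aware weights are said to address.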

2025-03-20T13:29:32Z arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliation Lei Chen Hao Li Yuxin Zhang Chao Li Kai Wen http://arxiv.org/abs/2503.16129v2 Controllable Segmentation-Based Text-Guided Style Editing 2025-07-29T20:13:40Z

We present a novel approach for controllable, region-specific style editing driven by textual prompts. Building upon the state-space style alignment framework introduced by \emph{StyleMamba}, our method integrates a semantic segmentation model into the style transfer pipeline. This allows users to selectively apply text-driven style changes to specific segments (e.g., ``turn the building into a cyberpunk tower'') while leaving other regions (e.g., ``people'' or ``trees'') unchanged. By incorporating region-wise condition vectors and a region-specific directional loss, our method achieves high-fidelity transformations that respect both semantic boundaries and user-driven style descriptions. Extensive experiments demonstrate that our approach can flexibly handle complex scene stylizations in real-world scenarios, improving control and quality over purely global style transfer methods.

2025-03-20T13:24:41Z arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliation Jingwen Li Aravind Chandrasekar Mariana Rocha Chao Li Yuqing Chen http://arxiv.org/abs/2503.12291v2 Text-Driven Video Style Transfer with State-Space Models: Extending StyleMamba for Temporal Coherence 2025-07-29T20:13:17Z

StyleMamba has recently demonstrated efficient text-driven image style transfer by leveraging state-space models (SSMs) and masked directional losses. In this paper, we extend the StyleMamba framework to handle video sequences. We propose new temporal modules, including a \emph{Video State-Space Fusion Module} to model inter-frame dependencies and a novel \emph{Temporal Masked Directional Loss} that ensures style consistency while addressing scene changes and partial occlusions. Additionally, we introduce a \emph{Temporal Second-Order Loss} to suppress abrupt style variations across consecutive frames. Our experiments on DAVIS and UCF101 show that the proposed approach outperforms competing methods in terms of style consistency, smoothness, and computational efficiency. We believe our new framework paves the way for real-time text-driven video stylization with state-of-the-art perceptual results.
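
The abstract above does not define its Temporal Second-Order Loss, so the following is only a generic sketch of what a second-order temporal smoothness penalty commonly looks like: the mean squared second finite difference of per-frame features. The function name `second_order_temporal_penalty` and the choice of plain feature vectors per frame are assumptions for illustration, not the paper's formulation.

```python
def second_order_temporal_penalty(frames):
    """Mean squared second-order finite difference across consecutive frames.

    frames: list of equal-length feature vectors (lists of floats), one per
            frame. The penalty is zero when features change linearly over
            time and grows with abrupt changes in the rate of change.
    """
    if len(frames) < 3:
        return 0.0  # a second difference needs at least three frames
    total, count = 0.0, 0
    for prev, cur, nxt in zip(frames, frames[1:], frames[2:]):
        for p, c, n in zip(prev, cur, nxt):
            total += (n - 2.0 * c + p) ** 2
            count += 1
    return total / count if count else 0.0


# A steady drift is not penalized; a sudden flicker is.
steady = second_order_temporal_penalty([[0.0], [1.0], [2.0], [3.0]])
flicker = second_order_temporal_penalty([[0.0], [1.0], [0.0]])
```

Unlike a first-order penalty, this form leaves gradual, constant-rate style changes unpenalized, which is one plausible reason to prefer it for video.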

2025-03-15T23:09:03Z arXiv admin note: This paper has been withdrawn by arXiv due to disputed and unverifiable authorship and affiliation Chao Li Minsu Park Cristina Rossi Zhuang Li http://arxiv.org/abs/2411.17489v3 Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions 2025-07-29T14:56:41Z

Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of no-reference image metrics in predicting reliable artifact maps. The absence of such metrics hinders assessment of the quality of novel views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. To tackle this, recent work has established a new category of metrics (cross-reference), predicting image quality solely by leveraging context from alternate viewpoint captures (arXiv:2404.14409). In this work, we propose a new cross-reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the training views to establish a scene-specific distribution, later used to identify poorly reconstructed regions in the novel views. Given the lack of good measures to evaluate cross-reference methods in the context of 3D reconstruction, we collected a novel human-labeled dataset of artifact and distortion maps in unseen reconstructed views. Through this dataset, we demonstrate that our method achieves state-of-the-art localization of artifacts in novel views, correlating with human assessment, even without aligned references. We can leverage our new metric to enhance applications like automatic image restoration, guided acquisition, or 3D reconstruction from sparse inputs. Find the project page at https://nihermann.github.io/puzzlesim/ .

2024-11-26T14:57:30Z Nicolai Hermann Jorge Condor Piotr Didyk http://arxiv.org/abs/2507.21686v1 Solving Boundary Handling Analytically in Two Dimensions for Smoothed Particle Hydrodynamics 2025-07-29T10:59:01Z

We present a fully analytic approach for evaluating boundary integrals in two dimensions for Smoothed Particle Hydrodynamics (SPH). Conventional methods often rely on boundary particles or wall re-normalization approaches derived from applying the divergence theorem, whereas our method directly evaluates the area integrals for SPH kernels and gradients over triangular boundaries. This direct integration strategy inherently accommodates higher-order boundary conditions, such as piecewise cubic fields defined via Finite Element stencils, enabling analytic and flexible coupling with mesh-based solvers. At the core of our approach is a general solution for compact polynomials of arbitrary degree over triangles by decomposing the boundary elements into elementary integrals that can be solved with closed-form solutions. We provide a complete, closed-form solution for these generalized integrals, derived by relating the angular components to Chebyshev polynomials and solving the resulting radial integral via a numerically stable evaluation of the Gaussian hypergeometric function $_2F_1$. Our solution is robust and adaptable and works regardless of triangle geometries and kernel functions. We validate the accuracy against high-precision numerical quadrature rules, as well as in problems with known exact solutions. We provide an open-source implementation of our general solution using differentiable programming to facilitate the adoption of our approach to SPH and other contexts that require analytic integration over polygonal domains. Our analytic solution outperforms existing numerical quadrature rules for this problem by up to five orders of magnitude, for integrals and their gradients, while providing a flexible framework to couple arbitrary triangular meshes analytically to Lagrangian schemes, building a strong foundation for addressing several grand challenges in SPH and beyond.

2025-07-29T10:59:01Z Rene Winchenbach Andreas Kolb http://arxiv.org/abs/2408.02275v2 Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes 2025-07-29T10:43:31Z

This paper introduces a novel integration of Large Language Models (LLMs) with Conformal Geometric Algebra (CGA) to revolutionize controllable 3D scene editing, particularly for object repositioning tasks, which traditionally require intricate manual processes and specialized expertise. These conventional methods typically suffer from reliance on large training datasets or lack a formalized language for precise edits. Utilizing CGA as a robust formal language, our system, Shenlong, precisely models spatial transformations necessary for accurate object repositioning. Leveraging the zero-shot learning capabilities of pre-trained LLMs, Shenlong translates natural language instructions into CGA operations which are then applied to the scene, facilitating exact spatial transformations within 3D scenes without the need for specialized pre-training. Implemented in a realistic simulation environment, Shenlong ensures compatibility with existing graphics pipelines. To accurately assess the impact of CGA, we benchmark against robust Euclidean Space baselines, evaluating both latency and accuracy. Comparative performance evaluations indicate that Shenlong significantly reduces LLM response times by 16% and boosts success rates by 9.6% on average compared to the traditional methods. Notably, Shenlong achieves a 100% success rate in common practical queries, a benchmark where other systems fall short. These advancements underscore Shenlong's potential to democratize 3D scene editing, enhancing accessibility and fostering innovation across sectors such as education, digital entertainment, and virtual reality.

2024-08-05T07:10:40Z 10 pages, 4 figures Prodromos Kolyvakis Manos Kamarianakis George Papagiannakis http://arxiv.org/abs/2409.15023v5 Efficient Nearest Neighbor Search Using Dynamic Programming 2025-07-29T10:36:14Z

Given a collection of points in R^3, KD-Tree and R-Tree are well-known nearest neighbor search (NNS) algorithms that rely on space partitioning and spatial indexing techniques. However, when the query point is far from the data points or the data points inherently represent a 2-manifold surface, their query performance may degrade. To address this, we propose a novel dynamic programming technique that precomputes a Directed Acyclic Graph (DAG) to encode the proximity structure between data points. More specifically, the DAG captures how the proximity structure evolves during the incremental construction of the Voronoi diagram of the data points. Experimental results demonstrate that our method achieves a 1x-10x speedup. Additionally, our algorithm demonstrates significant practical value across diverse applications. We validated its effectiveness through extensive testing in four key applications: Point to Mesh Distance Queries, Iterative Closest Point (ICP) Registration, Density Peak Clustering, and Point to Segments Distance Queries. A particularly notable feature of our approach is its unique ability to efficiently identify the nearest neighbor among the first k points in the point cloud, a capability that enables substantial acceleration in low-dimensional applications like Density Peak Clustering. As a natural extension of our incremental construction process, our method can also be readily adapted for farthest point sampling tasks. These experimental results across multiple domains underscore the broad applicability and practical importance of our approach.
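
To make the "nearest neighbor among the first k points" query in the abstract above concrete, here is a brute-force reference that defines its semantics. It is not the paper's DAG-based method; it is the O(k) baseline such a method would need to agree with. The function name `nearest_in_prefix` and the sample coordinates are hypothetical.

```python
import math


def nearest_in_prefix(points, query, k):
    """Index of the point nearest to `query` among points[:k], by brute force.

    points: sequence of coordinate tuples in insertion order.
    query:  coordinate tuple of the same dimension.
    k:      size of the prefix to search, with 1 <= k <= len(points).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must satisfy 1 <= k <= len(points)")
    return min(range(k), key=lambda i: math.dist(points[i], query))


# Hypothetical sample data: the answer depends on how long the prefix is.
pts = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (1.0, 1.0, 1.0)]
q = (1.0, 1.0, 0.9)
first_two = nearest_in_prefix(pts, q, 2)    # only pts[0] and pts[1] are candidates
all_three = nearest_in_prefix(pts, q, 3)    # pts[2] is now a candidate as well
```

The link to Density Peak Clustering is that, once points are sorted by decreasing density, the nearest higher-density neighbor of the point at position k is exactly its nearest neighbor among the first k points.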

2024-09-23T13:50:39Z Pengfei Wang Jiantao Song Shiqing Xin Shuangmin Chen Changhe Tu Wenping Wang Jiaye Wang http://arxiv.org/abs/2503.09631v2 V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video 2025-07-29T10:07:37Z

We present V2M4, a novel 4D reconstruction method that directly generates a usable 4D mesh animation asset from a single monocular video. Unlike existing approaches that rely on priors from multi-view image and video generation models, our method is based on native 3D mesh generation models. Naively applying 3D mesh generation models to generate a mesh for each frame in a 4D task can lead to issues such as incorrect mesh poses, misalignment of mesh appearance, and inconsistencies in mesh geometry and texture maps. To address these problems, we propose a structured workflow that includes camera search and mesh reposing, condition embedding optimization for mesh appearance refinement, pairwise mesh registration for topology consistency, and global texture map optimization for texture consistency. Our method outputs high-quality 4D animated assets that are compatible with mainstream graphics and game software. Experimental results across a variety of animation types and motion amplitudes demonstrate the generalization and effectiveness of our method. Project page: https://windvchen.github.io/V2M4/.

2025-03-11T19:22:14Z Accepted by ICCV 2025. Project page: https://windvchen.github.io/V2M4/ Jianqi Chen Biao Zhang Xiangjun Tang Peter Wonka http://arxiv.org/abs/2504.06385v3 Fast Globally Optimal and Geometrically Consistent 3D Shape Matching 2025-07-29T06:28:24Z

Geometric consistency, i.e. the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g. a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic paths, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields global geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results.

2025-04-08T19:08:43Z 8 pages main paper, 9 pages supplementary Paul Roetzer Florian Bernard http://arxiv.org/abs/2507.21493v1 BANG: Dividing 3D Assets via Generative Exploded Dynamics 2025-07-29T04:21:21Z

3D creation has always been a unique human strength, driven by our ability to deconstruct and reassemble objects using our eyes, mind and hand. However, current 3D design tools struggle to replicate this natural process, requiring considerable artistic expertise and manual labor. This paper introduces BANG, a novel generative approach that bridges 3D generation and reasoning, allowing for intuitive and flexible part-level decomposition of 3D objects. At the heart of BANG is "Generative Exploded Dynamics", which creates a smooth sequence of exploded states for an input geometry, progressively separating parts while preserving their geometric and semantic coherence. BANG utilizes a pre-trained large-scale latent diffusion model, fine-tuned for exploded dynamics with a lightweight exploded view adapter, allowing precise control over the decomposition process. It also incorporates a temporal attention module to ensure smooth transitions and consistency across time. BANG enhances control with spatial prompts, such as bounding boxes and surface regions, enabling users to specify which parts to decompose and how. This interaction can be extended with multimodal models like GPT-4, enabling 2D-to-3D manipulations for more intuitive and creative workflows. The capabilities of BANG extend to generating detailed part-level geometry, associating parts with functional descriptions, and facilitating component-aware 3D creation and manufacturing workflows. Additionally, BANG offers applications in 3D printing, where separable parts are generated for easy printing and reassembly. In essence, BANG enables seamless transformation from imaginative concepts to detailed 3D assets, offering a new perspective on creation that resonates with human intuition.

2025-07-29T04:21:21Z Homepage: https://sites.google.com/view/bang7355608 Longwen Zhang Qixuan Zhang Haoran Jiang Yinuo Bai Wei Yang Lan Xu Jingyi Yu 10.1145/3730840 http://arxiv.org/abs/2510.15877v1 Procedural modeling of urban land use 2025-07-29T02:23:12Z

Cities are important elements of content in digital productions, but their complexity and size make them very challenging to model. Few tools exist that can help artists with this work, even as rapid improvements in graphics hardware create demand for richer content without matching increases in production cost. We propose a method for procedurally generating realistic patterns of land use in cities, automating placement of buildings and roads for artists.

2025-07-29T02:23:12Z Thomas Lechner Ben Watson Uri Wilenski Seth Tisue Martin Felsen Andy Moddrell Pin Ren Craig Brozefsky http://arxiv.org/abs/2507.18972v2 TiVy: Time Series Visual Summary for Scalable Visualization 2025-07-28T23:00:54Z

Visualizing multiple time series presents fundamental tradeoffs between scalability and visual clarity. Time series capture the behavior of many large-scale real-world processes, from stock market trends to urban activities. Users often gain insights by visualizing them as line charts, juxtaposing or superposing multiple time series to compare them and identify trends and patterns. However, existing representations struggle with scalability: covering long time spans leads to visual clutter from too many small multiples or overlapping lines. We propose TiVy, a new algorithm that summarizes time series using sequential patterns. It transforms the series into a set of symbolic sequences based on subsequence visual similarity using Dynamic Time Warping (DTW), then constructs a disjoint grouping of similar subsequences based on the frequent sequential patterns. The grouping result, a visual summary of time series, provides uncluttered superposition with fewer small multiples. Unlike common clustering techniques, TiVy extracts similar subsequences (of varying lengths) aligned in time. We also present an interactive time series visualization that renders large-scale time series in real-time. Our experimental evaluation shows that our algorithm (1) extracts clear and accurate patterns when visualizing time series data, and (2) achieves a significant speed-up (1000X) compared to a straightforward DTW clustering. We also demonstrate the efficiency of our approach to explore hidden structures in massive time series data in two usage scenarios.
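
For readers unfamiliar with the Dynamic Time Warping measure named in the abstract above, a minimal textbook version is sketched below. It shows only the standard quadratic-time DTW distance between two 1-D sequences, not TiVy's symbolic encoding or grouping; the function name `dtw_distance` and the absolute-difference local cost are choices made for this example.

```python
import math


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences.

    Runs in O(len(a) * len(b)) time using absolute difference as local cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matches an earlier element of b
                cost[i][j - 1],      # b[j-1] also matches an earlier element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Sequences of different lengths with the same shape align at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Because each pairwise comparison is quadratic in sequence length, clustering many subsequences with plain DTW becomes expensive quickly, which is the cost the abstract's reported speed-up is measured against.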

2025-07-25T05:50:01Z to be published in TVCG (IEEE VIS 2025) Gromit Yeuk-Yin Chan Luis Gustavo Nonato Themis Palpanas Cláudio T. Silva Juliana Freire