https://arxiv.org/api/BFu8KTCmoaVz2JinyeLTV6ib0fI 2026-03-24T08:20:11Z 186066 45 15 http://arxiv.org/abs/2601.16296v2 Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing 2026-03-23T14:55:03Z Video-to-video diffusion models achieve impressive single-turn editing performance, but practical editing workflows are inherently iterative. When edits are applied sequentially, existing models treat each turn independently, often causing previously generated regions to drift or be overwritten. We identify this failure mode as the problem of cross-turn consistency in multi-turn video editing. We introduce Memory-V2V, a memory-augmented framework that treats prior edits as structured constraints for subsequent generations. Memory-V2V maintains an external memory of previous outputs, retrieves task-relevant edits, and integrates them through relevance-aware tokenization and adaptive compression. These technical ingredients enable scalable conditioning without linear growth in computation. We demonstrate Memory-V2V on iterative video novel view synthesis and text-guided long video editing. Memory-V2V substantially enhances cross-turn consistency while maintaining visual quality, outperforming strong baselines with modest overhead. 2026-01-22T19:59:17Z Project page: https://dohunlee1.github.io/MemoryV2V Dohun Lee Chun-Hao Paul Huang Xuelin Chen Jong Chul Ye Duygu Ceylan Hyeonho Jeong http://arxiv.org/abs/2603.22057v1 SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning 2026-03-23T14:54:34Z Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. 
To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a 3.8-point gain over the pre-trained DINOv3. 2026-03-23T14:54:34Z 35 pages; 7 figures Byungwoo Jeon Dongyoung Kim Huiwon Jang Insoo Kim Jinwoo Shin http://arxiv.org/abs/2603.22054v1 FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation 2026-03-23T14:53:12Z Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. 
However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture. 2026-03-23T14:53:12Z To appear in CVPR 2026 Wuyang Luo Chengkai Tan Chang Ge Binye Hong Su Yang Yongjiu Ma http://arxiv.org/abs/2510.17699v2 GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver 2026-03-23T14:51:10Z While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. 
We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at https://github.com/3145tttt/GAS. 2025-10-20T16:14:38Z Accepted to ICLR 2026. Camera ready version Aleksandr Oganov Ilya Bykov Eva Neudachina Mishan Aliev Alexander Tolmachev Alexander Sidorov Aleksandr Zuev Andrey Okhotin Denis Rakitin Aibek Alanov http://arxiv.org/abs/2601.05848v2 Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals 2026-03-23T14:47:41Z Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives (such as elastic collisions and falling dominos), teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page. 2026-01-09T15:23:36Z Camera ready version (CVPR 2026). 
Code and interactive demos at https://goal-force.github.io/ Nate Gillman Yinghua Zhou Zitian Tang Evan Luo Arjan Chakravarthy Daksh Aggarwal Michael Freeman Charles Herrmann Chen Sun http://arxiv.org/abs/2603.22042v1 Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models 2026-03-23T14:41:20Z While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., a whole scene and its part images) through entailment. However, existing approaches do not account for the fact that each part has a different level of semantic representativeness with respect to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, assigning lower uncertainty to parts that are more representative of the whole scene and higher uncertainty to less representative ones. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git. 
2026-03-23T14:41:20Z Hayeon Kim Ji Ha Jang Junghun James Kim Se Young Chun http://arxiv.org/abs/2603.22041v1 DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation 2026-03-23T14:41:11Z Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while preserving reasonable generation quality on benign prompts. 2026-03-23T14:41:11Z Binhong Tan Zhaoxin Wang Handing Wang http://arxiv.org/abs/2603.22036v1 GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction 2026-03-23T14:38:18Z Reconstructing translucent objects from multi-view images is a difficult problem. Previously, researchers have used differentiable path tracing and neural implicit fields, which incur relatively large computational costs. 
Recently, many works have achieved good reconstruction results for opaque objects based on a 3DGS pipeline with much higher efficiency. However, such methods have difficulty dealing with translucent objects, because they do not consider the optical properties of translucent objects. In this paper, we propose a novel 3DGS-based pipeline (GTSR) to reconstruct the surface geometry of translucent objects. GTSR combines two sets of Gaussians, surface and interior Gaussians, which are used to model the surface and scattering color when light passes through translucent objects. To render the appearance of translucent objects, we introduce a method that uses the Fresnel term to blend the two sets of Gaussians. Furthermore, to improve the reconstructed details of non-contour areas, we introduce the Disney BSDF model with deferred rendering to enhance constraints of the normal and depth. Experimental results demonstrate that our method outperforms baseline reconstruction methods on the NeuralTO Syn dataset while showing great real-time rendering performance. We also extend the dataset with new translucent objects of varying material properties and demonstrate that our method can adapt to different translucent materials. 2026-03-23T14:38:18Z Youwen Yuan Xi Zhao http://arxiv.org/abs/2603.22027v1 Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models 2026-03-23T14:33:43Z Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. 
Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models. 2026-03-23T14:33:43Z 27 pages, 10 figures Purui Bai Junxian Duan Pin Wang Jinhua Hao Ming Sun Chao Zhou Huaibo Huang http://arxiv.org/abs/2603.22012v1 6D Robotic OCT Scanning of Curved Tissue Surfaces 2026-03-23T14:24:11Z Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning is limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. 
Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning. 2026-03-23T14:24:11Z Accepted at IEEE ISBI 2026 Suresh Guttikonda Maximilian Neidhardt Vidas Raudonis Alexander Schlaefer http://arxiv.org/abs/2312.17251v3 MatSegNet: a New Boundary-aware Deep Learning Model for Accurate Carbide Precipitate Analysis in High-Strength Steels 2026-03-23T14:22:00Z Lower Bainite (LB) and Tempered Martensite (TM) are two common microstructures in modern high-strength steels. LB and TM can render similar mechanical properties for steels, yet LB is often considered superior to TM in resistance to hydrogen embrittlement. This performance difference has conventionally been attributed to their distinction in certain microstructural features, particularly carbides. The present study developed MatSegNet, a new contour-aware deep learning (DL) architecture. It is tailored for comprehensive segmentation and quantitative characterization of carbide precipitates with complex contours in high-strength steels, and is shown to outperform existing state-of-the-art DL architectures. Based on MatSegNet, a high-throughput DL pipeline has been established for precise comparative carbide analysis in LB and TM. The results showed that, statistically, the two microstructures are similar in key carbide characteristics, with only marginal differences, cautioning against the conventional use of carbide orientation as a reliable means to differentiate LB and TM in practice. 
Through MatSegNet, this work demonstrated the potential of DL to play a critical role in enabling accurate and quantitative microstructure characterization to facilitate development of structure-property relationships for accelerating materials innovation. 2023-12-02T05:11:34Z Xiaohan Bie Manoj Arthanari Evelin Barbosa de Melo Baihua Ren Juancheng Li Nicolas Brodusch Stephen Yue Salim Brahimi Raynald Gauvin Jun Song http://arxiv.org/abs/2406.03017v4 DifAttack++: Query-Efficient Black-Box Adversarial Attack via Hierarchical Disentangled Feature Space in Cross-Domain 2026-03-23T14:15:24Z This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (ASR) and good generalizability. We design a novel attack method based on a hierarchical DIsentangled Feature space, called DifAttack++, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack++ first disentangles an image's latent feature into an Adversarial Feature (AF) and a Visual Feature (VF) via an autoencoder equipped with our specially designed Hierarchical Decouple-Fusion (HDF) module, where the AF dominates the adversarial capability of an image, while the VF largely determines its visual appearance. We train two such autoencoders, one for the clean image domain and one for the adversarial image domain (i.e., cross-domain), to achieve image reconstruction and feature disentanglement, using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods. Finally, in the black-box attack stage, DifAttack++ iteratively optimizes the AF according to the query feedback from the victim model until a successful AE is generated, while keeping the VF unaltered. 
Extensive experimental results demonstrate that DifAttack++ achieves higher ASR and query efficiency than state-of-the-art methods, while also producing AEs with much better visual quality. The code is available at https://github.com/csjunjun/DifAttack.git. 2024-06-05T07:32:29Z arXiv admin note: substantial text overlap with arXiv:2309.14585 An extension of the AAAI24 paper "DifAttack: Query-Efficient Black-Box Attack via Disentangled Feature Space." Jun Liu Jiantao Zhou Jiandian Zeng Jinyu Tian Zheng Li http://arxiv.org/abs/2603.22002v1 SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation 2026-03-23T14:08:40Z The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which are particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that combines Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. 
Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation. 2026-03-23T14:08:40Z Duy D. Nguyen Phat T. Tran-Truong http://arxiv.org/abs/2603.21999v1 STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection 2026-03-23T14:06:46Z Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. Its core consists of two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing relevant local details. 
Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet. 2026-03-23T14:06:46Z 12 pages, 8 figures, accepted by IEEE TMM Jianlin Chen Gongyang Li Zhijiang Zhang Liang Chang Dan Zeng http://arxiv.org/abs/2411.17292v2 TPCL: Task Progressive Curriculum Learning for Robust Visual Question Answering 2026-03-23T13:49:42Z Visual Question Answering (VQA) systems are notoriously brittle under distribution shifts and data scarcity. While previous solutions, such as ensemble methods and data augmentation, can improve performance in isolation, they fail to generalise well across in-distribution (IID), out-of-distribution (OOD), and low-data settings simultaneously. We argue that this limitation stems from the suboptimal training strategies employed. Specifically, treating all training samples uniformly, without accounting for question difficulty or semantic structure, leaves the models vulnerable to dataset biases. Thus, they struggle to generalise beyond the training distribution. To address this issue, we introduce Task-Progressive Curriculum Learning (TPCL), a simple, model-agnostic framework that progressively trains VQA models using a curriculum built by jointly considering question type and difficulty. Specifically, TPCL first groups questions based on their semantic type (e.g., yes/no, counting) and then orders them using a novel Optimal Transport-based difficulty measure. Without relying on data augmentation or explicit debiasing, TPCL improves generalisation across IID, OOD, and low-data regimes and achieves state-of-the-art performance on VQA-CP v2, VQA-CP v1, and VQA v2. 
It outperforms the most competitive robust VQA baselines by over 5% and 7% on VQA-CP v2 and v1, respectively, and boosts backbone performance by up to 28.5%. 2024-11-26T10:29:47Z Our source code is available at https://github.com/AhmedAAkl/tpcl Ahmed Akl Abdelwahed Khamis Zhe Wang Ali Cheraghian Sara Khalifa Kewen Wang