https://arxiv.org/api/mSUBwcaJq+y0sEVSOsFKq+dHbUw 2026-06-14T15:47:56Z 195716 435 15 http://arxiv.org/abs/2606.10468v1 Geometric Coastline Localization using Vision-Language Models 2026-06-09T06:37:37Z Coastline detection in remote sensing imagery is commonly formulated as a pixel-wise segmentation problem, where the final coastline is extracted from a predicted mask through post-processing. This formulation relegates coastline geometry, the primary representation used in coastal change analysis, to a secondary artifact rather than the learning objective. In practice, coastlines are defined by geomorphic proxies such as vegetation lines, dune toes, or cliff edges, rather than an instantaneous land-water boundary often used in pixel-based segmentation approaches. In this work, we revisit coastline extraction from a representation perspective and formulate the task as geometric boundary localization. We use the New Zealand Coastal Change Dataset (NZCCD) and high-resolution aerial imagery from Land Information New Zealand (LINZ) to develop CoastlineVLM-7B, a vision-language model (VLM) built on the GeoChat-7B/LLaVA-1.5 architecture that jointly performs coastline presence detection, proxy-type classification, and coastline grounding. The model directly predicts a coastline as a polyline rather than a dense segmentation mask. We evaluate CoastlineVLM-7B against segmentation baselines under strict one-pixel boundary supervision. Results show that geometry-based metrics are more suitable for assessing coastline localization quality than pixel-overlap metrics such as Intersection over Union (IoU). CoastlineVLM-7B improves global geometric alignment with reference coastlines, reducing Hausdorff distance from 37.74 m to 31.84 m and Earth Mover's Distance from 21.12 m to 17.32 m. 
These results indicate that output representation is a critical design choice in coastline extraction, and that geometry-oriented learning, combined with the semantic reasoning capabilities of vision-language models, aligns well with how coastlines are defined and evaluated in operational coastal monitoring. 2026-06-09T06:37:37Z Rafia Malik Bernhard Pfahringer Karin Bryan Mark Dickson Eibe Frank http://arxiv.org/abs/2601.08379v2 MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance 2026-06-09T06:35:29Z Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose \emph{MMD Guidance}, a training-free mechanism that augments the reverse diffusion process with gradients of the \textit{Maximum Mean Discrepancy (MMD)} between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. 
The project code is available at github.com/matinamehdizadeh/MMD-Guidance. 2026-01-13T09:42:57Z Matina Mahdizadeh Sani Nima Jamali Mohammad Jalali Farzan Farnia http://arxiv.org/abs/2507.02513v4 Automatic Labelling for Low-Light Pedestrian Detection 2026-06-09T06:33:57Z Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. Low-light pedestrian detection lacks large public datasets and auto-labelling pipelines. This research proposes a solution in the form of an automated infrared-RGB pipeline. The pipeline consists of 1) infrared detection, where a model fine-tuned for infrared pedestrian detection is used; 2) label transfer from the infrared detections to their RGB counterparts; and 3) training of object detection models on the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For evaluation, three object detection models, DETR, YOLO, and RCNN, were trained on generated and ground-truth labels. When compared on previously unseen images, the models trained on generated labels outperformed those trained on ground-truth labels in 5 out of 6 cases for the mAP@50 and LAMR metrics, and in all cases for mAP@50-95. The results indicate that the proposed auto-labelling pipeline could be used for scalable annotation of low-light datasets for pedestrian detection. 
The source code for this research is available on GitHub: https://github.com/BouzoulasDimitrios/IR-RGB-autoamed-low-light-pedestrian-labelling 2025-07-03T10:20:17Z Dimitrios Bouzoulas Eerik Alamikkotervo Risto Ojala http://arxiv.org/abs/2411.02817v2 Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs 2026-06-09T06:28:35Z Generative models guided by text prompts are widely evaluated for fidelity and prompt alignment, yet their ability to produce diverse outputs remains underexplored. Existing diversity metrics such as Vendi and RKE, which are based on the von Neumann and Rényi entropies of kernel matrices, were developed for unconditional models and cannot distinguish prompt-induced from model-induced variability. We address this gap by introducing \textit{Conditional-Vendi} and \textit{Conditional-RKE}, diversity measures derived from the conditional entropy of positive semidefinite matrices. These scores isolate model-induced diversity in prompt-guided generation, with Conditional-RKE enjoying an $O(1/\sqrt{n})$ convergence rate. For Conditional-Vendi, we introduce a truncated-spectrum approximation that yields scalable and consistent estimates. Experiments on text-to-image, image-captioning, and LLM tasks show that the conditional scores recover ground-truth diversity orderings and can also guide diffusion models toward more diverse samples. The codebase is available at https://github.com/mjalali/conditional-vendi. 2024-11-05T05:30:39Z Mohammad Jalali Azim Ospanov Amin Gohari Farzan Farnia http://arxiv.org/abs/2509.05913v3 A fine-grained attention and geometric correspondence model for musculoskeletal risk classification in athletes using multimodal visual and skeletal features 2026-06-09T06:10:37Z Musculoskeletal disorders pose significant risks to athletes, and early risk assessment is essential for prevention. 
However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research introduces ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework that classifies musculoskeletal risk using both visual and skeletal coordinate-based features. A custom multimodal dataset (MusDis-Sports) was created by combining images and skeletal coordinates, with each sample labeled into eight risk categories based on the Rapid Entire Body Assessment (REBA) system. ViSK-GAT integrates two innovative modules: the Fine-Grained Attention Module (FGAM), which refines intra-modal features through self-attention before fusion, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal alignment between image features and coordinates. The model achieved robust performance, with all key metrics exceeding 93%. Probability distribution error metrics also showed a low Root Mean Squared Error (RMSE) of 0.1205 and a Mean Absolute Error (MAE) of 0.0156. ViSK-GAT consistently outperformed state-of-the-art (SOTA) deep learning backbones and showed its potential to advance artificial intelligence-driven musculoskeletal risk assessment and enable timely interventions in sports. 2025-09-07T04:09:06Z Published in Computers and Electrical Engineering Computers and Electrical Engineering, Vol. 138, 111281, 2026 Md. Abdur Rahman Mohaimenul Azam Khan Raiaan Tamanna Shermin Md Rafiqul Islam Mukhtar Hussain Sami Azam 10.1016/j.compeleceng.2026.111281 http://arxiv.org/abs/2606.10450v1 Few-step Generative Models as Lossy Compression 2026-06-09T05:56:22Z DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. 
We study whether few-step generative models -- Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow -- can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime. 2026-06-09T05:56:22Z Fuma Kimishima Jinjia Zhou http://arxiv.org/abs/2606.10431v1 Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems 2026-06-09T05:15:25Z Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from the vision images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. 
However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalance issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints. 2026-06-09T05:15:25Z Accepted by TNNLS Shuangchun Gui Zhiguang Cao Wen Song Yew-Soon Ong http://arxiv.org/abs/2501.01481v2 Unleashing Correlation and Continuity for Hyperspectral Reconstruction from RGB Images 2026-06-09T05:03:32Z Reconstructing Hyperspectral Images (HSI) from RGB images can yield high spatial resolution HSI at a lower cost, demonstrating significant application potential. This paper reveals that local correlation and global continuity of the spectral characteristics are crucial for HSI reconstruction tasks. Therefore, we fully explore these inter-spectral relationships and propose a Correlation and Continuity Network (CCNet) for HSI reconstruction from RGB images. For the correlation of local spectrum, we introduce the Group-wise Spectral Correlation Modeling (GrSCM) module, which efficiently establishes spectral band similarity within a localized range. 
For the continuity of global spectrum, we design the Neighborhood-wise Spectral Continuity Modeling (NeSCM) module, which employs memory units to recursively model the progressive variation characteristics at the global level. In order to explore the inherent complementarity of these two modules, we design the Patch-wise Adaptive Fusion (PAF) module to efficiently integrate global continuity features into the spectral features in a patch-wise adaptive manner. These innovations enhance the quality of reconstructed HSI. We perform comprehensive comparison and ablation experiments on the mainstream datasets NTIRE2022 and NTIRE2020 for the spectral reconstruction task. Compared to the current advanced spectral reconstruction algorithms, our designed algorithm achieves State-Of-The-Art (SOTA) performance. 2025-01-02T15:14:40Z Fuxiang Feng Runmin Cong Shoushui Wei Yipeng Zhang Jun Li Sam Kwong Wei Zhang http://arxiv.org/abs/2606.10407v1 Time-frequency localization of bird calls in dense soundscapes 2026-06-09T04:31:30Z Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). 
These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes. 2026-06-09T04:31:30Z Simen Hexeberg Fanghui Tong Hari Vishnu Mandar Chitre http://arxiv.org/abs/2511.19706v2 Selective Disk Bispectrum: A Complete and Rotation Invariant Image Descriptor 2026-06-09T04:19:40Z Rotation invariance is a fundamental requirement across many computer vision tasks. Historically, this inductive bias has been encoded through hand-crafted rotation-invariant representations. These are compact, interpretable, and fast to compute, but they come at the cost of descriptive power. More recently, architectures have achieved this inductive bias through learned representations. These are highly descriptive and achieve strong empirical performance, at the cost of efficiency and interpretability. In this work, we propose an alternative at the intersection of both paradigms. We introduce the selective disk bispectrum (SDB), a complex-valued rotation-invariant vector that preserves all information about the image except its orientation. Our key theoretical contributions are the selective disk bispectrum, its inversion, its (reduced) spatial and computational complexities (compared to the full disk bispectrum), and its expectation and variance under noise. Furthermore, we propose a numerical SDB approximation and provide theoretical guarantees for its accuracy and rotation invariance. Empirically, we validate SDB's invariance and robustness to noise on classification tasks. We test our reconstruction algorithm on multi-reference alignment of rotated images. 2025-11-24T21:15:29Z Adele Myers Lantow Nina Miolane http://arxiv.org/abs/2606.10400v1 Do Vision-Language Models See or Guess? 
Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark 2026-06-09T04:18:38Z Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away. 
2026-06-09T04:18:38Z 17 pages, 7 figures, Submitted to EMNLP 2026 Pratham Singla Shivank Garg Vihan Singh Paras Chopra http://arxiv.org/abs/2606.10395v1 Efficient RWKV-based Representation Learning for 3D Point Clouds 2026-06-09T04:16:39Z The recent receptance weighted key value (RWKV) model uses RNN-style recurrence, offering a linear-complexity alternative to the quadratic self-attention of Transformers for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the \textbf{P-RWKV} block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance. 
2026-06-09T04:16:39Z Yun Liu Xuefeng Yan Liangliang Nan Xianzhi Li Peng Li Zhe Zhu Honghua Chen Mingqiang Wei http://arxiv.org/abs/2602.06886v4 Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers 2026-06-09T03:46:53Z Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality. 2026-02-06T17:19:53Z 19 pages Yuxuan Yao Yuxuan Chen Hui Li Kaihui Cheng Qipeng Guo Yuwei Sun Zilong Dong Jingdong Wang Siyu Zhu http://arxiv.org/abs/2512.06628v3 MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment 2026-06-09T03:43:13Z Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. 
Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis. 2025-12-07T02:28:06Z Ruicheng Zhang Mingyang Zhang Jun Zhou Xiaofan Liu Zunnan Xu Zhizhou Zhong Puxin Yan Haocheng Luo Xiu Li http://arxiv.org/abs/2606.10378v1 FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation 2026-06-09T03:40:52Z Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. 
A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for lost high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low SNR conditions, outperforming several state-of-the-art methods. The method segments the carotid artery accurately in ultrasound images and effectively identifies carotid atherosclerotic plaque. It is further validated on another task (breast cancer segmentation), suggesting good clinical potential for identifying abnormal tissue masses in ultrasound images. 2026-06-09T03:40:52Z Jiawei Liu Zhijiang Wan Junhua Hu Rongli Zhang Zhongbiao Xu Yankun Cao Yuan Chen Jin Hong
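For readers unfamiliar with the Dice score (DSC) reported in the FSS-Net entry above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), for two binary masks. It is not taken from the FSS-Net paper: the function name `dice_score`, the nested-list mask format, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks.

    Masks are nested lists of 0/1 values with the same shape
    (a hypothetical input format chosen for this sketch).
    """
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # |A ∩ B|: pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    # |A| + |B|: total foreground pixels across the two masks.
    total = sum(flat_pred) + sum(flat_target)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    pred = [[0, 1, 1],
            [0, 1, 0]]
    target = [[0, 1, 0],
              [0, 1, 1]]
    # 2 shared foreground pixels, 3 + 3 foreground pixels in total.
    print(dice_score(pred, target))
```

In this toy case the masks share 2 of their 3 foreground pixels each, so the score is 2·2/(3+3) = 2/3. A reported DSC of 96.46% corresponds to near-complete overlap between predicted and reference masks.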