http://arxiv.org/abs/2606.10533v1 Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning 2026-06-09T08:04:06Z

Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically with sequence length. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet their hard threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly ambiguous tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. AVEX-Prune uses an audio-visual token exchange strategy to select truly valuable tokens: it replaces low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality and measures how the token swaps change the generated caption. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).
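
As a rough illustration of why pruning helps, the following back-of-envelope sketch estimates the relative prefill self-attention cost after pruning, assuming the cost grows with the square of the sequence length. It is not the AVEX-Prune method; the function name and the token counts are hypothetical.

```python
def attention_cost_ratio(n_text: int, n_av: int, retention: float) -> float:
    """Pruned / full prefill attention cost, assuming cost ~ (sequence length)^2."""
    full_len = n_text + n_av
    # Only audio-visual tokens are pruned; text tokens are kept (an assumption).
    kept_len = n_text + round(n_av * retention)
    return kept_len ** 2 / full_len ** 2

# Hypothetical input: 100 text tokens and 2000 audio-visual tokens, 40% retention.
ratio = attention_cost_ratio(100, 2000, 0.40)
```

With no text tokens at all, a 40% retention ratio leaves 0.4 squared, or 16%, of the quadratic term; text tokens that cannot be pruned push the ratio slightly higher.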

2026-06-09T08:04:06Z Zihan Meng Dexiang Hong Weidong Chen Ziyu Zhou Bo Hu Zhendong Mao http://arxiv.org/abs/2606.10522v1 GUI-AC: Enhancing Continual Learning in GUI Agents 2026-06-09T07:52:10Z

Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage exploration. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at https://anonymous.4open.science/r/GUI-AC.
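
The two mechanisms can be pictured with a PPO-style clipped surrogate for a single sample. The abstract gives no formulas, so the sketch below is only an illustration under stated assumptions: `certainty` is taken to lie in [0, 1], the advantage is scaled by it, and the upper clip bound is widened by a hypothetical `eps_extra` when certainty is low. None of these choices is confirmed as GUI-AC's actual formulation.

```python
def clipped_surrogate(ratio, advantage, certainty, eps=0.2, eps_extra=0.1):
    """PPO-style clipped objective for one sample, with two assumed modifications."""
    # Assumed "adaptive advantage": low-certainty (noisy) estimates count less.
    adv = certainty * advantage
    # Assumed "dynamic clipping": relax the upper bound when certainty is low.
    upper = 1.0 + eps + eps_extra * (1.0 - certainty)
    lower = 1.0 - eps
    clipped_ratio = min(max(ratio, lower), upper)
    # Standard PPO pessimistic minimum over unclipped and clipped terms.
    return min(ratio * adv, clipped_ratio * adv)
```

With `certainty = 1.0` the function reduces to the standard PPO clipped objective with bound `eps`.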

2026-06-09T07:52:10Z Can Lin Tao Feng Hangjie Yuan Dan Zhang Yifan Zhu Zhonghong Ou http://arxiv.org/abs/2606.10517v1 LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching 2026-06-09T07:49:49Z

Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavior cloning, which tends to collapse inherently multimodal action distributions into unimodal ones, thereby degrading the pretrained latent action structure. While flow matching provides a potential alternative, directly applying it leads to a misalignment between latent actions and physical actions during action decoder training, due to the stochastic nature of the learned policy. To address these issues, we propose Latent Action Flow Policy (LAFP), which leverages flow matching for latent policy learning and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. Experimental results demonstrate that LAFP consistently outperforms prior methods on downstream imitation learning tasks, achieving up to 10-15% improvement in success rate while incurring less than 1x additional inference overhead.
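
For readers unfamiliar with flow matching, the sketch below shows the common linear-path formulation: a training pair consists of an interpolated point and a constant target velocity, and sampling integrates a velocity field with Euler steps. This is the generic technique, not LAFP itself; the function names are hypothetical, and the inference-time interpolation mechanism is not shown because the abstract does not specify it.

```python
def flow_matching_pair(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: a constant velocity field moves the noise sample x0 onto x1.
x_t, v = flow_matching_pair([0.0, 0.0], [2.0, 4.0], 0.25)
sample = euler_sample(lambda x, t: [2.0, 4.0], [0.0, 0.0])
```

In training, a network would regress `v_target` from `x_t` and `t`; because the start point is random noise, the sampled output is stochastic, which is the property the abstract links to the misalignment.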

2026-06-09T07:49:49Z Jiexi Lyu Xizhou Bu Qingqiu Huang Chufeng Tang Xiaoshuai Hao Hongbo Wang Wei Li http://arxiv.org/abs/2602.19086v2 Seal-Robust KCR: A Robust Kuzushiji Character Recognition Framework under Seal Interference 2026-06-09T07:45:57Z

Kuzushiji was one of the most widely used cursive writing systems in pre-modern Japan. Due to its highly cursive forms and extensive glyph variations, most modern Japanese readers are unable to read Kuzushiji characters. Consequently, recent studies have focused on developing automated Kuzushiji character recognition (KCR) methods, which have achieved strong performance on relatively clean Japanese historical document images. Although seals frequently appear in Japanese historical documents, existing methods often fail to maintain recognition accuracy under seal interference, particularly when seals overlap with characters. To address this challenge, we propose a seal-robust KCR framework. Based on character detection, classification, and ordering, the proposed framework additionally incorporates document restoration to mitigate seal interference, thereby improving overall recognition performance. In addition, we introduce a novel synthetic data augmentation strategy to enhance the performance of character detection models. We further correct annotation errors, reconstruct the dataset, and create a synthetic test set to simulate severe seal interference. Experimental results demonstrate the effectiveness of the proposed framework in mitigating the impact of seal interference on KCR. Compared with a conventional baseline and NDLkotenOCR, it achieves relative character error rate (CER) reductions of 39.7% and 5.9%, respectively, on the real test set, and 50.1% and 41.7%, respectively, on the synthetic test set.
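
The reported figures are relative CER reductions. As a minimal sketch of how such a number is computed, assuming CER is the character-level edit distance divided by the reference length (the usual definition; the abstract does not restate it), with hypothetical example values:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction by which the new CER is lower than the baseline CER."""
    return (baseline_cer - new_cer) / baseline_cer

# Hypothetical numbers: a CER drop from 0.20 to 0.15 is a 25% relative reduction.
example = relative_reduction(0.20, 0.15)
```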

2026-02-22T07:58:29Z Supplementary material is available at https://ruiyangju.github.io/Seal-Robust-KCR Rui-Yang Ju Kohei Yamashita Hirotaka Kameko Shinsuke Mori http://arxiv.org/abs/2603.04852v2 Non-Parametric Structural Priors for Geometry Theorem Prediction 2026-06-09T07:41:14Z

Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
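
One way to picture a precedence graph built from solution traces is sketched below: an edge a -> b is recorded whenever theorem b directly follows theorem a in a historical trace, and at inference only successors of the last applied theorem are offered as candidates. This is a simplified assumption about the construction; the theorem names and both helper functions are hypothetical, and the retrieval step and symbolic executor are omitted.

```python
from collections import defaultdict

def build_precedence_graph(traces):
    """Directed graph with an edge a -> b if b directly follows a in some trace."""
    graph = defaultdict(set)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            graph[a].add(b)
    return graph

def allowed_next(graph, last_theorem, library):
    """Candidate theorems after `last_theorem`, restricted to the current library."""
    if last_theorem is None:
        return set(library)  # no constraint before the first step
    return graph.get(last_theorem, set()) & set(library)

# Hypothetical traces and theorem names.
traces = [
    ["parallel_lines", "alternate_angles", "triangle_sum"],
    ["isosceles_base", "triangle_sum"],
]
library = ["parallel_lines", "alternate_angles", "triangle_sum", "isosceles_base"]
graph = build_precedence_graph(traces)
candidates = allowed_next(graph, "parallel_lines", library)
```

Because the graph is rebuilt from traces rather than learned, adding a theorem to the library only requires new traces, not retraining.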

2026-03-05T06:08:50Z Junbo Zhao Ting Zhang Can Li Wei He Jingdong Wang Hua Huang http://arxiv.org/abs/2605.07415v2 ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring 2026-06-09T07:39:39Z

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing benchmarks related to chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

2026-05-08T08:07:11Z Tianhao Niu Ziyu Han Xuan Dong Qingfu Zhu Wanxiang Che http://arxiv.org/abs/2604.06893v4 Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models 2026-06-09T07:38:03Z

Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
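
The unary-plus-pairwise structure resembles a classic grid energy. The sketch below evaluates such an energy for a binary keep/drop mask; the exact functional form, the sign convention (negative unary cost meaning "worth keeping"), and the weight `lam` are assumptions for illustration, not ERSM's definition, and the differentiable relaxation is omitted.

```python
def mask_energy(mask, unary, lam=0.5):
    """Energy = sum of unary costs of kept cells + lam * number of mask
    disagreements between 4-connected neighbours (assumed form)."""
    h, w = len(mask), len(mask[0])
    energy = 0.0
    for i in range(h):
        for j in range(w):
            energy += unary[i][j] * mask[i][j]  # unary importance cost
            if i + 1 < h:                       # vertical neighbour
                energy += lam * abs(mask[i][j] - mask[i + 1][j])
            if j + 1 < w:                       # horizontal neighbour
                energy += lam * abs(mask[i][j] - mask[i][j + 1])
    return energy

# Toy 2x2 example: keeping only the top-left cell trades its negative
# unary cost against two neighbour disagreements.
unary = [[-1.0, 2.0], [2.0, 2.0]]
e_single = mask_energy([[1, 0], [0, 0]], unary)
e_all = mask_energy([[1, 1], [1, 1]], unary)
```

The pairwise term is what discourages isolated kept cells, so low-energy masks tend to form spatially coherent regions.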

2026-04-08T09:48:31Z 8 pages Tom Devynck Bilal Faye Djamel Bouchaffra Nadjib Lazaar Hanane Azzag Mustapha Lebbah http://arxiv.org/abs/2512.08180v2 GeoLoom: High-quality Geometric Diagram Generation from Textual Input 2026-06-09T07:26:39Z

High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into GeoLingua, a purpose-built, generation-oriented formal language, and a coordinate solver that maps formal constraints to precise coordinates using efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.
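
To illustrate what mapping constraints to coordinates by Monte Carlo optimization can look like, the sketch below places a point C so that its distances to two fixed points match target lengths, using a simple random-perturbation search that keeps only improvements. The constraint set, step size, and function names are hypothetical; GeoLoom's actual solver and the GeoLingua syntax are not reproduced here.

```python
import math
import random

def violation(c, a=(0.0, 0.0), b=(4.0, 0.0), d_ac=3.0, d_bc=5.0):
    """Squared deviation from the constraints |AC| = d_ac and |BC| = d_bc."""
    return (math.dist(a, c) - d_ac) ** 2 + (math.dist(b, c) - d_bc) ** 2

def monte_carlo_solve(loss, start, iters=2000, step=0.5, seed=0):
    """Random search: perturb the best point and accept only improvements."""
    rng = random.Random(seed)
    best, best_loss = start, loss(start)
    for _ in range(iters):
        cand = (best[0] + rng.gauss(0, step), best[1] + rng.gauss(0, step))
        cand_loss = loss(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss

start = (1.0, 1.0)
solution, final_loss = monte_carlo_solve(violation, start)
```

The residual `final_loss` plays the role of a constraint-violation score: zero means every constraint is met exactly, as it is for the right-triangle point (0, 3) in this toy setup.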

2025-12-09T02:22:23Z Xiaojing Wei Ting Zhang Wei He Jingdong Wang Hua Huang http://arxiv.org/abs/2606.10492v1 PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation 2026-06-09T07:16:23Z

The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax), which exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves speedup ratios of 4.14×, 3.95×, and 4.18×, respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods such as GSD and LANTERN in speedup ratio. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at https://github.com/Haodong-Lei-Ray/PathSpec.
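
As background, plain Jacobi decoding treats greedy autoregressive decoding as a fixed-point problem: all draft positions are refreshed in parallel from the previous iterate until nothing changes. The sketch below shows this deterministic baseline with a toy `next_token` function standing in for a model; the speculative, multi-path, and relaxed-verification parts of the proposed method are not implemented, and all names are hypothetical.

```python
def greedy_decode(next_token, prefix, n):
    """Reference sequential decoding: one token per step."""
    seq = list(prefix)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prefix):]

def jacobi_decode(next_token, prefix, n, init=0):
    """Refresh all n draft positions from the previous iterate until a fixed point."""
    draft = [init] * n
    iters = 0
    while True:
        # In a real model these n calls would be one batched forward pass.
        new = [next_token(list(prefix) + draft[:i]) for i in range(n)]
        iters += 1
        if new == draft:
            return draft, iters
        draft = new

# Toy deterministic "model": next token depends on the whole context.
def toy_next_token(seq):
    return (3 * sum(seq) + len(seq)) % 7

reference = greedy_decode(toy_next_token, [1, 2], 6)
result, iterations = jacobi_decode(toy_next_token, [1, 2], 6)
```

After k iterations the first k positions are guaranteed correct, so the loop ends within n + 1 iterations; any speedup comes from iterations in which several positions become correct at once.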

2026-06-09T07:16:23Z 10 pages, 5 figures Haodong Lei Hongsong Wang Bingxuan Dai Pan Zhou http://arxiv.org/abs/2604.22192v2 CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution 2026-06-09T07:14:26Z

Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
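
The information-invariance idea can be sketched as a reward equal to the fraction of questions on which an inspector gives the same answer for the original and the generated chart. The sketch below uses dictionaries as stand-in charts and a lookup function as a stand-in inspector; both are hypothetical simplifications, and CharTide's actual question generation and Inspector model are not shown.

```python
def inquiry_reward(questions, inspector, original_chart, generated_chart):
    """Fraction of questions answered identically on both charts."""
    if not questions:
        return 0.0
    agree = sum(
        inspector(original_chart, q) == inspector(generated_chart, q)
        for q in questions
    )
    return agree / len(questions)

# Toy stand-ins: charts are dicts and the "inspector" is a plain lookup.
def toy_inspector(chart, question):
    return chart.get(question)

original = {"title": "Sales", "max_bar": "Q3", "num_series": 2}
generated = {"title": "Sales", "max_bar": "Q2", "num_series": 2}
reward = inquiry_reward(list(original), toy_inspector, original, generated)
```

Here two of the three atomic questions agree, so the reward is 2/3; the mismatch on `max_bar` is exactly the kind of error a pixel-level or rule-based score could miss.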

2026-04-24T03:39:51Z Accepted to ACL 2026 Main Xiangxi Zheng Kuang He Jiayi Hu Ping Yu Rui Yan Yuan Yao Peng Hou Anxiang Zeng Alex Jinpeng Wang http://arxiv.org/abs/2606.10488v1 5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning 2026-06-09T07:03:43Z

Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods have proven effective in practice, the principles behind them remain under-explored. We therefore investigate the generalization mechanisms underlying various PEFT methods and how they can be further enhanced. In this paper, we reveal a flatness preference that is widely present across PEFT methods, where a small fraction of sharp dimensions dominates the generalization of PEFT. This finding suggests an appealing possibility: better generalization may be achievable by attending only to this small fraction of sharp dimensions instead of all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharp dimensions, leading various PEFT methods toward better generalization. Extensive experiments demonstrate the effectiveness of our findings and the proposed method. Code is available at https://github.com/Can-Lin/FlatPO.

2026-06-09T07:03:43Z Yifan Zhu Can Lin Hangjie Yuan Zixiang Zhao Pengfei Zhang Tao Feng Zhonghong Ou http://arxiv.org/abs/2505.11034v2 CleanPatrick: A Benchmark for Image Data Cleaning 2026-06-09T06:59:42Z

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (32%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and employs standard ranking metrics that mirror real audit workflows. We benchmark classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, FINE, BHN, and SelfClean. On CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and detecting implausible labels under conservative human judgment remains challenging for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies.
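
Framing issue detection as ranking means each method outputs a score per sample, and the evaluation asks how early the true issues appear in the sorted list. The sketch below computes average precision, one standard ranking metric; the abstract does not list which metrics the benchmark uses, so this choice and the example scores are illustrative assumptions.

```python
def average_precision(scores, labels):
    """Mean of precision values at the rank of each true issue (label == 1)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, 1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

# Hypothetical detector scores; label 1 marks a confirmed issue.
ap = average_precision([0.9, 0.8, 0.1], [1, 0, 1])
```

This mirrors an audit workflow: a reviewer inspects samples from the top of the ranking, so a metric that rewards issues ranked early reflects the review budget actually spent.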

2025-05-16T09:29:41Z Accepted at Journal of Data-centric Machine Learning Research (DMLR) Fabian Gröger Simone Lionetti Philippe Gottfrois Alvaro Gonzalez-Jimenez Ludovic Amruthalingam Elisabeth Victoria Goessinger Hanna Lindemann Marie Bargiela Marie Hofbauer Omar Badri Philipp Tschandl Arash Koochek Matthew Groh Alexander A. Navarini Marc Pouly http://arxiv.org/abs/2601.06997v2 ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction 2026-06-09T06:46:46Z

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .
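
The difference between greedy next-best-view selection and multi-step lookahead can be shown on a small graph. The sketch below exhaustively scores paths up to a fixed depth by summed gain minus movement cost; treating information gain as additive per node is a simplifying assumption, and the graph, gains, and function name are hypothetical rather than ObjSplat's planner.

```python
def best_path(graph, gain, start, depth, cost_weight=1.0):
    """Exhaustive lookahead over simple paths; graph[u] = {v: movement_cost}."""
    best = (float("-inf"), [start])

    def dfs(node, path, visited, score):
        nonlocal best
        options = [v for v in graph.get(node, {}) if v not in visited]
        if len(path) - 1 == depth or not options:
            if score > best[0]:
                best = (score, path)
            return
        for v in options:
            step_score = gain[v] - cost_weight * graph[node][v]
            dfs(v, path + [v], visited | {v}, score + step_score)

    dfs(start, [start], {start}, 0.0)
    return best

# Toy viewpoint graph: "a" looks better for one step, "b" leads to a high-gain view.
graph = {"s": {"a": 1.0, "b": 1.0}, "a": {"c": 1.0}, "b": {"d": 1.0}}
gain = {"a": 3.0, "b": 2.0, "c": 0.5, "d": 10.0}
score, path = best_path(graph, gain, "s", depth=2)
```

A one-step greedy choice would move to `a` (net 2.0 versus 1.0), while the two-step lookahead prefers the path through `b`.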

2026-01-11T17:14:33Z Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/ Yuetao Li Zhizhou Jia Yu Zhang Qun Hao Shaohui Zhang 10.1109/TASE.2026.3700105 http://arxiv.org/abs/2606.10478v1 3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis 2026-06-09T06:46:29Z

Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

2026-06-09T06:46:29Z Preprint. 24 pages, 11 figures Yuhao Wang Puyi Wang Linjie Li Zhengyuan Yang Kevin Qinghong Lin Yu Cheng http://arxiv.org/abs/2606.11269v1 Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment 2026-06-09T06:38:36Z

Personality assessment aims to infer stable personality traits from dynamic behaviors across language, voice, and facial cues. Since different personality dimensions are revealed through distinct behavioral perspectives, modeling trait-specific evidence is challenging. However, most existing approaches adopt a uniform multimodal fusion strategy across all dimensions, assuming identical modality contributions. This overlooks trait-specific modality preferences and introduces cross-modal interference. To address this issue, we propose a novel personality assessment framework called Traits Run Deeper, which consists of three components. Specifically, the Multimodal Foundation Representation (MFR) module constructs personality-oriented multimodal inputs and leverages psychology-informed semantic templates as anchors, enabling foundation models to capture trait-relevant information. Building upon MFR, the Trait-Specific Modality Fusion (TSMF) module acts as an asymmetric fusion mechanism, allowing each dimension to selectively exploit different modality pathways from modality-specific modeling to complementary fusion. Thus, TSMF captures heterogeneous modality preferences while reducing cross-modal contamination. Furthermore, the Distribution-Calibrated Personality Regression (DCPR) module mitigates label imbalance and central tendency bias through target distribution calibration, improving robustness and stability. Experimental results on the AVI Challenge 2026 validation set demonstrate the effectiveness of the proposed framework, reducing mean squared error (MSE) by approximately 25% compared with the baseline. Consistent improvements are observed on the official test set, where our method achieves the best performance and ranks first in the Personality Assessment Track. The source code will be made available at https://github.com/MSA-LMC/AVI2026.

2026-06-09T06:38:36Z Jia Li Qian Chen Wei Wang Xinyu Li Zhenzhen Hu Dongsheng Shao Richang Hong Meng Wang