https://arxiv.org/api/9zbuvS0f0LvhN2mdJiVRcHkrbH0 2026-03-20T10:37:53Z 168181 30 15 http://arxiv.org/abs/2512.02906v3 MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding 2026-03-19T16:35:02Z Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD. 2025-12-02T16:22:01Z Accepted to CVPR 2026 Fan Yang Xingping Dong Xin Yu Wenhan Luo Wei Liu Kaihao Zhang http://arxiv.org/abs/2602.21814v2 Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem 2026-03-19T16:28:03Z Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference. We present a variable isolation study (n=20 per condition, 6 conditions, 120 total trials) examining which prompt architecture layers in a production system enable correct reasoning. Using Claude 3.5 Sonnet with controlled hyperparameters (temperature 0.7, top_p 1.0), we find that the STAR (Situation-Task-Action-Result) reasoning framework alone raises accuracy from 0% to 85% (p=0.001, Fisher's exact test, odds ratio 13.22). Adding user profile context via vector database retrieval provides a further 10 percentage point gain, while RAG context contributes an additional 5 percentage points, achieving 100% accuracy in the full-stack condition. These results suggest that structured reasoning scaffolds -- specifically, forced goal articulation before inference -- matter substantially more than context injection for implicit constraint reasoning tasks. 2026-02-25T11:40:15Z 9 pages, 4 tables Heejin Jo http://arxiv.org/abs/2603.19101v1 FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning 2026-03-19T16:26:13Z FL has emerged as a transformative paradigm for ITS, notably camera-based Road Condition Classification (RCC). However, by enabling collaboration, FL-based RCC exposes the system to adversarial participants launching Targeted Label-Flipping Attacks (TLFAs). Malicious clients (vehicles) can relabel their local training data (e.g., from an actual uneven road to a wrong smooth road), consequently compromising global model predictions and jeopardizing transportation safety. Existing countermeasures against such poisoning attacks fail to maintain resilient model performance near the necessary attack-free levels in various attack scenarios due to: 1) not tailoring poisoned local model detection to TLFAs, 2) not excluding malicious vehicular clients based on historical behavior, and 3) not remedying the already-corrupted global model after exclusion. To close this research gap, we propose FedTrident, which introduces: 1) neuron-wise analysis for local model misbehavior detection (notably including attack goal identification, critical feature extraction, and GMM-based model clustering and filtering); 2) adaptive client rating for client exclusion according to the local model detection results in each FL round; and 3) machine unlearning for corrupted global model remediation once malicious clients are excluded during FL. Extensive evaluation across diverse FL-RCC models, tasks, and configurations demonstrates that FedTrident can effectively thwart TLFAs, achieving performance comparable to that in attack-free scenarios and outperforming eight baseline countermeasures by 9.49% and 4.47% for the two most critical metrics. Moreover, FedTrident is resilient to various malicious client rates, data heterogeneity levels, complicated multi-task, and dynamic attacks. 2026-03-19T16:26:13Z Sheng Liu Panos Papadimitratos http://arxiv.org/abs/2603.19100v1 LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling 2026-03-19T16:26:11Z Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to \emph{differing electrode topologies} and \emph{computational scalability}, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose \textbf{LuMamba} (\textbf{L}atent \textbf{U}nified \textbf{Mamba}), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA's learned-query cross-attention mechanism for channel unification~\cite{luna}, and FEMBA's bidirectional Mamba blocks for efficient temporal modeling~\cite{femba}. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99\% balanced accuracy on TUAB and achieves state-of-art performance on Alzheimer's detection (0.97 AUPR), while requiring \textbf{377$\times$ fewer FLOPS} than state-of-art models at equivalent sequence lengths and scaling to \textbf{12$\times$ longer sequences} before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation 2026-03-19T16:26:11Z 5 pages, 2 figures, 4 tables DanaƩ Broustail Anna Tegon Thorir Mar Ingolfsson Yawei Li Luca Benini http://arxiv.org/abs/2603.19097v1 DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering 2026-03-19T16:23:04Z Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3\% in average EM score over the strongest baseline. 2026-03-19T16:23:04Z Accepted by ICASSP 2026 Yilin Wang Yuchun Fan Jiaoyang Li Ziming Zhu Yongyu Mu Qiaozhi He Tong Xiao Jingbo Zhu http://arxiv.org/abs/2603.19092v1 SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues 2026-03-19T16:18:00Z Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems. 2026-03-19T16:18:00Z Carlos Hinojosa Clemens Grange Bernard Ghanem http://arxiv.org/abs/2603.19087v1 Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity 2026-03-19T16:12:05Z Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativity: forcing creators to draw an analogy from a random, remote source domain (''cross-domain mapping''). Human participants and LLMs generated novel features for ten daily products (e.g., backpack, TV) under two prompts: (i) cross-domain mapping, which required translating a property from a randomly assigned source (e.g., octopus, cactus, GPS), and (ii) user-need, which required proposing innovations targeting unmet user needs. We show that humans reliably benefit from randomly assigned cross-domain mappings, while LLMs, on average, generate more original ideas than humans and do not show a statistically significant effect of cross-domain mappings. However, in both systems, the impact of cross-domain mapping increases when the inspiration source becomes more semantically distant from the target. Our results highlight both the role of remote association in creative ideation and systematic differences in how humans and LLMs respond to the same intervention for creativity. 2026-03-19T16:12:05Z Qiawen Ella Liu Marina Dubova Henry Conklin Takumi Harada Thomas L. Griffiths http://arxiv.org/abs/2603.19074v1 CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem 2026-03-19T15:59:45Z Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real-world tests on a mobile robot platform demonstrate its practical applicability. 2026-03-19T15:59:45Z 9 pages, 3 figures Fengxiaoxiao Li Xiao Mao Mingfeng Fan Yifeng Zhang Yi Li Tanishq Duhan Guillaume Sartoretti http://arxiv.org/abs/2603.19066v1 Parallelograms Strike Back: LLMs Generate Better Analogies than People 2026-03-19T15:56:24Z Four-term word analogies (A:B::C:D) are classically modeled geometrically as ''parallelograms,'' yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from (Peterson et al., 2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently. 2026-03-19T15:56:24Z Qiawen Ella Liu Raja Marjieh Jian-Qiao Zhu Adele E. Goldberg Thomas L. Griffiths http://arxiv.org/abs/2510.16001v2 An Order-Sensitive Conflict Measure for Random Permutation Sets 2026-03-19T15:53:31Z Random permutation set (RPS) is a new formalism for reasoning with uncertainty involving order information. Measuring the conflict between two pieces of evidence represented by permutation mass functions remains an open issue in order-dependent uncertain information fusion. This paper analyzes conflicts in RPS from two different perspectives: random finite set (RFS) and Dempster-Shafer theory (DST). From the DST perspective, the order information incorporated into focal sets reflects a qualitative propensity where higher-ranked elements are more significant. Motivated by this view and observations on permutations, we define a non-overlap-based inconsistency measure for permutations and develop an order-sensitive conflict measure for RPSs. The proposed method reformulates the conflict in RPSs as a graded, order-dependent notion rather than a simple dichotomy of conflict versus non-conflict. Numerical examples are presented to validate the behavior and properties of the proposed conflict measure. The proposed method not only exhibits an inherent top-weightedness property and effectively quantifies conflict between RPSs within the DST framework, but also provides decision-makers with flexibility in selecting weights, parameters, and truncation depths. 2025-10-14T09:35:03Z Ruolan Cheng Yong Deng http://arxiv.org/abs/2603.19054v1 Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding 2026-03-19T15:47:50Z Recent advances in Streaming Video Understanding has enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints. 2026-03-19T15:47:50Z Yikai Zheng Xin Ding Yifan Yang Shiqi Jiang Hao Wu Qianxi Zhang Weijun Wang Ting Cao Yunxin Liu http://arxiv.org/abs/2603.19042v1 Man and machine: artificial intelligence and judicial decision making 2026-03-19T15:38:50Z The integration of artificial intelligence (AI) technologies into judicial decision-making - particularly in pretrial, sentencing, and parole contexts - has generated substantial concerns about transparency, reliability, and accountability. At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids. Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI's role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human judges, and the nature of AI+human interactions. Across the fields of computer science, economics, law, criminology and psychology, researchers have made significant progress in evaluating the predictive validity of automated risk assessment instruments, documenting biases in judicial decision-making, and, to a more limited extent, examining how judges use algorithmic recommendations. While the existing empirical evidence indicates that the impact of AI decision aid tools on pretrial and sentencing decisions is modest or inexistent, our review also reveals important gaps in the canvassed literatures. Further research is needed to evaluate the performance of AI risk assessment instruments, understand how judges navigate noisy decision making environments and how individual characteristics influence judges' responses to AI advice. We argue that AI vs Human comparisons have the potential to yield new insights into both algorithmic tools and human decision-makers and advocate greater interdisciplinary integration and cross-fertilization in future research. 2026-03-19T15:38:50Z Arthur Dyevre Ahmad Shahvaroughi http://arxiv.org/abs/2403.02482v2 Heuristic Multiobjective Discrete Optimization using Restricted Decision Diagrams 2026-03-19T15:38:09Z Decision diagrams (DDs) have emerged as a state-of-the-art method for exact multiobjective integer linear programming. When the DD is too large to fit into memory or the decision-maker prefers a fast approximation to the Pareto frontier, the complete DD must be restricted to a subset of its states (or nodes). We introduce new node-selection heuristics for constructing restricted DDs that produce a high-quality approximation of the Pareto frontier. Depending on the structure of the problem, our heuristics are based on either simple rules, machine learning with feature engineering, or end-to-end deep learning. Experiments on multiobjective knapsack, set packing, and traveling salesperson problems show that our approach is highly effective, recovering over 85% of the Pareto frontier while achieving 2.5x speedups over exact DD enumeration on average, with very few non-Pareto solutions. The code is available at https://github.com/rahulptel/HMORDD. 2024-03-04T21:04:54Z To appear in the proceedings of CPAIOR 2026 Rahul Patel Elias B. Khalil David Bergman http://arxiv.org/abs/2603.17759v2 Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor 2026-03-19T15:34:50Z Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased. 2026-03-18T14:21:10Z Ahmed Sharshar Hosam Elgendy Saad El Dine Ahmed Yasser Rohaim Yuxia Wang http://arxiv.org/abs/2511.23253v2 AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture 2026-03-19T15:28:24Z Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT. 2025-11-28T15:02:19Z Yibin Wen Qingmei Li Zi Ye Jiarui Zhang Zurong Mai Jing Wu Shuohong Lou Yuhang Chen Henglian Huang Xiaoya Fan Yang Zhang Defeng Gu Lingyuan Zhao Yutong Lu Haohuan Fu Jianxi Huang Juepeng Zheng