https://arxiv.org/api/io40s89swj6S79Zg9nPJlhs1G5A
2026-06-22T15:09:28Z
112579
585
15
http://arxiv.org/abs/2510.16882v4
Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
2026-06-13T11:40:06Z
Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.
2025-10-19T15:32:01Z
ICML 2026 accepted paper
Heming Zou
Yixiu Mao
Yun Qu
Qi Wang
Xiangyang Ji
http://arxiv.org/abs/2606.13441v2
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
2026-06-13T11:39:47Z
Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.
2026-06-11T15:03:48Z
Joseph Keshet
http://arxiv.org/abs/2308.06035v4
Attention, not scale, drives human-AI alignment in multimodal language prediction
2026-06-13T10:19:41Z
Humans routinely draw on visual context to predict upcoming words. To what extent current vision-language models produce comparable behaviour is unclear. Here we placed five state-of-the-art pretrained systems side-by-side with 600 human participants in a web-based Visual-World Paradigm. On each of 100 six-second movie clips, models and participants received either text only or synchronised video and text and judged how likely a specified target word was to appear next; human eye movements were tracked throughout. Adding visual context increased model-human alignment in predictability ratings across all architectures (average Delta r = 0.18) with no impact of parameter size. When visual context was informative, transformer attention significantly increased alignment. Attention maps from two transformer models corresponded with human gaze, explaining up to 70% of the inter-participant variance when the scene contained informative cues. Notably, cross-modal attention reliably tracked anticipatory human fixations on semantic cues. These results suggest that current transformer-based vision-language models can approximate human behaviour exploiting visual context during language prediction - and that selective attention to informative cues, not sheer model scale, is the principal driver of this alignment.
2023-08-11T09:30:07Z
39 pages, 6 Figures, published in NPJ Artificial Intelligence
Viktor Kewenig
Andrew Lampinen
Samuel A. Nastase
Christopher Edwards
Quitterie Lacome D'Elascombe
Akilles Rechardt
Jeremy I Skipper
Gabriella Vigliocco
http://arxiv.org/abs/2602.22391v2
Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework
2026-06-13T09:57:16Z
Internet memes have become a dominant form of expression on social media, including within the Bengali speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn- HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyses both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. To facilitate reproducibility and future research, the Bn-HIB dataset has been made publicly available through Mendeley Data. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised
2026-02-25T20:40:25Z
Added public link to dataset and fixed typo in abstract
Rakib Ullah
Sylhet Engineering College
Mominul islam
Daffodil International University
Md Sanjid Hossain
Daffodil International University
Md Ismail Hossain
Daffodil International University
http://arxiv.org/abs/2606.11520v2
ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories
2026-06-13T09:55:32Z
Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the larger Qwen3-32B base model which is four times bigger. An ablation on Stage 2 proves multi-turn simulation brings a large portion of the performance gain. We release all source code and dataset at https://github.com/Valiere01/ISE-Trace.
2026-06-09T23:44:26Z
13 pages, 6 figures. Dataset and code: https://github.com/Valiere01/ISE-Trace
Siyuan Luo
Nairong Zheng
Lin Zhou
Tiankuo Yao
Shengyou Yuan
Haojia Yu
Cong Pang
Jiapeng Luo
Lewei Lu
http://arxiv.org/abs/2606.15216v1
Spokes: Optimizing for Diverse Pretraining Data Selection
2026-06-13T09:28:46Z
Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.
2026-06-13T09:28:46Z
9 pages, 4 figures
Clarence Lee
Yejin Choi
Luke Zettlemoyer
Pang Wei Koh
Hai Leong Chieu
http://arxiv.org/abs/2501.17615v2
Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition
2026-06-13T09:01:20Z
We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
2025-01-29T12:44:30Z
Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)
in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 4226-4238, 2025
Zhengdong Yang
Qianying Liu
Sheng Li
Fei Cheng
Chenhui Chu
10.1109/TASLPRO.2025.3617233
http://arxiv.org/abs/2605.29796v3
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
2026-06-13T08:39:57Z
Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe \textbf{over-search}, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.
2026-05-28T11:45:45Z
Yunbo Tang
Chengyi Yang
Shiyu Liu
Zhishang Xiang
Zerui Chen
Qinggang Zhang
Jinsong Su
http://arxiv.org/abs/2606.15191v1
AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani
2026-06-13T08:36:16Z
Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is however often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with a stronger Indian language coverage show higher bias for pan-Indian groups than hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.
2026-06-13T08:36:16Z
The 1st Workshop on Stereotypes Across Cultures in Language Technologies
Michelle Barbosa
Sebastian Padó
Franziska Weeber
http://arxiv.org/abs/2410.13439v6
Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning
2026-06-13T08:21:08Z
Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), multi-label relations are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) systematically formulate multi-label relations in MSCL, (ii) propose a novel Similarity-Dissimilarity Loss, which dynamically re-weights samples based on similarity and dissimilarity factors, (iii) further provide theoretically grounded proofs for our method through rigorous mathematical analysis that supports the formulation and effectiveness, and (iv) offer a unified form and paradigm for both single-label and multi-label supervised contrastive loss. We conduct experiments on both image and text modalities and further extend the evaluation to the medical domain. The results show that our method consistently outperforms baselines in comprehensive evaluations, demonstrating its effectiveness and robustness.
2024-10-17T11:12:55Z
Accepted by Transactions on Machine Learning Research (TMLR)
Guangming Huang
Yunfei Long
Cunjin Luo
http://arxiv.org/abs/2410.00812v3
Generative causal testing to bridge data-driven models and scientific theories in language neuroscience
2026-06-13T07:43:18Z
Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain from predictive models and then testing those explanations in follow-up experiments using LLM-generated stimuli.This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs), including newly identified microROIs in prefrontal cortex. We show that explanatory accuracy is closely related to the predictive power and stability of the underlying predictive models. Finally, we show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.
2024-10-01T15:57:48Z
Accepted to Nature Neuroscience, please cite that version
Richard Antonello
Chandan Singh
Shailee Jain
Aliyah Hsu
Sihang Guo
Jianfeng Gao
Bin Yu
Alexander Huth
http://arxiv.org/abs/2503.23688v2
Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions
2026-06-13T07:21:38Z
Large language models are how hundreds of millions of people now encounter contested political questions, raising a subtle measurement problem: a model that simply agrees with whatever it is told can masquerade as biased, contaminating any claim that models hold political opinions. We address this by importing balanced keying from survey psychometrics, posing each proposition and its swapped reverse and signing the response so acquiescence cancels and genuine conviction accumulates. The result is a reproducible, quantitative instrument that maps geopolitical stance across 11 models and 2 languages (19,712 responses). Developer origin, query language and issue domain emerge as three near-equal, additive factors; every model, including those built in the United States, leans more Pro-China in Mandarin; and two models with identical agreement bias are told apart, one neutral, one biased. We release it as an open, interactive tool that extends to any contested-opinion domain.
2025-03-31T03:38:17Z
37 pages, 6 main-text figures, 12 supplementary figures, 5 supplementary tables; supplementary information included
William Guey
Wei Zhang
Pierrick Bougault
Vitor D. de Moura
José O. Gomes
http://arxiv.org/abs/2606.15161v1
Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective
2026-06-13T07:16:16Z
The considerable layer-wise redundancy in large language models (LLMs) has established non-uniform sparsity allocation across layers as the standard pruning approach for efficient compression. Existing layer-wise allocation methods that estimate allocation strategy from local signals such as activation outliers or weight spectra mainly derive from local layer importance, whereas the final post-pruning performance is also influenced by the network's subsequent compensatory capacity. In this paper, we directly characterize this property through controlled perturbation experiments. We make the following empirical findings. First, layers exhibit highly heterogeneous responses to pruning-scale perturbations. In most cases, early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon. Under small perturbations the network exhibits amplification across all layers, and the transition to absorption occurs smoothly as perturbation magnitude grows to pruning scale. This enriches the linearized accumulation theory underlying related works. Building on these findings, we define an absorption coefficient per layer and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning by reducing perplexity by 7.13% and boosting zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.
2026-06-13T07:16:16Z
10 pages, 4 figures, 4 tables. Submitted to EMNLP 2026
Tao Jing
Ningxin Wu
Chen Kang
Dong Yu
Changliang Li
Pengyuan Liu
http://arxiv.org/abs/2606.15152v1
Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation
2026-06-13T06:44:31Z
Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce \textsc{\benchmarkname{}}, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.
2026-06-13T06:44:31Z
Shijun Wan
Xuehai Wu
Jiwen Zhang
Siyuan Wang
Zhongyu Wei
http://arxiv.org/abs/2606.15144v1
PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino
2026-06-13T06:12:56Z
Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.
2026-06-13T06:12:56Z
Submitted to EMNLP 2026
Jann Railey Montalan
David Demitri Africa
Jimson Paulo Layacan
Richell Isaiah Flores
Ivan Yuri De Leon
Lance Calvin Gamboa