https://arxiv.org/api/FWvYWO/W+PE0XVuZQZvKh8IMryA2026-04-07T08:37:17Z10574630015http://arxiv.org/abs/2604.02324v1Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation2026-04-02T17:59:19ZLanguage models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.2026-04-02T17:59:19ZDaiwei ChenZhoutong FuChengming JiangHaichao ZhangRan ZhouTan WangChunnan YaoGuoyao LiRui CaiYihan CaoRuijie JiangFedor BorisyukJianqiang ShenJingwei WuRamya Korlakai Vinayakhttp://arxiv.org/abs/2604.02322v1Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning2026-04-02T17:58:50ZLarge Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.2026-04-02T17:58:50Z43 pages, 5 figures, 24 tablesBangji YangHongbo MaJiajun FanGe Liuhttp://arxiv.org/abs/2604.02319v1No Single Best Model for Diversity: Learning a Router for Sample Diversity2026-04-02T17:58:37ZWhen posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce \textbf{diversity coverage}, a metric that measures the total quality scores assigned to each \textbf{unique} answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs $23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.2026-04-02T17:58:37Zunder review at COLM 2026Yuhan LiuFangyuan XuVishakh PadmakumarDaphne IppolitoEunsol Choihttp://arxiv.org/abs/2604.02309v1go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices2026-04-02T17:51:19ZDoubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.2026-04-02T17:51:19Z29 pages, 30 figures, 9 tables. Includes supplementary materialTorque DandachiSophia Diggs-Galliganhttp://arxiv.org/abs/2603.25638v2Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers2026-04-02T17:45:24ZThrough an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.2026-03-26T16:49:00ZVisualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/Mingmeng GengYuhang DongThierry Poibeauhttp://arxiv.org/abs/2604.01195v2ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget2026-04-02T17:27:04ZSearch agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation, or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question-answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4-5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experiment results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.2026-04-01T17:42:41ZPreprintNandan ThakurZijian ChenXueguang MaJimmy Linhttp://arxiv.org/abs/2510.22028v3Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics2026-04-02T17:12:49ZQuality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking. However, the prevalence and impact of length bias in QE metrics have been underexplored. Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a systematic preference for shorter translations when multiple candidates of comparable quality are available for the same source text. These biases risk unfairly penalizing longer, correct translations and can propagate into downstream pipelines that rely on QE signals for data selection or system optimization. We trace the root cause of learned QE metrics to skewed supervision distributions, where longer error-free examples are underrepresented in training data. As a diagnostic intervention, we apply length normalization during training and show that this simple modification effectively decouples error prediction from sequence length, yielding more reliable QE signals across translations of varying length.2025-10-24T21:22:06ZYilin ZhangWenda XuZhongtao LiuTetsuji NakagawaMarkus Freitaghttp://arxiv.org/abs/2603.10913v2LLM2Vec-Gen: Generative Embeddings from Large Language Models2026-04-02T17:09:12ZFine-tuning LLM-based text embedders via contrastive learning maps inputs and outputs into a new representational space, discarding the LLM's output semantics. We propose LLM2Vec-Gen, a self-supervised alternative that instead produces embeddings directly in the LLM's output space by learning to represent the model's potential response. Specifically, trainable special tokens are appended to the input and optimized to compress the LLM's own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 8.8% over the unsupervised embedding teacher. Since the embeddings preserve the LLM's response-space semantics, they inherit capabilities such as safety alignment (up to 22.6% reduction in harmful content retrieval) and reasoning (up to 35.6% improvement on reasoning-intensive retrieval). Finally, the learned embeddings are also interpretable: they can be decoded back into text to reveal their semantic content.2026-03-11T15:58:47ZParishad BehnamGhaderVaibhav AdlakhaFabian David SchmidtNicolas ChapadosMarius MosbachSiva Reddyhttp://arxiv.org/abs/2603.16880v2NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning2026-04-02T17:07:27ZElectroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.2026-02-24T21:10:42ZGuoan WangShihao YangJun-en DingHao ZhuFeng Liuhttp://arxiv.org/abs/2604.02276v1De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules2026-04-02T17:06:50ZRegulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.2026-04-02T17:06:50ZKeerat GulianiDeepkamal GillDavid LandsmanNima EshraghiKrishna KumarLovedeep Gondarahttp://arxiv.org/abs/2507.05660v2Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI2026-04-02T16:59:25ZCustomizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus2025-07-08T04:40:09ZAccepted at ACM CODASPY 2026Aravind CheruvuShravya KanchiSifat Muhammad AbdullahNicholas KongDaphne YaoMurtuza JadliwalaBimal Viswanathhttp://arxiv.org/abs/2601.03111v2One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning2026-04-02T16:27:06ZThe reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually rely on high-quality samples of large volumes. In this paper, we challenge conventional assumptions about data requirements in RL for LLMs by demonstrating the effectiveness of one-shot reinforcement learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary reasoning improvement. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) Analysis of salient mathematical skills provides insight into the characteristics associated with effective polymath samples; and (3) An engineered synthetic sample that integrates multidisciplinary elements and broader skill coverage achieves stronger performance than naturally occurring individual samples. Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlock enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed as sample engineering, toward precision engineering of samples that complements simply increasing data volume.2026-01-06T15:41:35ZYiyuan LiZhen HuangYanan WuWeixun WangXuefeng LiYijia LuoWenbo SuBo ZhengPengfei Liuhttp://arxiv.org/abs/2604.02403v1Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics2026-04-02T16:17:55ZThis paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables -- specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity.
Principal component analysis confirms that AI-related occupational measures span two distinct dimensions -- augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.2026-04-02T16:17:55ZWorking paper. 13 pages, 7 figures, 6 references. Part of the Cognitive Factor Economics research program. Code: https://github.com/Cespial/cognitive-factor-economicsCristian Espinal Mayahttp://arxiv.org/abs/2604.02217v1VISTA: Visualization of Token Attribution via Efficient Analysis2026-04-02T16:09:13ZUnderstanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit2026-04-02T16:09:13Z12 pages, 3 figuresSyed AhmedBharathi Vokkaliga GaneshJagadish Babu PKarthick SelvarajPraneeth TalluriSanket HingneAnubhav KumarAnushka YadavPratham Kumar VermaKiranmayee JanardhanMandanna A Nhttp://arxiv.org/abs/2604.02209v1CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech2026-04-02T16:02:20ZEnd-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech https://huggingface.co/datasets/Elyadata/CV18-NER.2026-04-02T16:02:20ZAccepted at OSACT 2026Youssef SaidiHaroun ElleuchFethi Bougares