https://arxiv.org/api/m5YehICni5GyfcSKvhy+AqWtKBk2026-06-22T20:48:41Z11257966015http://arxiv.org/abs/2606.14516v1Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results2026-06-12T14:47:37ZAI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning to date 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.2026-06-12T14:47:37ZJan BatznerSree Harsha NelaturuDamian StachuraAnastassia KornilovaJon CrallTommaso CerrutiYanan LongYifan MaiSanchit AhujaAsaf YehudaiMarek ŠuppaJohn P. LalorOluwagbemike OloweJatin GanhotraBrian H. HuEliya HabbaAndrew M. BeanChang LiuSander LandSteven DillmannAniketh GarikaparthiElron BandelSaki ImaiJames EdgellWm. Matthew KennedyJenny ChimPatrick MeuslingAsteria KaeberleinVenkata Ramachandra Karthik ChundiManasi PatwardhanMartin KuAustin MeekLeon KnauerBrian WingenrothSrishti YadavUsman GoharFelix FriedrichMichelle LinJennifer MickelArman CohanStella BidermanIrene SolaimanZeerak TalatAnka ReuelMubashara AkhtarGjergji KasneciAvijit GhoshLeshem Choshenhttp://arxiv.org/abs/2606.14512v1Fodor and Pylyshyn's Systematicity Challenge Still Stands2026-06-12T14:42:36ZThe recent successes of neural networks producing human-like language have caused significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, argues that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case that they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.2026-06-12T14:42:36ZAccepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paperMichael GoodaleSalvador Mascarenhashttp://arxiv.org/abs/2510.05150v3Chronological Thinking in Full-Duplex Spoken Dialogue Language Models2026-06-12T14:26:27ZRecent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.2025-10-02T10:28:11ZAccepted by SIGDIAL 2026Donghang WuHaoyang ZhangChen ChenTianyu ZhangFei TianXuerui YangGang YuHexin LiuNana HouYuchen HuEng Siong Chnghttp://arxiv.org/abs/2602.14169v2Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling2026-06-12T14:12:24ZEffective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $\textit{pivots}$-deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines. Code is available at https://github.com/AgentCombo/DEEP-GRPO2026-02-15T14:44:15ZYiran GuoZhongjian QiaoYingqi XieJie LiuDan YeRuiqing ZhangShuang QiuLijie Xuhttp://arxiv.org/abs/2606.14470v1GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge2026-06-12T14:02:37ZLarge language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost.
We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.2026-06-12T14:02:37Z10 pages, 1 figure, 9 tablesPavan C ShekarAbhishek H SAswanth Krishnanhttp://arxiv.org/abs/2605.28591v3Models That Know How Evaluations Are Designed Score Safer2026-06-12T14:01:36ZThe validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on five safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.2026-05-27T15:11:35ZKatharina DeckenbachHaritz PuertoJonas GeipingSahar Abdelnabihttp://arxiv.org/abs/2606.14460v1A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions2026-06-12T13:51:25ZTransformer-based clinical language models are increasingly integrated into high-stakes clinical decision support pipelines, yet the computational mechanisms through which demographic associations encoded in medical documentation propagate into model probability distributions remain empirically underspecified. We present a systematic computational audit of representational bias in ClinicalBERT (Alsentzer et al., 2019), a BERT-based model pretrained on MIMIC-III discharge summaries, employing two complementary probing methodologies: Log Probability Bias Analysis (LPBA), which quantifies demographic descriptor-induced shifts in masked token probability distributions across behavioral and evaluative semantic categories, and Masked Language Model-based analysis (MLM), which probes internal representational structure for demographic agency attribution encoding across 98 real clinical sentence templates and eight intersectional race-gender combinations. Corpus frequency analysis operationalizes the distinction between statistical disparity and bias amplification by benchmarking model outputs against empirical term frequencies in the MIMIC-III training corpus. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing, providing direct empirical evidence that representational bias in ClinicalBERT operates predominantly through model-internal amplification rather than training data inheritance. Keywords: natural language processing, clinical documentation, algorithmic auditing, representational bias, health equity 12026-06-12T13:51:25Z17 pages, 4 tables, appendices A-E, preprintKehinde Temitayo Soetanhttp://arxiv.org/abs/2606.14459v1MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition2026-06-12T13:50:09ZModern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.2026-06-12T13:50:09ZAccepted at Interspeech 2026Theresa Pekarek RosinMatthias KerzelStefan Wermterhttp://arxiv.org/abs/2508.06196v2EiCAP: Beyond Fluency, Probing and Improving Emotional Intelligence in LLMs via Psychologically Grounded Multi-Turn Dialogue2026-06-12T13:42:50ZLarge Language Models increasingly serve in emotionally sensitive roles, including mental health support, education, and crisis response, yet they lack a principled framework for assessing or improving Emotional Intelligence (EI). We introduce EiCAP, a unified, psychologically grounded six-layer EI taxonomy operationalized into two complementary resources. EiCAP-Bench is a multi-turn, one-vs-three forced-choice evaluation suite with 3,174 probes across 24 subcategories and cross-turn dependencies that reflect real conversational EI demands. EiCAP-SFT is a 152,820-dialogue supervision corpus aligned to the same taxonomy, enabling controlled, interpretable fine-tuning. Two key findings emerge. First, generic conversational supervised fine-tuning does not confer EI: fine-tuning on UltraChat yields no significant gain in any of the 24 subcategories, with a macro score of 24.6%, near the chance level of 25%. Second, applying EI-grounded LoRA, using approximately 0.8% of parameters, directly to Qwen-2.5-7B-Base achieves significant gains in all 24 subcategories, reaching a macro score of 75.33%, a gain of 51.7 percentage points over Base and 37.1 percentage points over Instruct. Crucially, an ablation shows that the UltraChat pre-stage is counterproductive, reducing performance by 21.4 percentage points: direct EI-grounded training is both necessary and sufficient.2025-08-08T10:22:19ZNizi NazarPardis Sadat ZahraeiDilek Hakkani-TürNatasa Milic-FraylingEhsaneddin Asgarihttp://arxiv.org/abs/2605.21363v2"I Didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration2026-06-12T13:42:35ZAs large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.2026-05-20T16:28:34ZEunsu KimJessica R. MindelKyungjin KimSherry Tongshuang Wuhttp://arxiv.org/abs/2606.12476v2Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics2026-06-12T13:24:39ZToken-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.2026-06-10T06:10:02Z16 pages, 1 figure. v2: added Discussion and Appendix; recall-honest framing; robustness analyses (k-NN divergence estimate, seed-averaged decomposition)Igor Itkinhttp://arxiv.org/abs/2410.15051v3Automatic identification of diagnosis from hospital discharge letters via weakly supervised Natural Language Processing2026-06-12T13:21:02ZIdentifying patient diagnoses from hospital discharge letters is essential for large-scale cohort selection and epidemiological research, but traditional supervised approaches require extensive manual annotation, which is often impractical for large textual datasets. We present a weakly supervised Natural Language Processing (NLP) pipeline for classifying Italian discharge letters without document-level manual annotation. The method extracts diagnosis-related sentences, generates semantic embeddings using a transformer model further pre-trained on Italian medical documents, and applies a two-level clustering procedure to derive weak labels that are then used to train a document-level classifier.
The approach was evaluated in a case study on bronchiolitis using 33,176 discharge letters of children admitted to 44 emergency rooms or hospitals in the Veneto Region, Italy, between 2017 and 2020. The best weakly supervised model achieved an AUROC of 77.68% ($\pm4.30\%$), an AUPRC of 73.13% ($\pm4.93\%$), and an F1-score of 78.14% ($\pm4.89\%$) against manually annotated data. Performance surpassed unsupervised baselines and approached fully supervised models, while reducing the need for manual annotation by more than 1,500 hours for a dataset of this size. Similar model rankings were observed in a secondary validation on a smaller bronchitis dataset (3,188 discharge letters, 2020-2025), where the best weakly supervised model achieved an AUPRC of 76.72% ($\pm 5.02\%$).
These results suggest the potential of weakly supervised NLP methods for scalable disease identification from clinical discharge letters.2024-10-19T09:42:20Z61 pages, 9 figuresVittorio TorriElisa BarbieriAnna CantaruttiCarlo GiaquintoFrancesca Ieva10.1038/s41598-026-56721-0http://arxiv.org/abs/2606.14420v1Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake2026-06-12T12:57:13ZHow do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman's (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.2026-06-12T12:57:13Z20 pages, 5 figures, 3 tables. To be submitted to Social Science Computer ReviewŞevval Çakıcıhttp://arxiv.org/abs/2606.14823v1Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation2026-06-12T12:56:38ZGenetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.2026-06-12T12:56:38ZVictoria Patersonhttp://arxiv.org/abs/2606.14820v1Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models2026-06-12T12:30:53ZRecent spatial self supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference to evaluate this. Using an equalization cancellation baseline and a GCC PHAT positive control we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD confirming binaural specificity. Two general purpose binaural SSL models exhibit minimal phase sensitivity while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general purpose binaural SSL models rely on spectro temporal interference textures rather than cross channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.2026-06-12T12:30:53ZAccepted to INTERSPEECH 2026; 6 pages, 3 figuresYuxuan ChenHaoyuan YuPeize He