Contemporary AI lacks the imagination to diverge or negate in science

2026-06-09T03:31:39Z

Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.

The Power of Altruism in Sticker Economics: Generosity Minimizes Collective Costs and Overprotective Norms Fuel Inefficiency

2026-06-09T02:24:59Z

Collecting the FIFA World Cup sticker album presents a classic public-goods and collective-action dilemma, in which completing a collection on one's own is highly inefficient. To evaluate how localized community norms shape collective efficiency, we use agent-based modeling and Monte Carlo simulations, parameterized with empirical field observations from exchange meetups in Natal, Brazil. Reflecting the tournament's recent expansion, the Panini 2026 album features 980 individual stickers, including 68 metallic specials. We contrast a standard baseline economy (1:2 special-to-normal exchange ratio) with an overprotective, strict strategy (exclusive special-for-special trading) and an altruistic, generous strategy (in which advanced players surrender needed duplicates to assist peers). Our findings reveal that overprotective rules trap liquidity and drive network-wide inefficiency. The strict strategy increases median completion costs by 10 packs and severely penalizes the least fortunate 5% of collectors, adding 20 packs in large cities and 30 in small communities. Conversely, widespread generosity optimizes network liquidity and dramatically compresses the long tail of bad luck. Introducing the generous strategy reduces required purchases for the 5th percentile by 90 packs in large-scale configurations and 130 packs in smaller clusters. Furthermore, widespread altruism triggers a strong functional coupling that effectively synchronizes completion rates across the network. This study demonstrates that while rigid, protective norms degrade collective welfare, generosity successfully mitigates pack-draw variance, transforming an expensive, isolated hobby into a resilient, highly efficient public good.
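
To give a feel for why solo collecting is so inefficient, the following is a minimal Monte Carlo sketch of a lone collector who never trades. It is not the paper's agent-based model: the pack size of 7, the uniform draw with replacement, and the names `packs_to_complete` and `simulate` are illustrative assumptions, and specials are not treated separately.

```python
import random
import statistics

ALBUM_SIZE = 980  # number of stickers, taken from the abstract
PACK_SIZE = 7     # assumption: stickers per pack (not stated in the abstract)


def packs_to_complete(album_size: int, pack_size: int, rng: random.Random) -> int:
    """Count the packs a lone collector buys before owning every sticker."""
    owned = set()
    packs = 0
    while len(owned) < album_size:
        packs += 1
        # Assumption: each sticker is drawn uniformly, with replacement.
        for _ in range(pack_size):
            owned.add(rng.randrange(album_size))
    return packs


def simulate(n_runs: int, seed: int = 0,
             album_size: int = ALBUM_SIZE, pack_size: int = PACK_SIZE) -> dict:
    """Run independent solo collectors and summarize their pack counts."""
    rng = random.Random(seed)
    costs = sorted(packs_to_complete(album_size, pack_size, rng)
                   for _ in range(n_runs))
    return {
        "min": costs[0],
        "median": statistics.median(costs),
        "p95": costs[int(0.95 * (n_runs - 1))],  # unlucky tail of the distribution
        "max": costs[-1],
    }


if __name__ == "__main__":
    print(simulate(n_runs=200))
```

The gap between the median and the upper tail in such a run is the "long tail of bad luck" that trading norms either compress or widen; modeling exchanges would require adding agents and swap rules on top of this baseline.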

Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

2026-06-09T02:10:17Z

Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
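
The abstract describes the MSI as the probability of biased output at each tier but does not give its formula. One plausible reading, sketched below under that assumption, is the per-tier fraction of responses labeled as biased. The record layout `(tier, is_biased)` and the function name `msi_by_tier` are hypothetical.

```python
from collections import defaultdict


def msi_by_tier(records):
    """Return {tier: fraction of responses labeled biased}.

    records: iterable of (tier, is_biased) pairs, where tier is an integer
    (1..7 in the abstract's stress test) and is_biased is a boolean label.
    Assumption: MSI at a tier is the empirical probability of a biased output.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [biased, total]
    for tier, is_biased in records:
        counts[tier][0] += int(bool(is_biased))
        counts[tier][1] += 1
    return {tier: biased / total
            for tier, (biased, total) in sorted(counts.items())}


if __name__ == "__main__":
    demo = [(1, False), (1, False), (5, True), (5, True), (5, False), (5, True)]
    print(msi_by_tier(demo))
```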

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

2026-06-09T01:55:56Z

Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We argue that watermarking is part of the pluralistic alignment pipeline and should be held to the same evaluation standards. We connect this to governance frameworks currently mandating watermarking deployment without requiring fairness evaluation. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.
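
As an illustration of what disaggregated detection metrics could look like in practice, the sketch below computes the watermark detection rate per group (for example, per language) and the gap between the best- and worst-served groups. The input layout and the names `detection_rates` and `parity_gap` are assumptions for illustration, not a benchmark defined by the paper.

```python
from collections import defaultdict


def detection_rates(samples):
    """Return {group: detection rate} over watermarked items only.

    samples: iterable of (group, detected) pairs, where group is a label such
    as a language code and detected is True if the detector flagged the item.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [detected, total]
    for group, detected in samples:
        counts[group][0] += int(bool(detected))
        counts[group][1] += 1
    return {g: hit / total for g, (hit, total) in counts.items()}


def parity_gap(rates):
    """Difference between the highest and lowest per-group detection rate."""
    if not rates:
        raise ValueError("no groups to compare")
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    demo = [("en", True), ("en", True), ("en", True), ("en", False),
            ("sw", True), ("sw", False), ("sw", False), ("sw", False)]
    rates = detection_rates(demo)
    print(rates, parity_gap(rates))
```

A gap near zero would indicate parity on this one metric; a full audit would also need false-positive rates on unwatermarked content, broken down by the same groups.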

Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem

2026-06-09T00:39:28Z

Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.

AnnotateThis: Analyzing a human-LLM system for annotating social media data with the concept of climate change mitigation pessimism

2026-06-08T22:03:02Z

Large language models (LLMs) are increasingly being integrated into research workflows. However, LLMs have been shown to struggle with difficult and nuanced concepts such as those found in computational social science (CSS) research. Within the CSS community, there has been a call for new systems to be developed which center humans in LLM-supported scientific workflows. We develop AnnotateThis, a human-centered system for inspecting and improving LLM annotations, a process we refer to as LLM grounding for a target concept. AnnotateThis is developed with both computational and social scientists to reflect existing workflows for data annotation. It includes a range of information features for users to interrogate the quality and reliability of LLM annotations. We evaluate our system in two settings. In the first, we assume a researcher may not have access to ground truth data and that users of AnnotateThis have limited prior knowledge of the concept they would like an LLM to annotate. That is, they may be conducting concept specification and LLM grounding simultaneously. In the second setting, we assume access to ground truth labels and that the concept is specified for a given annotation task; here, the task of LLM grounding is more straightforward. We find that in both settings users can improve the quality of LLM annotations with AnnotateThis and that their final annotations far surpass those created without human intervention. For example, when we evaluate with ground truth labels, we see an absolute improvement of 0.15 in F-Measure and 0.23 in accuracy over a fully automated state-of-the-art method for prompt refinement.

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

2026-06-08T20:38:06Z

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by 1.31 points for Gemini 3 Flash reviewers and by 0.88 points for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

"Where is this coming from?" Uncovering Trustworthiness Ideals in AI-powered Peripartum Information Seeking

2026-06-08T20:36:57Z

AI-powered tools increasingly promise to fill information gaps in health, especially in domains like maternal and reproductive health that demand timely, accurate, and actionable information. This is extremely important, as the United States leads peer nations in preventable deaths, with stark racial disparities. However, current AI and NLP-powered systems aim to improve access to vetted maternal health information by routing user queries to a factual response while under-specifying the socio-technical governance structures that shape trust, use, and harm in practice. We report findings from four synchronous focus groups ($n=24$) with three stakeholder groups central to peripartum information support: birthing people, clinicians, and health workers (e.g., doulas, social workers, community health workers) exploring topics around information seeking, experience with current clinical infrastructure, misinformation, and an AI-enabled factual answering tool design probe. Our inductive analysis surfaces a central finding: in high-stakes health contexts shaped by historical inequities, trustworthiness must be inspectable and not asserted. While stakeholders diverge on what makes information credible, they converge on the need for transparency, recourse, and ecosystem complementarity. Based on the discussions, we identify four themes and governance requirements: (1) support for social and identity-based sensemaking, (2) pluralistic verification practices, (3) inspectable governance with recourse mechanisms, and (4) ecosystem-aware integration that avoids shifting burden. Building on these findings, we propose design artifacts that are mistrust-aware and promote principled governance mechanisms for transparent, pluralistic AI systems. Finally, we discuss the implications of our findings for expanding human-AI evaluations and improving the transparency of deployed AI systems.

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

2026-06-08T19:57:13Z

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.
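
The combination of feasibility gating and Pareto-style selection can be sketched as follows, assuming just two objectives: a disparity score to minimize and a personalization-fidelity score to maximize. The `Candidate` fields, the `min_fidelity` gate, and the function names are hypothetical simplifications; the framework in the abstract audits more objectives and uses pair-aware gating.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    disparity: float  # lower is better (hypothetical audit score)
    fidelity: float   # higher is better (hypothetical audit score)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and better on one."""
    no_worse = a.disparity <= b.disparity and a.fidelity >= b.fidelity
    strictly_better = a.disparity < b.disparity or a.fidelity > b.fidelity
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate], min_fidelity: float = 0.0) -> List[Candidate]:
    """Gate out infeasible candidates, then keep the non-dominated ones."""
    feasible = [c for c in cands if c.fidelity >= min_fidelity]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


if __name__ == "__main__":
    pool = [Candidate("A", 0.10, 0.5), Candidate("B", 0.20, 0.9),
            Candidate("C", 0.30, 0.8), Candidate("D", 0.05, 0.2)]
    print([c.text for c in pareto_front(pool, min_fidelity=0.4)])
```

Because several candidates usually survive on the front, a final choice still needs a policy, which is consistent with the abstract's point that no single strategy dominates all objectives.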

Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators

2026-06-08T19:09:20Z

As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learning tasks, either by requesting help or through integrated tools. Trust in AI can influence how students interpret and use that output, including whether they evaluate it critically or exhibit overreliance. We investigate how students' trust relates to their appropriate reliance on an AI assistant during programming problem-solving tasks, and whether this relationship differs by learner characteristics. A total of 432 undergraduate students completed Python output-prediction problems while receiving recommendations and explanations from an AI chatbot, including accurate and intentionally misleading suggestions. We operationalize reliance behaviorally as the extent to which students' responses reflected appropriate use of the AI assistant's suggestions, accepting them when they were correct and rejecting them when they were incorrect. Pre- and post-task surveys assessed trust in the assistant, AI literacy, need for cognition, programming self-efficacy, and programming literacy. Results showed a non-linear relationship in which higher trust was associated with lower appropriate reliance, suggesting weaker discrimination between correct and incorrect recommendations. This relationship was significantly moderated by students' AI literacy and need for cognition. These findings highlight the need for future work on instructional and system supports that encourage more reflective evaluation of AI assistance during problem-solving.
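
The behavioral definition of appropriate reliance (accept correct suggestions, reject incorrect ones) can be expressed as a simple score. The sketch below assumes each trial is recorded as a pair of booleans; the function name and data layout are illustrative, and the study's actual scoring may differ.

```python
def appropriate_reliance(trials):
    """Fraction of trials in which the student's choice matched AI correctness.

    trials: iterable of (ai_correct, student_accepted) boolean pairs.
    A trial counts as appropriate when a correct suggestion is accepted
    or an incorrect suggestion is rejected.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")
    appropriate = sum(1 for ai_correct, accepted in trials
                      if bool(ai_correct) == bool(accepted))
    return appropriate / len(trials)


if __name__ == "__main__":
    # One student: accepts everything, including two misleading suggestions.
    over_reliant = [(True, True), (True, True), (False, True), (False, True)]
    print(appropriate_reliance(over_reliant))
```

A student who accepts every suggestion scores no better than the share of correct suggestions, which is how overreliance shows up as weak discrimination under this measure.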

The Human Vulnerabilities & Exploits (HVE) Framework

2026-06-08T19:06:18Z

The cybersecurity community has invested over two decades in building standardized frameworks, namely the Common Vulnerabilities and Exposures (CVE) system, the Common Vulnerability Scoring System (CVSS), and the Common Weakness Enumeration (CWE), to identify, classify, and remediate threats to digital infrastructure. However, an emerging body of research reveals that a vast majority of successful cyberattacks exploit not software flaws, but human behavioral and psychological vulnerabilities. Social engineering, fraud, and scam attacks, which manipulate human cognition, emotion, and trust, do not have an equivalent standardized framework. Meanwhile, behavioral science and psychology research has established robust theoretical foundations, such as dual-process theory, prospect theory, social influence frameworks, and visceral state models, which explain precisely why and how these attacks succeed. This paper introduces the Human Vulnerabilities & Exploits (HVE) Framework, a structured approach for identifying, classifying, and mitigating the behavioral and psychological vulnerabilities exploited in scams, social engineering, and other human-centric fraud and attacks. Analogous in concept to how CVE helps classify software vulnerabilities, it provides a shared, machine-readable taxonomy with structured identifiers, multi-dimensional severity scoring via the Human Vulnerability Severity Score (HVSS), and actionable remediation guidance through Human Vulnerability Patches (HVPs). This introduction synthesizes the relevant literature across cybersecurity standardization, behavioral science, and fraud defense to establish the theoretical and practical foundations for the HVE framework, whose architecture and technical specifications are detailed in subsequent sections.

Culturally uneven urban perception in large language models

2026-06-08T19:03:29Z

Large language models (LLMs) are increasingly used to describe and evaluate cities, yet the cultural structure of their urban judgments remains understudied. Here we introduce a measurement framework for testing whether LLM-based urban perception is culturally neutral, using a globally stratified street-view image dataset. Open-ended descriptions and structured scores generated by three frontier multimodal models all show that the neutral baseline lies closer to regional framings associated with Europe and North America than to other cultural framings. Comparisons between AI and human urban perception further show that prompting can move AI responses closer to specific regional human descriptions, but fails to recover the variety and diversity of human responses, flattening observed demographic patterns and introducing sentiment-based self-favouring bias. These results indicate a systematic risk in treating AI as a neutral tool for urban tasks, especially when model outputs are used to compare, evaluate or represent cities across cultural contexts.

Offline-First LLM Architecture for Adaptive Learning in Low-Connectivity Environments

2026-06-08T18:34:43Z

Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact. Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.

The Empirically Grounded Adaptive Virtual Patient for Psychotherapy Training: Disclosure That Responds to Therapist Micro-Skills

2026-06-08T18:27:22Z

Simulated patients offer a scalable way to train psychotherapy micro-skills such as empathic responding and exploratory probing, but current systems either follow fixed scripts or rely on LLMs that drift unpredictably over long sessions. We present the Adaptive Virtual Patient (AVP), which adapts its disclosure behavior -- from guarded, through moderate openness, to full disclosure -- in response to trainee skill. The AVP is grounded in a structural equation model fit to nearly 2,000 hours of real-world psychotherapy transcripts, which quantifies how therapist empathy and exploration shift a patient's openness over time. An LLM generates the AVP's utterances conditioned on a disclosure level that the dynamics module updates each turn. In an evaluation with 20 clinicians and trainees over 80 sessions (1,033 turns), the AVP's disclosure rises in response to therapist empathy and exploration, while a prompt-only baseline stays flat; ablations confirm that the empirically motivated parameterization outperforms alternatives, with exploration carrying most of the adaptive signal.

Principled Uncertainty in Clinical AI: End-to-End Bayesian Modelling and Algorithmic Equity Auditing Across Multimodal Patient Data

2026-06-08T17:44:15Z

Clinical artificial intelligence (AI) systems routinely produce predictions without principled quantification of uncertainty, limiting their trustworthiness in high-stakes medical environments. This paper presents an integrated research programme addressing two interconnected problems: (1) the development of a fully end-to-end Bayesian uncertainty modelling framework for multimodal clinical data, and (2) the application of calibrated uncertainty estimates as a formal measure of algorithmic equity across patient subgroups. We construct a probabilistic deep learning architecture comprising modality-specific variational encoders, a precision-weighted late fusion mechanism, and a decomposed uncertainty output head that separates aleatoric from epistemic uncertainty. The system is trained with a composite Bayesian loss incorporating binary cross-entropy, Kullback-Leibler divergence regularisation, and an uncertainty calibration penalty. We evaluate model calibration using Expected Calibration Error (ECE = 0.096) and conduct a subgroup equity audit across facility type, socioeconomic status, age group, and biological sex on a dataset of 1,000 simulated patients. Results demonstrate that epistemic uncertainty systematically identifies underserved populations: primary/rural facility patients show a 15.3% uncertainty equity gap (p < 0.001, effect size = 0.698), low socioeconomic status patients exhibit a 6.8% gap (p < 0.001), and elderly patients show a 3.9% gap (p < 0.001), whilst no significant sex-based disparity is detected. These findings establish that calibrated uncertainty is not merely a technical property of probabilistic models but constitutes an actionable equity signal with direct clinical relevance.
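
For readers unfamiliar with the calibration metric reported above, the following is a minimal sketch of Expected Calibration Error using equal-width bins. It uses a common binary-classification variant that compares the mean predicted positive-class probability with the observed positive rate in each bin. The bin count of 10 and the function name are assumptions, and the paper's exact binning scheme is not stated in the abstract.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |observed positive rate - mean predicted probability|.

    probs: predicted probabilities of the positive class, each in [0, 1].
    labels: ground-truth labels, each 0 or 1.
    n_bins: number of equal-width probability bins (assumed, not from the paper).
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(pos_rate - mean_conf)
    return ece


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    print(round(expected_calibration_error(probs, labels), 4))
```

Computing the same quantity separately for each patient subgroup is one simple way to check whether calibration itself differs across groups, alongside the uncertainty gaps the abstract reports.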