https://arxiv.org/api/D2BCsDf3L8VrehC53cVFNKzHQXU 2026-04-21T09:48:23Z 173412 675 15 http://arxiv.org/abs/2604.16755v1 Machine individuality: Separating genuine idiosyncrasy from response bias in large language models 2026-04-18T00:02:41Z As large language models (LLMs) are increasingly integrated into daily life, in roles ranging from high-stakes decision support to companionship, understanding their behavioral dispositions becomes critical. A growing literature uses psychometric inventories and cognitive paradigms to profile LLM dispositions. However, these approaches cannot determine whether behavioral differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. Here, we apply crossed random-effects models -- widely used in psychometrics to separate systematic effects -- to 74.9 million ratings provided by 10 open-weight LLMs for over 100,000 words across 14 psycholinguistic norms. On average, 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model. Cross-norm prediction analyses reveal this individuality as a coherent fingerprint, unique to each model. These results identify individual differences among LLMs that cannot be attributed to response biases or stochastic noise. We term these differences machine individuality. 2026-04-18T00:02:41Z 18 pages, 1 figure. Supporting information included Valentin Kriegmair Dirk U. Wulff http://arxiv.org/abs/2604.16753v1 Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs 2026-04-17T23:55:19Z As large language models (LLMs) transition into autonomous agents integrated with extensive tool ecosystems, traditional routing heuristics increasingly succumb to context pollution and "overthinking". We argue that the bottleneck is not a deficit in algorithmic capability or skill diversity, but the absence of disciplined second-order metacognitive governance. In this paper, our scientific contribution focuses on the computational translation of human cognitive control - specifically, delayed appraisal, epistemic vigilance, and region-of-proximal offloading - into a single-agent architecture. We introduce MESA-S (Metacognitive Skills for Agents, Single-agent), a preliminary framework that shifts scalar confidence estimation into a vector separating self-confidence (parametric certainty) from source-confidence (trust in retrieved external procedures). By formalizing a delayed procedural probe mechanism and introducing Metacognitive Skill Cards, MESA-S decouples the awareness of a skill's utility from its token-intensive execution. Evaluated under an In-Context Static Benchmark Evaluation natively executed via Gemini 3.1 Pro, our early results suggest that explicitly programming trust provenance and delayed escalation mitigates supply-chain vulnerabilities, prunes unnecessary reasoning loops, and prevents offloading-induced confidence inflation. This architecture offers a scientifically cautious, behaviorally anchored step toward reliable, epistemically vigilant single-agent orchestration. 2026-04-17T23:55:19Z 7 pages, 1 figure Eren Unlu http://arxiv.org/abs/2604.16752v1 Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents 2026-04-17T23:54:34Z Current agent evaluations largely reward execution on fully specified tasks, while recent work studies clarification [11, 22, 2], capability awareness [9, 1], abstention [8, 14], and search termination [20, 5] mostly in isolation. This leaves open whether agents can diagnose why a task is blocked before acting. We introduce the Support-State Triage Audit (SSTA-32), a matched-item diagnostic framework in which minimal counterfactual edits flip the same base request across four support states: Complete (ANSWER), Clarifiable (CLARIFY), Support-Blocked (REQUEST SUPPORT), and Unsupported-Now (ABSTAIN). We evaluate a frontier model under four prompting conditions - Direct, Action-Only, Confidence-Only, and a typed Preflight Support Check (PSC) - using Dual-Persona Auto-Auditing (DPAA) with deterministic heuristic scoring. Default execution overcommits heavily on non-complete tasks (41.7% overcommitment rate). Scalar confidence mapping avoids overcommitment but collapses the three-way deferral space (58.3% typed deferral accuracy). Conversely, both Action-Only and PSC achieve 91.7% typed deferral accuracy by surfacing the categorical ontology in the prompt. Targeted ablations confirm that removing the support-sufficiency dimension selectively degrades REQUEST SUPPORT accuracy, while removing the evidence-sufficiency dimension triggers systematic overcommitment on unsupported items. Because DPAA operates within a single context window, these results represent upper-bound capability estimates; nonetheless, the structural findings indicate that frontier models possess strong latent triage capabilities that require explicit categorical decision paths to activate safely. 2026-04-17T23:54:34Z Eren Unlu http://arxiv.org/abs/2506.00430v3 MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning 2026-04-17T23:52:25Z Multiple cognitive theories -- Global Workspace Theory, reconstructive episodic memory, inner speech, and complementary learning systems -- converge on a shared set of architectural principles: parallel specialized processing, integrative synthesis into a bounded unified representation, and reconstructive rather than accumulative maintenance. We test whether these converging principles provide computational advantages when implemented in AI systems. MIRROR operationalizes each principle as a concrete mechanism: an Inner Monologue Manager generates parallel cognitive threads (Goals, Reasoning, Memory), a Cognitive Controller synthesizes these into a bounded first-person narrative that is fully reconstructed each turn, and a temporal separation between fast response generation and slow deliberative consolidation mirrors complementary learning dynamics. Evaluated on multi-turn dialogue requiring constraint maintenance under attentional interference, MIRROR yields 21% relative improvement across seven architecturally diverse language models. Ablation studies test the theoretical predictions directly: reconstructive synthesis improves all seven models (+5-20%); the integrated system outperforms either component alone for six of seven models, confirming that parallel exploration and integrative synthesis are complementary; and gains concentrate where theories predict -- under high attentional load where global availability of integrated information is most needed. These results demonstrate that converging principles from human cognition provide architecture-general computational advantages, and generate testable behavioral predictions about working memory, inner speech, and memory consolidation. Project page available at https://www.arcarae.com/research/MIRROR and code at https://github.com/arcarae/MIRROR. 2025-05-31T07:17:48Z Nicole Hsing http://arxiv.org/abs/2603.20435v2 Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health 2026-04-17T23:50:11Z Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)-based extraction pipelines often struggle to capture these dependencies, leading to clinically inconsistent outputs. We propose deep reflective reasoning, a large language model agent framework that iteratively self-critiques and revises structured outputs by checking consistency among variables, the input text, and retrieved domain knowledge, stopping when outputs converge. We extensively evaluate the proposed method in three diverse oncology applications: (1) On colorectal cancer synoptic reporting from gross descriptions (n=217), reflective reasoning improved average F1 across eight categorical synoptic variables from 0.828 to 0.911 and increased mean correct rate across four numeric variables from 0.806 to 0.895; (2) On Ewing sarcoma CD99 immunostaining pattern identification (n=200), the accuracy improved from 0.870 to 0.927; (3) On lung cancer tumor staging (n=100), tumor stage accuracy improved from 0.680 to 0.833 (pT: 0.842 -> 0.884; pN: 0.885 -> 0.948). The results demonstrate that deep reflective reasoning can systematically improve the reliability of LLM-based structured data extraction under interdependence constraints, enabling more consistent machine-operable clinical datasets and facilitating knowledge discovery with machine learning and data science towards digital health. 2026-03-20T19:05:30Z 12 figures and 2 tables Jingwei Huang Kuroush Nezafati Zhikai Chi Ruichen Rong Colin Treager Tingyi Wanyan Yueshuang Xu Xiaowei Zhan Patrick Leavey Guanghua Xiao Wenqi Shi Yang Xie http://arxiv.org/abs/2604.16748v1 TriTS: Time Series Forecasting from a Multimodal Perspective 2026-04-17T23:43:45Z Time series forecasting plays a pivotal role in critical sectors such as finance, energy, transportation, and meteorology. However, Long-term Time Series Forecasting (LTSF) remains a significant challenge because real-world signals contain highly entangled temporal dynamics that are difficult to fully capture from a purely 1D perspective. To break this representation bottleneck, we propose TriTS, a novel cross-modal disentanglement framework that projects 1D time series into orthogonal time, frequency, and 2D-vision spaces.To seamlessly bridge the 1D-to-2D modality gap without the prohibitive $O(N^2)$ computational overhead of Vision Transformers (ViTs), we introduce a Period-Aware Reshaping strategy and incorporate Visual Mamba (Vim). This approach efficiently models cross-period dependencies as global visual textures while maintaining linear computational complexity. Complementing this, we design a Multi-Resolution Wavelet Mixing (MR-WM) module for the frequency modality, which explicitly decouples non-stationary signals into trend and noise components to achieve fine-grained time-frequency localization. Finally, a streaming linear branch is retained in the time domain to anchor numerical stability. By dynamically fusing these three complementary representations, TriTS effectively adapts to diverse data contexts. Extensive experiments across multiple benchmark datasets demonstrate that TriTS achieves state-of-the-art (SOTA) performance, fundamentally outperforming existing vision-based forecasters by drastically reducing both parameter count and inference latency. 2026-04-17T23:43:45Z 9 pages, 3 figures. Accepted by the A2A-MML Workshop in conjunction with CVPR 2026 Xiang Ao http://arxiv.org/abs/2603.06723v3 AWPD: Frequency Shield Network for Agnostic Watermark Presence Detection 2026-04-17T23:30:08Z Invisible watermarks, as an essential technology for image copyright protection, have been widely deployed with the rapid development of social media and AIGC. However, existing invisible watermark detection heavily relies on prior knowledge of specific algorithms, leading to limited detection capabilities for ``unknown watermarks'' in open environments. To this end, we propose a novel task named Agnostic Watermark Presence Detection (AWPD), which aims to identify whether an image carries a copyright mark without requiring decoding information. We construct the UniFreq-100K dataset, comprising large-scale samples across various invisible watermark embedding algorithms. Furthermore, we propose the Frequency Shield Network (FSNet). This model deploys an Adaptive Spectral Perception Module (ASPM) in the shallow layers, utilizing learnable frequency gating to dynamically amplify high-frequency watermark signals while suppressing low-frequency semantics. In the deep layers, the network introduces Dynamic Multi-Spectral Attention (DMSA) combined with tri-stream extremum pooling to deeply mine watermark energy anomalies, forcing the model to precisely focus on sensitive frequency bands. Extensive experiments demonstrate that FSNet exhibits superior zero-shot detection capabilities on the AWPD task, outperforming existing baseline models. Code and datasets will be released upon acceptance. 2026-03-06T01:10:47Z 15 pages, 7 figures Xiang Ao Yilin Du Zidan Wang Mengru Chen Siyang Lu http://arxiv.org/abs/2604.16745v1 Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals 2026-04-17T23:26:27Z Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains \emph{why}. We develop a diagnostic framework with two tools, ranking consistency $ρ_s$ and off-diagonal correlation $ρ_\text{off}$, that decomposes the collapse into (1)a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2)shared reliance on \emph{pairwise} similarity signals whose ranking consistency degrades from $ρ_s{=}0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations) while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43--65%. 2026-04-17T23:26:27Z Yang Shanglin http://arxiv.org/abs/2604.16744v1 Evaluating Adaptive Personalization of Educational Readings with Simulated Learners 2026-04-17T23:21:09Z We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology. 2026-04-17T23:21:09Z Ryan T. Woo Anmol Rao Aryan Keluskar Yinong Chen http://arxiv.org/abs/2604.16742v1 CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction 2026-04-17T23:18:28Z Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenge every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining if a trial's outcome is public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy by human expert's annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. CT Open Platform is hosted at $\href{https://ct-open.net/}{https://ct-open.net/}$ 2026-04-17T23:18:28Z Under Review Jianyou Wang Youze Zheng Longtian Bao Hanyuan Zhang Qirui Zheng Yuhan Chen Yang Zhang Matthew Feng Maxim Khan Aditya K. Sehgal Christopher D. Rosin Ramamohan Paturi Umber Dube Leon Bergen http://arxiv.org/abs/2604.16736v1 When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis 2026-04-17T22:48:23Z LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state - distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier $μ_f > 1$, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool. 2026-04-17T22:48:23Z Justice Owusu Agyemang Michael Agyare Miriam Kobbinah Nathaniel Agbugblah Prosper Addo http://arxiv.org/abs/2604.16734v1 Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines 2026-04-17T22:45:57Z Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale to richer visual representations, inference increasingly relies on storing large numbers of vision tokens in the key-value (KV) cache, making memory consumption a central bottleneck. Existing methods address this issue by identifying redundancy in vision tokens and compressing the cache, but such compression is typically applied only after all inputs are processed, resulting in high peak memory usage during the prefill stage. In this work, we show that MLLMs exhibit inherent structural regularities and representational redundancy that can be exploited to control memory growth throughout inference. Based on this insight, we propose a sequential input-compression mechanism that enforces a fixed memory budget by performing structure-aware key-value cache compression during the prefill process. This approach substantially reduces peak memory usage while maintaining generative performance with only minimal degradation, enabling more practical and memory-efficient multimodal inference. 2026-04-17T22:45:57Z ACL 2026 Junwan Kim Hyunkyung Bae http://arxiv.org/abs/2604.15271v2 SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation 2026-04-17T22:19:53Z Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU. 2026-04-16T17:42:42Z Tianhao Fu Austin Wang Charles Chen Roby Aldave-Garza Yucheng Chen http://arxiv.org/abs/2604.14927v2 STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing 2026-04-17T22:10:44Z Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face correspondence, yielding instance labels and metadata for downstream learning and evaluation. The construction merges adjacent B-Rep faces only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion. On ABC, same-primitive dihedral angles are strongly bimodal, yielding a threshold-insensitive low-angle regime for part extraction. Because the partition is defined on intrinsic B-Rep topology rather than on a particular triangulation, the resulting boundaries remain stable under changes in tessellation. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180{,}000 models in under six hours on a consumer CPU. We release code and precomputed labels, and show that STEP-Parts serves both as a tessellation-robust geometric reference and as a useful supervision source in two downstream probes: an implicit reconstruction--segmentation network and a dataset-level point-based backbone. 2026-04-16T12:12:27Z Shen Fan Mikołaj Kida Przemyslaw Musialski http://arxiv.org/abs/2603.16663v4 When Openclaw Agents Learn from Each Other: Insights from Emergent AI Agent Communities for Human-AI Partnership in Education 2026-04-17T22:08:39Z The AIED community envisions AI evolving "from tools to teammates," yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a "bidirectional scaffolding" process, learning through teaching; (2) peer learning emerges without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi-agent educational systems. We sketch an illustrative curriculum design, "Learn by Teaching Your AI Agent Teammate," and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry. 2026-03-17T15:30:36Z 15 pages. Camera-ready version with updated author names. Accepted at AIED 2026 Eason Chen Ce Guan A Elshafiey Zhonghao Zhao Joshua Zekeri Afeez Edeifo Shaibu Emmanuel Osadebe Prince Cyuan-Jhen Wu