https://arxiv.org/api/MWiSQ2aR7x6w1j8Lz9MQz3R2K/c2026-06-10T04:23:19Z11183310515http://arxiv.org/abs/2606.10398v1Selection, Not Salience: The Shape and Limits of Personalization in Social Highlighting2026-06-09T04:18:08ZDoes personalizing what a reader sees pay off, and where does it stop? Using a social web highlighter and a co-readership identity control (the same document highlighted by many users, which holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does), we map the shape and limits of personalization across reading altitudes. At the document altitude we give the clean, leakage-free, identity-controlled measurement that prior next-document evaluations could only upper-bound: a person's history identifies which documents in a co-reading neighborhood are theirs, with an own-versus-other gap of +0.169 against community negatives and +0.119 against topic-matched hard negatives (both highly significant); a content-based arm suggests the signal is not purely title-driven but is largely thematic. This is comparable to the span-level selection signal (+0.14) from our prior work: the selection signal is of comparable magnitude across altitudes (+0.12 to +0.17), most of it stable topic preference. At the sentence altitude, a two-stage personalized auto-highlight (an impersonal model proposes candidates, a personal model re-ranks them) does not improve on its impersonal baseline: two off-the-shelf zero-shot LLMs, including a frontier model, predict highlight locations worse than a lead baseline, and personal re-ranking is beaten by the salience order even on the highest-recall candidate pool, so the null is not merely a Stage-1 ceiling artifact. Measurable personalization appears primarily at the selection layer: modest (~+0.13), topic-dominated, with no reliable gain at the salience layer. We also surface a control-in-negatives bias that inflated our document gap to a spurious +0.227 until audited. Going beyond the shared salience layer may be better approached by aggregating individuals than by personalizing them harder.2026-06-09T04:18:08Z9 pages, 1 figure, 3 tablesKazuki NakayashikiKeisuke Watanabehttp://arxiv.org/abs/2603.14463v2An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs2026-06-09T04:09:38ZAdapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.2026-03-15T16:13:37Z21 pages, 12 figures, 17 tablesICLR 2026 Workshop Advances in Financial AIQian ZhuXinnan GuoJingjing HuoJun LiPan LiuWenyan YangWanqing XuXuan Linhttp://arxiv.org/abs/2602.12966v2ProbeLLM: Automating Principled Diagnosis of LLM Failures2026-06-09T04:02:52ZUnderstanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.2026-02-13T14:33:13ZYue HuangZhengzhe JiangYuchen MaYu JiangXiangqi WangYujun ZhouYuexing HaoKehan GuoPin-Yu ChenStefan FeuerriegelXiangliang Zhanghttp://arxiv.org/abs/2606.10381v1Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis2026-06-09T03:42:50ZMuon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.2026-06-09T03:42:50Z22 pages, 5 figures, and 6 tablesRuobing JiangDawei FuCheng JiangTianyi YangZijian WangYoupeng WuYong BanYajun MaoQiang Lihttp://arxiv.org/abs/2606.10380v1Expert-Level Crisis Detection in Mental Health Conversations2026-06-09T03:42:14ZReal-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts.Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.2026-06-09T03:42:14ZGrace ByunAbigail LottRebecca LipschutzSean T. MintonElizabeth A. StinsonJinho D. Choihttp://arxiv.org/abs/2606.10369v1PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning2026-06-09T03:28:17ZAs large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II--IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.2026-06-09T03:28:17Zpublished in ICML 2026Xinyue PengYi QianJiaojiao LinWenjian ShaoYanming Liuhttp://arxiv.org/abs/2606.09421v2What Should a Skill Remember? Quality--Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents2026-06-09T02:58:45ZLarge language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality--cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%; in frozen cross-model transfer, the corresponding reductions average 14.7% and 13.7%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: https://github.com/1Reminding/Skill_EE.2026-06-08T12:36:51ZQinghua XingYinda ChenYaping JinZhenhe WuBohan LinHang ZhouXinghao ChenHanting ChenZhiwei Xionghttp://arxiv.org/abs/2507.09788v3TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit2026-06-09T02:50:22ZRecent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, including preliminary experiments with real human behavior as control. Results highlight possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.2025-07-13T21:00:27Z9 pagesPaulo SalemRobert SimChristopher OlsenPrerit SaxenaRafael BarcelosYi Dinghttp://arxiv.org/abs/2511.02603v2CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency2026-06-09T02:34:12ZLarge language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency (Wang et al., 2023) strategy requires a fixed number of calls and fails when the correct answer is infrequent. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers and adaptively halts sampling once one answer accumulates enough posterior mass. We prove guarantees in both an ideal calibrated regime and a realistic noisy-confidence regime under a directional drift condition. Averaged over five reasoning benchmarks, CGES reduces the average number of calls by 58% on average (from 16.0 to 6.7) while matching its accuracy within 0.4 percentage points of self-consistency.2025-11-04T14:25:54ZExtended version. A preliminary version was accepted at the Efficient Reasoning Workshop @ NeurIPS 2025. Code: https://github.com/EhsanAghazadeh/cgesEhsan AghazadehAhmad GhasemiHedyeh BeyhaghiHossein Pishro-Nikhttp://arxiv.org/abs/2606.10338v1Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models2026-06-09T02:33:40ZMachine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget--retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose \textbf{TRACE}, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9\% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.2026-06-09T02:33:40ZJingyi XieYijun LinYinjiang XiongZhikun ZhangSai Lihttp://arxiv.org/abs/2603.19225v4FinTradeBench: A Financial Reasoning Benchmark for LLMs2026-06-09T02:26:19ZReal-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment.
We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.2026-03-19T17:59:41Z9 pages main text, 31 pages total (including references and appendix). 5 figures, 16 tables. Preprint under review. Code and data will be made available upon publicationYogesh AgrawalAniruddha DuttaMd Mahadi HasanSantu KarmakerAritra Duttahttp://arxiv.org/abs/2605.27914v2Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm2026-06-09T02:24:01ZBenchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.2026-05-27T03:41:11Z YumingRapheal HuangYao LiuPengjie DingLei WangJunchen Wanhttp://arxiv.org/abs/2606.10327v1The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring2026-06-09T02:23:46ZAutomated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE~2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLM), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.2026-06-09T02:23:46ZAli KeramatiMark Warschauerhttp://arxiv.org/abs/2606.10316v1TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning2026-06-09T02:11:16ZSpreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.2026-06-09T02:11:16Z5 pages, 2 figuresMingyue ChengShuo YuDaoyu WangQingchuan LiXiaoyu TaoQingyang MaoYitong ZhouQi Liuhttp://arxiv.org/abs/2606.10315v1Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents2026-06-09T02:11:01ZLLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.2026-06-09T02:11:01Z13 pages, 1 figure, 5 tablesSawyer ZhangAlexander WangSophie Lei