http://arxiv.org/abs/2505.12992v4
Fractured Chain-of-Thought Reasoning
2026-06-12T08:24:46Z
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches the full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
2025-05-19T11:30:41Z
Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong

http://arxiv.org/abs/2606.14243v1
Decoupled Mixture-of-Experts for Parametric Knowledge Injection
2026-06-12T08:21:28Z
Knowledge injection aims to equip large language models (LLMs) with external, domain-specific, or time-sensitive knowledge.
Existing approaches typically face a trade-off between flexibility and integration: retrieval-augmented generation keeps knowledge outside the model but only provides prompt-level augmentation, whereas post-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates. In this paper, we propose Decoupled Mixture-of-Experts (DMoE), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation. To support efficient auto-regressive inference, DMoE attaches experts only to the final-layer feed-forward network, preserving KV-cache reuse while enabling parameter-level knowledge augmentation. Experiments on knowledge-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter-based baselines.
2026-06-12T08:21:28Z
Baoqing Yue, Weihang Su, Qingyao Ai, Yichen Tang, Changyue Wang, Jiacheng Kang, Jingtao Zhan, Yiqun Liu

http://arxiv.org/abs/2305.07609v4
Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation
2026-06-12T08:15:52Z
The remarkable achievements of Large Language Models (LLMs) have led to the emergence of a novel recommendation paradigm -- Recommendation via LLM (RecLLM). Nevertheless, it is important to note that LLMs may contain social prejudices, and therefore, the fairness of recommendations made by RecLLM requires further investigation. To avoid the potential risks of RecLLM, it is imperative to evaluate the fairness of RecLLM with respect to various sensitive attributes on the user side.
Due to the differences between the RecLLM paradigm and the traditional recommendation paradigm, it is problematic to directly use the fairness benchmark of traditional recommendation. To address the dilemma, we propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM). This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes in two recommendation scenarios: music and movies. By utilizing our FaiRLLM benchmark, we conducted an evaluation of ChatGPT and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations. Our code and dataset can be found at https://github.com/jizhi-zhang/FaiRLLM.
2023-05-12T16:54:36Z
Accepted by Recsys 2023 (Short). Typo corrections
Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He
10.1145/3604915.3608860

http://arxiv.org/abs/2606.14230v1
A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators
2026-06-12T08:12:03Z
Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on Generative Adversarial Networks (GANs)-based generated deepfakes, they often struggle with recent diffusion model-generated images. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture.
Experimental results show that the SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, the SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.
2026-06-12T08:12:03Z
Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

http://arxiv.org/abs/2602.04879v3
Rethinking the Trust Region in LLM Reinforcement Learning
2026-06-12T08:04:13Z
Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability.
To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
2026-02-04T18:59:04Z
Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

http://arxiv.org/abs/2606.14209v1
Detecting undisclosed LLM-generated content in parliamentary texts
2026-06-12T07:46:50Z
In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.
2026-06-12T07:46:50Z
Minerva Suvanto, Andrea McGlinchey, Peter J. Barclay, Mattias Wahde

http://arxiv.org/abs/2604.20462v3
Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark
2026-06-12T07:43:16Z
Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication
with documented maintenance cost. Prior detectors either require runnable tests or are
single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public
benchmark to calibrate it.
Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a
labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a
consolidation-savings model linking clusters to ISO/IEC 25010 maintainability
sub-characteristics.
Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616
Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein,
sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually
labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report
precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free
relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines.
Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman
rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches
F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a
disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings
model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5%
of step lines are eliminable.
2026-04-22T11:44:05Z
28 pages, 2 figures, 4 tables. Submitted to Information and Software Technology (Elsevier). Tool, corpus, labelled benchmark, and rubric released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0
Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

http://arxiv.org/abs/2606.14199v1
OdysSim: Building Foundation Models for Human Behavior Simulation
2026-06-12T07:31:55Z
Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on τ-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm.
We release all artifacts to support future research.
2026-06-12T07:31:55Z
34 pages. Code: https://github.com/sunnweiwei/OdysSim ; Models and data: https://huggingface.co/collections/cmu-lti/odyssim
Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu, Yiming Yang, Maarten Sap

http://arxiv.org/abs/2606.14179v1
CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward
2026-06-12T07:01:50Z
We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement.
Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
2026-06-12T07:01:50Z
Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim

http://arxiv.org/abs/2601.00821v3
Verbatim Chunks Beat Extracted Artifacts: A Controlled Ablation of Memory Representations for Long LLM Conversations
2026-06-12T06:36:41Z
A growing class of conversational-memory systems compresses dialogue history into structured artifacts -- extracted facts, decisions, or events -- on the premise that distilled structure retrieves better than raw text. We test this premise with a controlled ablation: within one fixed retrieval-rerank-reasoning pipeline, we swap only the stored representation -- LLM-extracted typed artifacts versus verbatim conversation chunks -- holding the model, retriever, reranker, and judge constant. Verbatim chunks win by 15.9 points on LoCoMo (43.9% vs. 28.0%) and 22.0 points on LongMemEval-S (67.4% vs. 45.4%); a 1-hop semantic graph does not recover the gap, and five confound controls reproduce the effect. The mechanism is lossy distillation: extraction discards verbatim detail that chunks retain for free, and the extracted-artifact pipeline never beats naive RAG in overall accuracy. Concurrent positive results with near-verbatim, provenance-preserving units fit the same account: retrieval accuracy tracks how far the representation departs from the source. For the extraction designs we test, structured memory should augment verbatim text rather than replace it: a chunks ∪ artifacts union store matches chunks on both benchmarks while artifacts alone forfeit the gap.
Code and data: https://github.com/tao-hpu/cog-canvas
2025-12-23T16:45:15Z
v2: substantially revised -- reframed from a system paper to a controlled ablation study; title and conclusions updated accordingly. 26 pages, 5 figures
Tao An

http://arxiv.org/abs/2506.18756v2
Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization
2026-06-12T06:33:29Z
LLMs increasingly integrate auto-suggestion optimization modules, enabling them to rewrite and display user input before generating the final response. While this design aims to enhance transparency and trust, its process of autonomously selecting a single best result from multiple candidate solutions allows attackers to hijack this optimization process by inducing subtle, imperceptible semantic shifts. To address this, we propose a semantic preservation hijacking attack method based on black-box conditions: Adaptive Greedy Local Search. This method hierarchically decomposes the input text, masks key language units, and dynamically adjusts candidate replacement words at predefined semantic checkpoints. This maximizes the deviation between the model output and the original intent while strictly maintaining semantic similarity to the original text. Experimental results on commercial and open-source LLMs demonstrate that, under the same semantic similarity constraints, this method achieves a higher attack success rate than existing attack methods in over 2400 test cases. Code is available at: https://github.com/franz-chang/DOBS
2025-05-26T15:41:06Z
12 pages, 8 figures.
Accepted by the IEEE International Conference on Multimedia and Expo (ICME 2026)
Chong Zhang, Xiang Li, Jia Wang, Shan Liang, Haochen Xue, Xiaobo Jin

http://arxiv.org/abs/2606.14155v1
Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems
2026-06-12T06:27:15Z
Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: existing methods suffer from inaccurate credit assignment and lack convergence guarantees. We propose Graph-based Target Back-Propagation (GTBP), a context adaptation framework for agentic workflows modeled as directed acyclic graphs. GTBP propagates local target outputs backward through the workflow graph and uses target-output discrepancies to guide a stage-wise prompt update mechanism. Theoretically, we show that GTBP's stage-wise prompt updates become stable over iterations, and that a sufficiently capable LLM optimizer can decrease the overall objective. Empirically, GTBP consistently outperforms strong baselines across three benchmarks while maintaining comparable computational cost.
2026-06-12T06:27:15Z
Tan Zhu, Tong Yao, Kananart Kuwaranancharoen, Amit Singh, Yushang Lai, Deepa Mohan, Shankara Bhargava

http://arxiv.org/abs/2606.14150v1
Small LLMs: Pruning vs. Training from Scratch
2026-06-12T06:24:28Z
Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5--0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization.
This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.
2026-06-12T06:24:28Z
Our code is available at https://github.com/zlab-princeton/llm-pruning-collection
Yufeng Xu, Taiming Lu, Kunjun Li, Jiachen Zhu, Mingjie Sun, Zhuang Liu

http://arxiv.org/abs/2606.12881v2
Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study
2026-06-12T06:20:00Z
We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
2026-06-11T04:15:54Z
7 pages, 3 figures, 1 table.
All authors contributed equally
Dezhi Yu, Yvonne Qiu, ShuoJia Fu

http://arxiv.org/abs/2505.16988v2
MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems
2026-06-12T06:12:21Z
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invites contributions from the broader open-source community.
2025-05-22T17:54:38Z
18 pages, 11 figures
Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen