http://arxiv.org/abs/2607.13418v2 Can We Steer the Black-Box? Towards Controllability-Centric Evaluation of Recommender Systems with Collaborative Agents 2026-07-16T00:42:41Z

Recommender systems operate as black boxes, leaving users and regulators unable to steer their outputs toward specific intentions or audit their behavior. This lack of controllability, defined as the system's ability to respond to explicit guidance, remains an unaddressed dimension in existing evaluation paradigms. To fill this gap, we propose CtrlBench-Rec, a collaborative multi-agent framework for systematic assessment of controllability. We formalize three fundamental tasks: target content discovery, interest profile shaping, and popularity bias mitigation, which together measure steerability from explicit commands to implicit representation steering and finally to overcoming algorithmic biases. Extensive experiments on real-world datasets and multiple recommendation models demonstrate that our framework effectively quantifies controllability and exposes critical system bottlenecks, most notably persistent resistance to guidance toward long-tail content. CtrlBench-Rec provides the first standardized toolkit for controllable recommendation research, algorithmic auditing, and user empowerment. Our code is released at https://github.com/caskcsg/CtrlBenchRec.

2026-07-15T03:37:06Z Jiwen Zhou Xiang Liu Mingming Li Pengbo Mo Jiao Dai Honglei Lv Jizhong Han Songlin Hu http://arxiv.org/abs/2607.14400v1 DS@GT ARC at LongEval: Citation Integrity and Factual Grounding in Scientific QA 2026-07-15T22:35:02Z

This paper describes DS@GT ARC's submission to the CLEF 2026 LongEval Task 4 on Retrieval-Augmented Generation (RAG). In this submission, we examine a divergence between traditional natural language evaluation metrics and citation integrity as applied to RAG QA systems. We evaluate a corrective pipeline using Corrective RAG (CRAG) and CiteFix against baseline and frontier model benchmark RAG QA scores. While frontier models maximized answer relevance and fluency scores, our RAGAs LLM-as-judge diagnostics indicate that frontier models would correctly identify relevant documents without using their context in answer generation. Conversely, by filtering chunks pre-generation and enforcing strict entailment of generated claims to the cited material post-generation, our corrective pipeline marginally improved citation faithfulness and answer grounding. We propose that evaluation of trustworthy RAG QA requires metrics that reward strict answer grounding.

2026-07-15T22:35:02Z 12 pages, 4 figures. Accepted to the CLEF 2026 LongEval Lab Working Notes Brandon Michaels Brendon Johnson http://arxiv.org/abs/2607.14390v1 Why Git Is the Memory Solution for the Agentic Development Lifecycle 2026-07-15T22:06:59Z

Coding agents now produce a growing share of a team's code, while the reasoning behind each change -- the alternatives weighed, the constraints discovered, the approaches rejected -- is trapped in assistant transcripts that vanish with the session. Memory for this setting, the agentic development lifecycle (ADLC), is usually posed as one retrieval problem and built as machinery: tiered stores, memory graphs, compiled wikis, model-judged admission. We argue memory should instead be git-bound -- built into the repository's version control, inheriting the guarantees the machinery struggles to construct: ground truth from commits, freshness from rebuild, verification from the merge, containment from review. On this ledger we solve two problems separately, then combine them. Seed supply is closed as an eight-corpus retrieval study under a pre-registered ship discipline: five imported ranking mechanisms rejected, two kept, and a best configuration of ~0.31 pooled MRR -- ~60x the raw-transcript grep floor, ~15x an honest parsed-turn floor. Answer assembly is where ranking stops helping: single-shot retrieval scores only 0.07-0.20 answer-sufficiency on real developer questions, and ungated episode injection measurably degrades good answers. A router dispatches breadth to a git-anchored structural map, pointed lookups to confidence-gated episodes, and rationale to decision synthesis, which reconstructs why-arcs no single session contains (0.83 sufficiency on a young ~50k-LOC production system). Routed, the system answers at 382-980 tokens per question -- three orders of magnitude below the recorded history. Because ground truth is mined from commit-session links rather than annotated, every result is replicable on any user's own history at zero labeling cost. The remaining constraint is capture. Code, benchmark, and paper source: github.com/rekal-dev/rekal-cli.
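The abstract reports retrieval quality as mean reciprocal rank (MRR). As a minimal sketch of how that metric is conventionally computed, assuming one ranked candidate list and one set of relevant IDs per query (the function name and data layout here are illustrative and not taken from the paper or its repository):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average over queries of 1 / rank of the first relevant hit.

    ranked_lists: one ranked list of candidate IDs per query.
    relevant_sets: one set of relevant IDs per query (same order).
    A query with no relevant hit contributes 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranking, gold in zip(ranked_lists, relevant_sets):
        for position, item in enumerate(ranking, start=1):
            if item in gold:
                total += 1.0 / position
                break
    return total / len(ranked_lists)


# Hypothetical toy data: first hit at rank 2, at rank 1, and no hit at all.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
gold = [{"b"}, {"x"}, {"missing"}]
score = mean_reciprocal_rank(rankings, gold)
```

How the paper pools queries across its eight corpora is not stated in the abstract; this sketch simply averages over whatever queries it is given.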

2026-07-15T22:06:59Z 8 pages Frank Guo http://arxiv.org/abs/2607.14331v1 Long-History User Transformers for Real-Time Ad Ranking 2026-07-15T19:51:16Z

Long interaction histories are among the most informative inputs for click-through rate (CTR) prediction, yet in online advertising they collide with a hard serving constraint: ads must be scored within a few hundred milliseconds to enter the auction, which rules out running a large sequence encoder at request time. We describe how a production advertising system resolves this conflict by decoupling history encoding from real-time inference. A high-capacity offline transformer asynchronously encodes the user's full cross-surface interaction history into a compact representation cached in a feature store, while a lightweight runtime model combines this cached representation with the user's most recent events and the request context at serving time. The offline encoder is pre-trained autoregressively on large-scale interaction logs with a dual objective - feedback prediction and next-item prediction - and the two-stage architecture is then fine-tuned for CTR prediction on the target advertising surface. Offline, the split design recovers 72-80% of the quality of a full-history runtime transformer that would be too expensive to deploy, and the cached representation is robust enough to staleness to permit inexpensive refresh policies. In production A/B experiments, the system improves the primary ranking metric by +2.77% in search advertising and +2.1% on the Yandex Advertising Network, with revenue gains of +2.26% and +0.43% respectively - without increasing serving latency.

2026-07-15T19:51:16Z Viacheslav Ovchinnikov Georgii Smirnov Nikolai Savushkin Veronika Ivanova Maksim Kuzin http://arxiv.org/abs/2607.14234v1 ICAConfPubs: A Dataset and User Interface for ICA Conference Papers (2003-2018) 2026-07-15T18:00:46Z

This paper presents a comprehensive dataset of past ICA (The International Communication Association) annual conference papers from 2003 to 2018, encompassing 27,466 papers, 21,038 authors, and 4,935 sessions. We made the dataset publicly available in both CSV and JSON formats. Additionally, we developed an API to facilitate programmatic access, and an intuitive user interface to enable users to navigate and explore the data more easily. The web application, API documentation, downloadable data, and reproducible code to obtain and process the data are available at https://ica.hongtaoh.com.

2026-07-15T18:00:46Z 21 pages; preprint Hongtao Hao Xinyue Chen Jiye Sun Yanling Zhao Jing Zhang http://arxiv.org/abs/2606.31693v2 ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping 2026-07-15T17:49:58Z

The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation pipelines, forcing complex intents through low-bandwidth retrieval or ranking interfaces and leaving a gap between language understanding and item-space fulfillment. Generative recommendation gives LLMs a direct item-space interface through semantic IDs (SIDs), but existing models mainly generate candidates for retrieval rather than translate flexible intents into item-space outcomes. We propose ShopX to address this bottleneck by unifying intent understanding, execution planning, and flexible SID-native item-space operations into a single foundation model. We deploy ShopX in agentic shopping workflows through a model-native item-fulfillment framework with a serving harness that defines a model-facing action protocol and exposes support surfaces for context access, catalog grounding, and state management. Within this framework, ShopX plans and composes SID-based item-space operations such as SID beam-search retrieval, listwise ranking, or product bundling. This model-centric design reduces lossy hand-offs between agent orchestration and item-space execution. To build ShopX, we design semantically recoverable, LLM-operable SIDs and a training recipe that equips a general LLM for flexible multi-turn item-space fulfillment while retaining the knowledge and instruction-following abilities needed by a shopping agent. We evaluate the ShopX framework against tool-mediated agentic systems on single- and multi-turn fulfillment tasks derived from anonymized Taobao production logs, showing that model-native fulfillment improves overall framework behavior, especially on complex or ambiguous requests.

2026-06-30T14:05:28Z The new version adds additional results and details Jiacheng Chen Tao Zhang Manxi Lin Dunxian Huang Teng Shi Honghao Fu Mengyan Li Xinming Zhang Chenchi Zhang Xuan Lu Xiaoxiong Du Haibin Chen Shaolin Ye Hao Chang Xiaoqi Li Shuwen Xiao Yujin Yuan Jingxuan Feng Shaopan Xiong Huimin Yi Ju Huang Qiu Shen Ying Chen Junjun Zheng Xiangheng Kong Dan Ou Haihong Tang Yuning Jiang Bo Zheng http://arxiv.org/abs/2607.14035v1 Optimizing Visibility in Generative Engines: A Critical Survey of Generative Engine Optimization (2023-2026) 2026-07-15T17:03:08Z

Generative Engine Optimization (GEO) seeks to increase content's presence, likelihood of citation, or influence in answers produced by generative engines. Since the foundational GEO paper, the field has expanded rapidly, but terminology, metrics, and evidence standards remain heterogeneous. This critical survey reviews 45 studies selected under a November 2023-July 2026 publication window, including one earlier preprint published at EMNLP after the window opened, plus relevant RAG and evaluation work. We argue that GEO is not a single ranking task but a stochastic, partially observable pipeline spanning search activation, crawling and indexing, retrieval, reranking and context allocation, citation, prominence, factual absorption, fidelity, and user behavior. The foundational paper's widely cited gains are valid within its experimental setting but conditional on a source already being present in a fixed context; they establish neither organic discoverability nor durable traffic effects. Reviewed work indicates that topical relevance and context position are the most reproducible levers, generic heuristics transfer poorly, competition can erode individual gains, and citation-oriented rewrites can impair retrieval. Commercial audits further reveal low source overlap, substantial run-to-run variability, and persistent fidelity gaps. We contribute a multistage formal model, a visibility vector separating discoverability, citation, absorption, and economic outcomes, an evidence hierarchy, and a reproducible protocol based on repeated measurements, paraphrases, controls, human validation, and multi-actor interference. Within this corpus, the evidence is narrow: already-retrieved content can causally alter its citation or use, but no reviewed technique shows a stable, longitudinal, cross-platform causal effect on organic discoverability or downstream behavior.

2026-07-15T17:03:08Z 18 pages, 8 tables, 1 figure; critical survey of 45 studies; ancillary literature matrix and search protocol included Olivier Martinez http://arxiv.org/abs/2607.14192v1 Long-term User Engagement Optimization through Model-agnostic Downstream Rewards Learning 2026-07-15T16:17:58Z

As recommender systems have matured over the past few years, their optimization objectives have evolved from a primary focus on short-term behavioral signals to a broader emphasis on long-term user engagement and retention. However, directly optimizing retention is difficult because return signals are sparse, delayed, and only partially attributable to earlier recommendations. Prior work has addressed this challenge with sequential modeling and reinforcement learning, but these approaches typically require task-specific reward engineering, substantial computational overhead, and surface-specific implementations that are difficult to generalize. In this paper, we present a unified, model-agnostic downstream reward framework for optimizing long-term user value in large-scale recommendation systems. First, we formulate the downstream reward learning problem and develop an offline screening framework to identify session-level behaviors that are both observable early and predictive of future retention. We then propose several model-agnostic downstream reward signals derived from observed user action patterns across multiple sources. We further discuss the engineering effort required to productionize the proposed reward derivations and the challenges we faced when adding them to our ranking models. Online A/B experiments demonstrate consistent improvements in engagement and retention-related metrics, and the framework has been deployed across multiple Pinterest surfaces, including Homefeed, Related Pins, Search, and Notifications.

2026-07-15T16:17:58Z Recsys 2026 Dingsu Wang Filip Ryzner Kelly He Armando Ordorica David Woo Aditya Mantha Liyao Lu Usha Amrutha Nookala Haoran Guo Jiacong He Olafur Gudmundsson Matt Chun Krystal Benitez Dhruvil Deven Badani Yijie Dylan Wang http://arxiv.org/abs/2607.14190v1 A Temporal Machine Learning-Based Time-to-Event Model for Predicting ALS Progression and Healthcare Utilization 2026-07-15T16:11:23Z

Amyotrophic lateral sclerosis (ALS) is a progressive and heterogeneous neurodegenerative disease in which predicting clinically meaningful milestones, such as assistive device use, remains challenging. We developed a time-to-event, digital-twin-inspired framework that integrates longitudinal ALS Functional Rating Scale-Revised (ALSFRS-R) trajectories with survival modeling to support individualized prediction of functional decline and assistive device utilization. We constructed a harmonized longitudinal dataset by integrating diagnosis records, ALSFRS-R assessments, activities of daily living, and demographic information, followed by preprocessing to ensure data quality, temporal alignment, and cohort consistency. Correlation-based clustering identified coherent functional domains spanning bulbar, upper limb, axial, lower limb, and respiratory systems. Generalized additive mixed models characterized nonlinear, domain-specific functional decline across all domains. In addition, a temporal machine learning model was developed to predict longitudinal functional decline and capture stage-dependent disease progression. Cox proportional hazards modeling further identified lower limb function, particularly walking and stair climbing, as the strongest predictors of earlier wheelchair access. Building on these results, we implemented a digital twin-inspired temporal machine learning-based time-to-event (TTE) model that generates individualized survival curves and dynamically predicts wheelchair-free survival. This framework provides a scalable, interpretable, and clinically actionable approach for linking ALS progression with personalized decision support, with applications in proactive care planning, clinical trial stratification, and precision medicine.

2026-07-15T16:11:23Z Zongliang Yue Qi Li Terry Heiman-Patterson Frank Bearoff Zhaohui Qin Huanmei Wu http://arxiv.org/abs/2606.12198v2 LLM-Based User Personas for Recommendations at Scale 2026-07-15T16:08:17Z

Large Language Models (LLMs) offer unprecedented potential for enhancing recommendation systems through their world knowledge and reasoning capabilities. However, existing approaches often rely on structured IDs or offline processing, limiting semantic richness, real-time adaptability, and user-facing interpretability. In this paper, we introduce a novel framework that enables real-time generation of LLM-based user interest personas for a large-scale commercial video recommendation platform. Our method generates natural-language user interest personas that address the exploitation-exploration trade-off by combining the summarization of existing interests with novel topics, directly during serving. To overcome the computational challenges of online LLM inference at a billion-user scale, we design a cost-efficient architecture leveraging knowledge distillation, asynchronous inference, and input optimization via semantically clustered video representations. Extensive offline evaluations, user studies, and live A/B tests demonstrate significant improvements in viewer value. This work bridges the gap between high-level semantic understanding and industrial-scale recommendation, paving the way for more dynamic, explainable, and satisfying personalized experiences.

2026-06-10T15:18:32Z Accepted by 2026 RecSys Industry Track Haoting Wang Haokai Lu Zheyun Feng Jenny Huang Yifat Amir Gregory Hinkson Ben Most Zelong Zhao Yixin Kelly Cui Rein Zhang Fabio Soldo Yu Xia Nihar Bhupalam Minmin Chen Konstantina Christakopoulou Lichan Hong Ed H. Chi http://arxiv.org/abs/2607.13826v1 Multimodal Assessment of Pancreatic Cancer Resectability Using Deep Learning 2026-07-15T13:32:44Z

Accurate determination of pancreatic ductal adenocarcinoma (PDAC) resectability relies on evaluating how the tumor interacts with major peripancreatic vessels on CT imaging, yet expert assessment often shows substantial variability. We introduce a fully automated multimodal deep learning framework that jointly analyzes 3D contrast-enhanced CT and structured clinical information to classify patients into the three National Comprehensive Cancer Network (NCCN) resectability categories (upfront resectable, borderline resectable, locally advanced). The approach uses a Swin-UNETR backbone to obtain anatomy-aware image representations through auxiliary segmentation of pancreas, tumor, and vascular structures. These features are fused with a compact clinical embedding derived from 17 routinely collected variables and processed by a lightweight classification head. Model training is guided by a dynamic multitask objective that adapts the balance between segmentation and classification based on current tumor Dice performance, promoting feature representations that remain both anatomically informed and discriminative.

2026-07-15T13:32:44Z Vincent Ochs Christoph Kuemmerli Florentin Bieder Julia Wolleb Joel L. Lavanchy Julia Ruppel Jan Liechti Stephanie Taha-Mehlitz Christian Andreas Nebiker Beat Mueller Giuseppe Kito Fusai Joerg-Matthias Pollok Anas Taha Philippe C. Cattin Sebastian Staubli http://arxiv.org/abs/2607.13728v1 Cluster with Auctions for Vector Search 2026-07-15T11:42:07Z

Large-scale approximate nearest neighbor search commonly relies on partitions for indexing: database vectors are partitioned into clusters, and for each query a probing function selects the clusters to be scanned. The query probing function and the database partition are rarely treated as separate entities: most techniques assign queries with the same assignment function as the database vectors, which is suboptimal especially when database and query distributions differ. This paper introduces CwA (Cluster with Auctions), which addresses this limitation by jointly learning a balanced database partition and a neural probing function. CwA optimizes search performance directly for the query distribution. It minimizes its objective by alternating two steps: (i) gradient descent on the neural network of the probing function, and (ii) a large-scale combinatorial optimization of the cluster assignment for the database vectors. We solve the latter with a parallelizable auction algorithm that balances the partition by design. To further scale CwA, we extend the method to a Cartesian product of clusters that increases the partition's granularity. When database and query distributions differ, CwA achieves up to 4.7$\times$ throughput over the state-of-the-art at equal recall. In the in-distribution (ID) setting, even a simple linear probing function trained with CwA outperforms competing deep neural methods.

2026-07-15T11:42:07Z 10 pages, 6 figures. Under review at NeurIPS 2026 Swann Bessa Pierre Fernandez Gergely Szilvasy Matthijs Douze Hervé Jégou http://arxiv.org/abs/2607.10541v3 RecRec: Recursive Refinement for Sequential Recommendation 2026-07-15T10:06:25Z

Sequential recommender systems typically infer user preferences through single-pass encoding of interaction histories without iterative refinement, relying on increasingly deep architectures to capture complex patterns. In this work, we revisit sequential recommendation from a recursive inference perspective: can user preferences be modeled as a persistent latent state that is recursively refined? We propose RecRec (Recursive Recommendation), a lightweight model that maintains a compact latent state and updates it through a shared recursive module conditioned on interaction evidence. Unlike prior recursive models, RecRec introduces an evidence-anchored correction mechanism that stabilizes refinement by grounding each update in the original interaction context, preventing semantic drift during deep recursive reasoning. Experiments on three benchmark datasets under standard evaluation protocols show that RecRec matches or outperforms state-of-the-art sequential, graph-based, and reasoning-enhanced recommenders while using only 3.9M to 14M parameters. Ablation studies demonstrate that both recursive refinement and the evidence-anchored correction gate contribute significantly to performance, highlighting the effectiveness of recursive latent inference as a scalable alternative to deeper or language-based architectures. Code is available at https://anonymous.4open.science/r/RecRec-6B67/README.md.

2026-07-12T02:53:17Z 7 pages, 3 figures Pervez Shaik Prosenjit Biswas Abhinav Thorat Ravi Kolla Niranjan Pedanekar http://arxiv.org/abs/2607.13636v1 Measuring What the Crawler Sees: Discovery Curves, Core Persistence, and Shell Dynamics in Longitudinal Web Crawls 2026-07-15T09:29:24Z

A longitudinal web crawl is a sequence of partial samples of an evolving URL population. Pairwise containment between two crawls is the standard probe; under a simple urn model of the crawl (each round samples a fraction of the URLs and replaces a fraction) it recovers two interpretable rates, per-round survival $α$ and coverage $c$, but treats the population as uniform and consumes one pair at a time. In this work, we define a formal language for talking about a crawl. We extend this analysis with the discovery curve $U(s, T)$, the cumulative URL footprint over a sliding window of $T$ crawls starting at $s$, which under the same urn model is also a closed-form function of $(α, c)$. Containment and the discovery curve are then two projections of one process: independent fits agree on $(α, c)$ when the urn is homogeneous, so any disagreement is itself a measurement. Applied to Common Crawl (2020--2025, domain granularity) and to the German Academic Web (GAW, URL granularity), the two projections disagree on both archives, and a two-component urn with a persistent core fraction $κ$ alongside shell parameters $(α_\partial, c_\partial)$ reconciles the disagreement. A residual on $c_\partial$ remains, signaling that the shell itself is not homogeneous; $κ$ is recorded as the scalar entry point to a rank-resolved generalization, which is left to follow-up work. Keywords: web archive, crawl coverage, discovery curve, urn model, two-component model, URL lifetime
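To make the urn model concrete, here is a small simulation sketch under one possible reading of the abstract; the function names and population bookkeeping are assumptions, not the authors' code. Each round the crawler independently samples every live URL with probability c, then every URL survives to the next round with probability alpha, and each URL that dies is replaced by a fresh one. Under these assumptions the one-step containment of a crawl in its successor is alpha * c in expectation. The discovery curve is computed directly as the size of the union of T consecutive crawls starting at s.

```python
import random


def simulate_crawls(n_urls, alpha, c, rounds, seed=0):
    """Homogeneous urn model (assumed reading): sample with prob. c,
    survive with prob. alpha, replace every dead URL with a new ID."""
    rng = random.Random(seed)
    population = set(range(n_urls))
    next_id = n_urls
    crawls = []
    for _ in range(rounds):
        # The crawl is a partial sample of the current population.
        crawls.append({u for u in population if rng.random() < c})
        # Survivors carry over; the rest are replaced by fresh URLs.
        survivors = {u for u in population if rng.random() < alpha}
        n_new = n_urls - len(survivors)
        population = survivors | set(range(next_id, next_id + n_new))
        next_id += n_new
    return crawls


def containment(a, b):
    """Fraction of crawl a that reappears in crawl b."""
    return len(a & b) / len(a) if a else 0.0


def discovery_curve(crawls, s, T):
    """U(s, T): distinct URLs seen across T crawls starting at s."""
    return len(set().union(*crawls[s:s + T]))


crawls = simulate_crawls(n_urls=20000, alpha=0.8, c=0.5, rounds=6)
one_step = containment(crawls[0], crawls[1])
footprints = [discovery_curve(crawls, 0, T) for T in range(1, 7)]
```

The closed-form expression for U(s, T) that the paper derives is not given in the abstract, so it is not reproduced here. The simulation only yields the empirical curve that such a formula would be fitted against.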

2026-07-15T09:29:24Z 16 pages, 4 figures, web metrics Michael Paris Hande Celikkanat Luca Foppiano http://arxiv.org/abs/2607.13609v1 Gauge-Invariant, Parameter-Insensitive Regularization for Potential Recovery from Flow on Directed Graphs 2026-07-15T08:56:57Z

Recovering a latent potential from observed flow on a directed graph (a discrete Poisson problem with Dirichlet boundaries) is ill-posed, and the standard fix backfires: ridge regularization shrinks toward a gauge-meaningless origin, collapsing and reversing the recovered ordering ($+0.81\to-0.42$ rank correlation against a planted ground truth). The gauge-invariant graph Dirichlet energy removes the hazard and delivers parameter-insensitivity: the estimate is stable across four orders of magnitude in $λ$, whereas ridge inverts the ordering for every $λ>0$. We prove the reduced solve is SPD and preserves dynamic range exactly where ridge collapses it, and localize absorbing boundaries from flow alone via a Poisson residual. The $H^1$ seminorm is classical; what is new is the gauge diagnosis, the parameter-insensitivity it buys, and an ablation showing the result is robust to the extraction method. On three public clickstream corpora the gauge-invariant estimate retains $28$--$41\%$ of the interior dynamic range while ridge collapses to as little as $0.2\%$. The same gauge invariance carries into graph neural networks -- neutralizing the constant mode per layer prevents the oversmoothing that collapses a deep directed GCN -- linking this classical inverse problem to a central question in graph learning.
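The gauge argument can be illustrated with a toy check. A potential is only defined up to an additive constant, so a penalty should not change when every node value is shifted by the same amount. The sketch below assumes the standard graph Dirichlet energy (sum of squared differences across edges, ignoring edge direction) and the usual squared-norm ridge penalty. The graph, the values, and the function names are made up for illustration and do not come from the paper.

```python
import math


def dirichlet_energy(x, edges):
    """Sum over edges of (x_u - x_v)^2; depends only on differences."""
    return sum((x[u] - x[v]) ** 2 for u, v in edges)


def ridge_penalty(x):
    """Squared Euclidean norm; depends on where the origin is placed."""
    return sum(v * v for v in x)


def shift(x, c):
    """Apply a gauge transformation: add the same constant to every node."""
    return [v + c for v in x]


# Hypothetical 4-node graph and potential.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
x = [3.0, 1.0, 0.5, -2.0]
x_shifted = shift(x, 10.0)

energy_before = dirichlet_energy(x, edges)
energy_after = dirichlet_energy(x_shifted, edges)
ridge_before = ridge_penalty(x)
ridge_after = ridge_penalty(x_shifted)
```

The Dirichlet energy is identical before and after the shift, while the ridge penalty grows sharply. This matches the abstract's point that ridge regularization pulls the estimate toward an origin with no gauge meaning.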

2026-07-15T08:56:57Z 17 pages, 6 figures, submitted to LoG 2026 Mohammad Forouhesh