https://arxiv.org/api/wzkbw1EIJyD6XoyAJz/dIVWGJ+w
Updated: 2026-03-20T20:32:07Z

http://arxiv.org/abs/2603.09558v1
No Cliques Allowed: The Next Step Towards BDD/FC Conjecture
Updated: 2026-03-10T12:05:39Z
This paper addresses one of the fundamental open questions in the realm of existential rules: the conjecture on the finite controllability of bounded derivation depth rule sets (bdd $\Rightarrow$ fc). We take a step toward a positive resolution of this conjecture by demonstrating that universal models generated by bdd rule sets cannot contain arbitrarily large tournaments (arbitrarily directed cliques) without entailing a loop query, $\exists{x} E(x, x)$. This simple yet elegant result narrows the space of potential counterexamples to the (bdd $\Rightarrow$ fc) conjecture.
Published: 2026-03-10T12:05:39Z
Comment: Published at PODS 2025
Authors: Lucas Larroque, Piotr Ostropolski-Nalewaja, Michaël Thomazo

http://arxiv.org/abs/2210.13722v6
Towards Selecting the Informative Alternative Relational Query Plans for Database Education
Updated: 2026-03-10T09:53:28Z
Off-the-shelf RDBMS typically expose only the query execution plan (QEP) of an SQL query, without presenting information about representative alternative query plans (AQPs) considered during plan selection in a user-friendly manner. Providing easy access to representative AQPs is valuable in database education, as it helps learners understand the plan choices made by a query optimizer, one of the several important components related to relational query processing. In this paper, we present a novel problem called the informative plan selection problem (TIPS), which aims to discover a set of k informative AQPs from the underlying plan space so that the plan informativeness of the set is maximized. Specifically, we explore two variants of the problem, batch TIPS and incremental TIPS, to cater to diverse learners.
Due to the computational hardness of the problem, we present an approximation algorithm to address it efficiently while providing theoretical guarantees for the results. An extensive experimental study, including feedback from real-world learners and a three-year in-class evaluation of academic outcomes, demonstrates the effectiveness of our solutions for database education.
Published: 2022-10-25T02:41:40Z
Comment: 31 pages, 15 figures. Major revision and substantial extension. This version expands the earlier demo-oriented paper into a full article on TIPS, with updated title, abstract, and author list. Accepted to Proceedings of the ACM on Management of Data (SIGMOD 2026)
Authors: Hu Wang, Hui Li, Sourav S Bhowmick, Zihao Ma

http://arxiv.org/abs/2603.10081v1
Categorical Calculus and Algebra for Multi-Model Data
Updated: 2026-03-10T09:44:48Z
Multi-model databases are designed to store, manage, and query data in various models, such as relational, hierarchical, and graph data, simultaneously. In this paper, we provide a theoretical basis for querying categorical databases. We propose two formal query languages: categorical calculus and categorical algebra, by extending relational calculus and relational algebra respectively. We demonstrate the equivalence between these two languages of queries. We propose a series of transformation rules of categorical algebra to facilitate query optimization. Finally, we analyze the expressive power and computation complexity for the proposed query languages.
Published: 2026-03-10T09:44:48Z
Comment: In Proceedings ACT 2025, arXiv:2603.07595. arXiv admin note: substantial text overlap with arXiv:2504.09515
Journal reference: EPTCS 442, 2026, pp. 75-90
Authors: Jiaheng Lu (University of Helsinki)
DOI: 10.4204/EPTCS.442.6

http://arxiv.org/abs/2603.09398v1
GeoBenchr: An Application-Centric Benchmarking Suite for Spatiotemporal Database Platforms
Updated: 2026-03-10T09:12:05Z
The rapid growth of spatiotemporal data volumes needs to be handled by database systems capable of efficiently managing and querying such data.
Existing systems such as PostGIS, SpaceTime, and MobilityDB offer partial solutions but differ widely in scope and performance. Also, first spatiotemporal benchmarks provide valuable insights but are limited in scope and, to our knowledge, no application-centric benchmarking suite exists. In this paper, we propose GeoBenchr, an open-source, application-centric benchmarking suite for spatiotemporal platforms. GeoBenchr enables comprehensive evaluation across diverse datasets, query types, and workload patterns, reflecting realistic use cases from domains such as cycling, aviation, and maritime tracking. We use our GeoBenchr prototype to evaluate several system aspects including scalability, configuration impact, and cross-platform performance comparison. Our results highlight the importance of application-centric benchmarking in selecting suitable spatiotemporal database systems for real-world scenarios.
Published: 2026-03-10T09:12:05Z
Comment: currently under review at The 27th IEEE International Conference on Mobile Data Management
Authors: Tim C. Rese, Nils Japke, Diana Baumann, Natalie Carl, David Bermbach

http://arxiv.org/abs/2603.09347v1
The Virtuous Cycle: AI-Powered Vector Search and Vector Search-Augmented AI
Updated: 2026-03-10T08:28:58Z
Modern AI and vector search are rapidly converging, forming a promising research frontier in intelligent information systems. On one hand, advances in AI have substantially improved the semantic accuracy and efficiency of vector search, including learned indexing structures, adaptive pruning strategies, and automated parameter tuning. On the other hand, powerful vector search techniques have enabled new AI paradigms, notably Retrieval-Augmented Generation (RAG), which effectively mitigates challenges in Large Language Models (LLMs) like knowledge staleness and hallucinations.
This mutual reinforcement establishes a virtuous cycle where AI injects intelligence and adaptive optimization into vector search, while vector search, in turn, expands AI's capabilities in knowledge integration and context-aware generation. This tutorial provides a comprehensive overview of recent research and advancements at this intersection. We begin by discussing the foundational background and motivations for integrating vector search and AI. Subsequently, we explore how AI empowers vector search (AI4VS) across each step of the vector search pipeline. We then investigate how vector search empowers AI (VS4AI), with a particular focus on RAG frameworks that integrate dynamic, external knowledge sources into the generative process of LLMs. Furthermore, we analyze end-to-end co-optimization strategies that fully unlock the potential of the "virtuous cycle" between vector search and AI. Finally, we highlight key challenges and future research opportunities in this emerging area. This paper was published in ICDE 2026.
Published: 2026-03-10T08:28:58Z
Authors: Jiuqi Wei, Quanqing Xu, Chuanhui Yang

http://arxiv.org/abs/2603.08036v2
Samyama: A Unified Graph-Vector Database with In-Database Optimization, Agentic Enrichment, and Hardware Acceleration
Updated: 2026-03-10T05:50:24Z
Modern data architectures are fragmented across graph databases, vector stores, analytics engines, and optimization solvers, resulting in complex ETL pipelines and synchronization overhead. We present Samyama, a high-performance graph-vector database written in Rust that unifies these workloads into a single engine. Samyama combines a RocksDB-backed persistent store with a versioned-arena MVCC model, a vectorized query executor with 35 physical operators, a cost-based query planner with plan enumeration and predicate pushdown, a dedicated CSR-based analytics engine, and native RDF/SPARQL support.
The system integrates 22 metaheuristic optimization solvers directly into its query language, implements HNSW vector indexing with Graph RAG capabilities, and introduces Agentic Enrichment for autonomous graph expansion via LLMs. The Enterprise Edition adds GPU acceleration via wgpu, production-grade observability, point-in-time recovery, and hardened high availability with HTTP/2 Raft transport.
Our evaluation on commodity hardware (Mac Mini M4, 16 GB RAM) demonstrates: ingestion at 255K nodes/s (CPU) and 412K nodes/s (GPU-accelerated); 115K Cypher queries/sec at 1M nodes; 4.0-4.7x latency reduction from late materialization on multi-hop traversals; 8.2x GPU PageRank speedup at 1M nodes; and 100% LDBC Graphalytics validation (28/28 tests). These results demonstrate that a unified graph-vector-optimization engine can achieve competitive performance on commodity hardware while maintaining Rust's memory safety guarantees.
Published: 2026-03-09T07:17:17Z
Comment: 16 pages, 4 figures, 12 tables
Authors: Madhulatha Mandarapu, Sandeep Kunkunuru

http://arxiv.org/abs/2603.09181v1
Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor
Updated: 2026-03-10T04:35:50Z
Index tuning is critical for the performance of modern database systems. Industrial index tuners, such as the Database Tuning Advisor (DTA) developed for Microsoft SQL Server, rely on the "what-if" API provided by the query optimizer to estimate the cost of a query given an index configuration, which can lead to suboptimal recommendations when the estimations are inaccurate. Large language model (LLM) offers a new approach to index tuning, with knowledge learned from web-scale training datasets. However, the effectiveness of LLM-driven index tuning, especially beyond what is already achieved by commercial index tuners, remains unclear.
In this paper, we study the practical effectiveness of LLM-driven index tuning using both industrial benchmarks and real-world enterprise customer workloads, and compare it with DTA. Our results show that although DTA is generally more reliable, with a few invocations, LLM can identify configurations that significantly outperform those found by DTA in execution time in a considerable number of cases, highlighting its potential as a complementary technique. We also observe that LLM's reasoning captures human-intuitive insights that may be distilled to potentially improve DTA. However, adopting LLM-driven index tuning in production remains challenging due to its substantial performance variance, limited and often negative impact when directly integrated into DTA, and the high cost of performance validation. This work provides motivation, lessons, and practical insights that will inspire future work on LLM-driven index tuning both in academia and industry.
Published: 2026-03-10T04:35:50Z
Authors: Xiaoying Wang, Wentao Wu, Vivek Narasayya, Surajit Chaudhuri

http://arxiv.org/abs/2603.09152v1
DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering
Updated: 2026-03-10T03:44:52Z
Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation.
The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge graph transformation via the mapping function T:D x S x R -> G, and implement natural language-based consultation that - unlike fixed workflow multi-agent systems - enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's d > 1). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.
Published: 2026-03-10T03:44:52Z
Comment: Published in Information Processing & Management, 2026
Journal reference: Information Processing & Management, 63(6):104723, 2026
Authors: Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao
DOI: 10.1016/j.ipm.2026.104723

http://arxiv.org/abs/2603.09122v1
Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration
Updated: 2026-03-10T02:55:37Z
Distributed key-value stores are widely adopted to support elastic big data applications, leveraging purpose-built consensus algorithms like Raft to ensure data consistency.
However, through systematic analysis, we reveal a critical performance issue in such consistent stores, i.e., overlapping persistence operations between consensus protocols and underlying storage engines result in significant I/O overhead. To address this issue, we present Nezha, a prototype distributed storage system that innovatively integrates key-value separation with Raft to provide scalable throughput in a strong consistency guarantee. Nezha redesigns the persistence strategy at the operation level and incorporates leveled garbage collection, significantly improving read and write performance while preserving Raft's safety properties. Experimental results demonstrate that, on average, Nezha achieves throughput improvements of 460.2%, 12.5%, and 72.6% for put, get, and scan operations, respectively.
Published: 2026-03-10T02:55:37Z
Comment: Accepted to ICDE 2026 (main research track). The main paper is 12 pages excluding references
Authors: Yangyang Wang, Yucong Dong, Ziqian Cheng, Zichen Xu

http://arxiv.org/abs/2603.08037v2
CEMR: An Effective Subgraph Matching Algorithm with Redundant Extension Elimination
Updated: 2026-03-10T02:34:23Z
Subgraph matching is a fundamental problem in graph analysis with a wide range of applications. However, due to its inherent NP-hardness, enumerating subgraph matches efficiently on large real-world graphs remains highly challenging. Most existing works adopt a depth-first search (DFS) backtracking strategy, where a partial embedding is gradually extended in a DFS manner along a branch of the search trees until either a full embedding is found or no further extension is possible. A major limitation of this paradigm is the significant amount of duplicate computation that occurs during enumeration, which increases the overall runtime. To overcome this limitation, we propose a novel subgraph matching algorithm, CEMR.
It incorporates two techniques to reduce duplicate extensions: common extension merging, which leverages a black-white vertex encoding, and common extension reusing, which employs common extension buffers. In addition, we design two pruning techniques to discard unpromising search branches. Extensive experiments on real-world datasets and diverse query workloads demonstrate that CEMR outperforms state-of-the-art subgraph matching methods.
Published: 2026-03-09T07:20:24Z
Comment: Accepted to PVLDB (VLDB 2026). This arXiv version contains the full version of the paper
Authors: Linglin Yang, Xunbin Su, Lei Zou, Xiangyang Gou, Yinnian Lin

http://arxiv.org/abs/2503.10036v4
Modeling Concurrency Control as a Learnable Function
Updated: 2026-03-10T00:45:56Z
Concurrency control (CC) algorithms are important in modern transactional databases, as they enable high performance by executing transactions concurrently while ensuring correctness. However, state-of-the-art CC algorithms struggle to perform well across diverse workloads, and most do not consider workload drifts.
In this paper, we propose NeurCC, a novel learned concurrency control algorithm that achieves high performance across diverse workloads. The algorithm is quick to optimize, making it robust against dynamic workloads. It learns a function that captures a large number of design choices from existing CC algorithms. The function is implemented as an efficient in-database lookup table that maps database states to concurrency control actions. The learning process is based on a combination of Bayesian optimization and a novel graph reduction search algorithm, which converges quickly to a function that achieves high transaction throughput. We compare NeurCC against five state-of-the-art CC algorithms and show that it consistently outperforms the baselines both in transaction throughput and in optimization time.
Published: 2025-03-13T04:34:56Z
Authors: Hexiang Pan, Shaofeng Cai, Tien Tuan Anh Dinh, Yuncheng Wu, Yeow Meng Chee, Gang Chen, Beng Chin Ooi

http://arxiv.org/abs/2603.08957v1
Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation
Updated: 2026-03-09T21:43:39Z
A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation.
We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
Published: 2026-03-09T21:43:39Z
Authors: Yuxin Tang, Zhiyuan Xin, Zhimin Ding, Xinyu Yao, Daniel Bourgeois, Tirthak Patel, Chris Jermaine

http://arxiv.org/abs/2603.08880v1
OptBench: An Interactive Workbench for AI/ML-SQL Co-Optimization [Extended Demonstration Proposal]
Updated: 2026-03-09T19:45:43Z
Database workloads are increasingly nesting artificial intelligence (AI) and machine learning (ML) pipelines and AI/ML model inferences with data processing, yielding hybrid SQL+AI/ML queries that mix relational operators with expensive, opaque AI/ML operators, often expressed as UDFs. These workloads are challenging to optimize because ML operators behave like black boxes, data-dependent effects such as sparsity, selectivity, and cardinalities can dominate runtime, domain experts often rely on practical heuristics that are difficult to develop with monolithic optimizers, and AI/ML operators introduce numerous co-optimization opportunities such as factorization, pushdown, ML-to-SQL conversion, and linear-algebra-to-relational-algebra rewrites, significantly enlarging the search space of equivalent execution plans. At the same time, research prototypes for SQL+ML optimization are difficult to evaluate fairly because they are typically developed on different platforms and evaluated using different queries.
We present OptBench, an interactive workbench for building and benchmarking query optimizers for hybrid SQL+AI/ML queries in a transparent, apples-to-apples manner. OptBench runs all optimizers on a unified backend using DuckDB and exposes an interactive web interface that allows users to (i) construct query optimizers by leveraging and extending abstracted logical plan rewrite actions, (ii) benchmark and compare different optimizer implementations over a suite of diverse queries while recording decision traces and latency, and (iii) visualize logical plans produced by different optimizers side-by-side. The system enables practitioners and researchers to prototype optimizer ideas, inspect plan transformations, and quantitatively compare optimizer designs on multimodal inference queries within a single workbench.
Published: 2026-03-09T19:45:43Z
Comment: 12 pages. Extended version of accepted SIGMOD 2026 demonstration paper
Authors: Jaykumar Tandel, Douglas Oscarson, Jia Zou

http://arxiv.org/abs/2603.08612v1
Query-Guided Analysis and Mitigation of Data Verification Errors (Extended Version)
Updated: 2026-03-09T16:57:45Z
Data verification, the process of labeling data items as correct or incorrect, is a preprocessing step that may critically affect the quality of results in data-driven pipelines. Despite recent advances, verification can still produce erroneous labels that propagate to downstream query results in complex ways. We present a framework that complements existing verification tools by assessing the impact of potential labeling errors on query outputs and guiding additional verification steps to improve result reliability. To this end, we introduce Maximal Error Score (MES), a worst-case uncertainty metric that quantifies the reliability of query output tuples independently of the underlying data distribution. As an auxiliary indicator, we identify risky tuples - input tuples for which reducing label uncertainty may counterintuitively increase the output uncertainty.
We then develop efficient algorithms for computing MES and detecting risky tuples, as well as a generic algorithm, named MESReduce, that builds on both indicators and interacts with external verifiers to select effective additional verification steps. We implement our techniques in a prototype system and evaluate them on real and synthetic datasets, demonstrating that MESReduce can substantially and effectively reduce the MES and improve the accuracy of verification results.
Published: 2026-03-09T16:57:45Z
Comment: This paper is the extended version of a paper accepted to the IEEE International Conference on Data Engineering (ICDE) 2026
Authors: Ran Schreiber, Yael Amsterdamer

http://arxiv.org/abs/2602.03278v2
A Pipeline for ADNI Resting-State Functional MRI Processing and Quality Control
Updated: 2026-03-09T14:43:13Z
The Alzheimer's Disease Neuroimaging Initiative (ADNI) provides a comprehensive multimodal neuroimaging resource for studying aging and Alzheimer's disease (AD). Since its second wave, ADNI has increasingly collected resting-state functional MRI (rs-fMRI), a valuable resource for discovering brain connectivity changes predictive of cognitive decline and AD. A major barrier to its use is the considerable variability in acquisition protocols and data quality, compounded by missing imaging sessions and inconsistencies in how functional scans temporally align with clinical assessments. As a result, many studies only utilize a small subset of the total rs-fMRI data, limiting statistical power, reproducibility, and the ability to study longitudinal functional brain changes at scale. Here, we describe a pipeline for ADNI rs-fMRI data that encompasses the download of necessary imaging and clinical data, temporally aligning the clinical and imaging data, preprocessing, and quality control. We integrate data curation and preprocessing across all ADNI sites and scanner types using a combination of open-source software (Clinica, fMRIPrep, and MRIQC) and bespoke tools.
Quality metrics and reports are generated for each subject and session to facilitate rigorous data screening. All scripts and configuration files are available to enable reproducibility. The pipeline, which currently supports ADNI-GO, ADNI-2, and ADNI-3 data releases, outputs high-quality rs-fMRI time series data adhering to the BIDS-derivatives specification. This protocol provides a transparent and scalable framework for curating and utilizing ADNI fMRI data, empowering large-scale functional biomarker discovery and integrative multimodal analyses in Alzheimer's disease research.
Published: 2026-02-03T09:01:50Z
Authors: Saige Rutherford, Zeshawn Zahid, Robert C. Welsh, Andrea Avena-Koenigsberger, Vincent Koppelmans, Amanda F. Mejia