http://arxiv.org/abs/2603.09558v1 No Cliques Allowed: The Next Step Towards BDD/FC Conjecture 2026-03-10T12:05:39Z

This paper addresses one of the fundamental open questions in the realm of existential rules: the conjecture on the finite controllability of bounded derivation depth rule sets (bdd $\Rightarrow$ fc). We take a step toward a positive resolution of this conjecture by demonstrating that universal models generated by bdd rule sets cannot contain arbitrarily large tournaments (arbitrarily directed cliques) without entailing a loop query, $\exists{x} E(x, x)$. This simple yet elegant result narrows the space of potential counterexamples to the (bdd $\Rightarrow$ fc) conjecture.
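The two notions the abstract relies on, a tournament and the loop query $\exists{x} E(x, x)$, can be illustrated on a finite edge set. The following is a minimal sketch and not the paper's construction: it assumes a single binary relation E given as a set of pairs, treats a tournament as a vertex set in which every pair of distinct vertices is joined by an edge in at least one direction, and uses hypothetical helper names throughout. The search is brute force and only meant for tiny examples.

```python
from itertools import combinations


def has_loop(edges):
    """True if the loop query 'exists x. E(x, x)' holds in the edge set."""
    return any(u == v for u, v in edges)


def is_tournament(vertices, edges):
    """True if every pair of distinct vertices has an edge in some direction."""
    return all((u, v) in edges or (v, u) in edges
               for u, v in combinations(vertices, 2))


def largest_tournament_size(edges):
    """Brute-force size of the largest tournament (exponential time)."""
    nodes = sorted({x for edge in edges for x in edge})
    for k in range(len(nodes), 0, -1):
        if any(is_tournament(c, edges) for c in combinations(nodes, k)):
            return k
    return 0


# Hypothetical example: a directed 3-cycle plus one pendant edge.
E = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
print(has_loop(E))                 # False
print(largest_tournament_size(E))  # 3
```

In this vocabulary, the abstract's result says that a universal model of a bdd rule set cannot have unboundedly large tournaments unless the loop query is entailed.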

2026-03-10T12:05:39Z Published at PODS 2025 Lucas Larroque Piotr Ostropolski-Nalewaja Michaël Thomazo http://arxiv.org/abs/2210.13722v6 Towards Selecting the Informative Alternative Relational Query Plans for Database Education 2026-03-10T09:53:28Z

Off-the-shelf RDBMS typically expose only the query execution plan (QEP) of an SQL query, without presenting information about representative alternative query plans (AQPs) considered during plan selection in a user-friendly manner. Providing easy access to representative AQPs is valuable in database education, as it helps learners understand the plan choices made by a query optimizer, one of the several important components related to relational query processing. In this paper, we present a novel problem called the informative plan selection problem (TIPS), which aims to discover a set of k informative AQPs from the underlying plan space so that the plan informativeness of the set is maximized. Specifically, we explore two variants of the problem, batch TIPS and incremental TIPS, to cater to diverse learners. Due to the computational hardness of the problem, we present an approximation algorithm to address it efficiently while providing theoretical guarantees for the results. An extensive experimental study, including feedback from real-world learners and a three-year in-class evaluation of academic outcomes, demonstrates the effectiveness of our solutions for database education.

2022-10-25T02:41:40Z 31 pages, 15 figures. Major revision and substantial extension. This version expands the earlier demo-oriented paper into a full article on TIPS, with updated title, abstract, and author list. Accepted to Proceedings of the ACM on Management of Data (SIGMOD 2026) Hu Wang Hui Li Sourav S Bhowmick Zihao Ma http://arxiv.org/abs/2603.10081v1 Categorical Calculus and Algebra for Multi-Model Data 2026-03-10T09:44:48Z

Multi-model databases are designed to store, manage, and query data in various models, such as relational, hierarchical, and graph data, simultaneously. In this paper, we provide a theoretical basis for querying categorical databases. We propose two formal query languages: categorical calculus and categorical algebra, by extending relational calculus and relational algebra respectively. We demonstrate the equivalence between these two languages of queries. We propose a series of transformation rules of categorical algebra to facilitate query optimization. Finally, we analyze the expressive power and computation complexity for the proposed query languages.

2026-03-10T09:44:48Z In Proceedings ACT 2025, arXiv:2603.07595. arXiv admin note: substantial text overlap with arXiv:2504.09515 EPTCS 442, 2026, pp. 75-90 Jiaheng Lu University of Helsinki 10.4204/EPTCS.442.6 http://arxiv.org/abs/2603.09398v1 GeoBenchr: An Application-Centric Benchmarking Suite for Spatiotemporal Database Platforms 2026-03-10T09:12:05Z

Rapidly growing volumes of spatiotemporal data call for database systems capable of efficiently managing and querying such data. Existing systems such as PostGIS, SpaceTime, and MobilityDB offer partial solutions but differ widely in scope and performance. Initial spatiotemporal benchmarks provide valuable insights but are limited in scope, and, to our knowledge, no application-centric benchmarking suite exists. In this paper, we propose GeoBenchr, an open-source, application-centric benchmarking suite for spatiotemporal platforms. GeoBenchr enables comprehensive evaluation across diverse datasets, query types, and workload patterns, reflecting realistic use cases from domains such as cycling, aviation, and maritime tracking. We use our GeoBenchr prototype to evaluate several system aspects including scalability, configuration impact, and cross-platform performance comparison. Our results highlight the importance of application-centric benchmarking in selecting suitable spatiotemporal database systems for real-world scenarios.

2026-03-10T09:12:05Z currently under review at The 27th IEEE International Conference on Mobile Data Management Tim C. Rese Nils Japke Diana Baumann Natalie Carl David Bermbach http://arxiv.org/abs/2603.09347v1 The Virtuous Cycle: AI-Powered Vector Search and Vector Search-Augmented AI 2026-03-10T08:28:58Z

Modern AI and vector search are rapidly converging, forming a promising research frontier in intelligent information systems. On one hand, advances in AI have substantially improved the semantic accuracy and efficiency of vector search, including learned indexing structures, adaptive pruning strategies, and automated parameter tuning. On the other hand, powerful vector search techniques have enabled new AI paradigms, notably Retrieval-Augmented Generation (RAG), which effectively mitigates challenges in Large Language Models (LLMs) like knowledge staleness and hallucinations. This mutual reinforcement establishes a virtuous cycle where AI injects intelligence and adaptive optimization into vector search, while vector search, in turn, expands AI's capabilities in knowledge integration and context-aware generation. This tutorial provides a comprehensive overview of recent research and advancements at this intersection. We begin by discussing the foundational background and motivations for integrating vector search and AI. Subsequently, we explore how AI empowers vector search (AI4VS) across each step of the vector search pipeline. We then investigate how vector search empowers AI (VS4AI), with a particular focus on RAG frameworks that integrate dynamic, external knowledge sources into the generative process of LLMs. Furthermore, we analyze end-to-end co-optimization strategies that fully unlock the potential of the "virtuous cycle" between vector search and AI. Finally, we highlight key challenges and future research opportunities in this emerging area. This paper was published in ICDE 2026.

2026-03-10T08:28:58Z Jiuqi Wei Quanqing Xu Chuanhui Yang http://arxiv.org/abs/2603.08036v2 Samyama: A Unified Graph-Vector Database with In-Database Optimization, Agentic Enrichment, and Hardware Acceleration 2026-03-10T05:50:24Z

Modern data architectures are fragmented across graph databases, vector stores, analytics engines, and optimization solvers, resulting in complex ETL pipelines and synchronization overhead. We present Samyama, a high-performance graph-vector database written in Rust that unifies these workloads into a single engine. Samyama combines a RocksDB-backed persistent store with a versioned-arena MVCC model, a vectorized query executor with 35 physical operators, a cost-based query planner with plan enumeration and predicate pushdown, a dedicated CSR-based analytics engine, and native RDF/SPARQL support. The system integrates 22 metaheuristic optimization solvers directly into its query language, implements HNSW vector indexing with Graph RAG capabilities, and introduces Agentic Enrichment for autonomous graph expansion via LLMs. The Enterprise Edition adds GPU acceleration via wgpu, production-grade observability, point-in-time recovery, and hardened high availability with HTTP/2 Raft transport. Our evaluation on commodity hardware (Mac Mini M4, 16 GB RAM) demonstrates: ingestion at 255K nodes/s (CPU) and 412K nodes/s (GPU-accelerated); 115K Cypher queries/sec at 1M nodes; 4.0-4.7x latency reduction from late materialization on multi-hop traversals; 8.2x GPU PageRank speedup at 1M nodes; and 100% LDBC Graphalytics validation (28/28 tests). These results demonstrate that a unified graph-vector-optimization engine can achieve competitive performance on commodity hardware while maintaining Rust's memory safety guarantees.

2026-03-09T07:17:17Z 16 pages, 4 figures, 12 tables Madhulatha Mandarapu Sandeep Kunkunuru http://arxiv.org/abs/2603.09181v1 Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor 2026-03-10T04:35:50Z

Index tuning is critical for the performance of modern database systems. Industrial index tuners, such as the Database Tuning Advisor (DTA) developed for Microsoft SQL Server, rely on the "what-if" API provided by the query optimizer to estimate the cost of a query given an index configuration, which can lead to suboptimal recommendations when the estimations are inaccurate. Large language models (LLMs) offer a new approach to index tuning, with knowledge learned from web-scale training datasets. However, the effectiveness of LLM-driven index tuning, especially beyond what is already achieved by commercial index tuners, remains unclear. In this paper, we study the practical effectiveness of LLM-driven index tuning using both industrial benchmarks and real-world enterprise customer workloads, and compare it with DTA. Our results show that although DTA is generally more reliable, an LLM can, with a few invocations, identify configurations that significantly outperform those found by DTA in execution time in a considerable number of cases, highlighting its potential as a complementary technique. We also observe that the LLM's reasoning captures human-intuitive insights that may be distilled to potentially improve DTA. However, adopting LLM-driven index tuning in production remains challenging due to its substantial performance variance, limited and often negative impact when directly integrated into DTA, and the high cost of performance validation. This work provides motivation, lessons, and practical insights that will inspire future work on LLM-driven index tuning both in academia and industry.

2026-03-10T04:35:50Z Xiaoying Wang Wentao Wu Vivek Narasayya Surajit Chaudhuri http://arxiv.org/abs/2603.09152v1 DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering 2026-03-10T03:44:52Z

Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge graph transformation via the mapping function $T: D \times S \times R \to G$, and implement natural language-based consultation that, unlike fixed-workflow multi-agent systems, enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's d > 1). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.

2026-03-10T03:44:52Z Published in Information Processing & Management, 2026 Information Processing & Management, 63(6):104723, 2026 Tong Wang Chi Jin Yongkang Chen Huan Deng Xiaohui Kuang Gang Zhao 10.1016/j.ipm.2026.104723 http://arxiv.org/abs/2603.09122v1 Nezha: A Key-Value Separated Distributed Store with Optimized Raft Integration 2026-03-10T02:55:37Z

Distributed key-value stores are widely adopted to support elastic big data applications, leveraging purpose-built consensus algorithms like Raft to ensure data consistency. However, through systematic analysis, we reveal a critical performance issue in such consistent stores: overlapping persistence operations between consensus protocols and underlying storage engines result in significant I/O overhead. To address this issue, we present Nezha, a prototype distributed storage system that integrates key-value separation with Raft to provide scalable throughput under a strong consistency guarantee. Nezha redesigns the persistence strategy at the operation level and incorporates leveled garbage collection, significantly improving read and write performance while preserving Raft's safety properties. Experimental results demonstrate that, on average, Nezha achieves throughput improvements of 460.2%, 12.5%, and 72.6% for put, get, and scan operations, respectively.
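To read the reported percentages as ratios, the short sketch below converts each figure into a throughput multiple. It assumes that "improvement" means a relative increase over the baseline, so p% corresponds to a factor of 1 + p/100. The abstract does not spell out this convention, and the function name is hypothetical.

```python
def throughput_multiple(improvement_percent):
    """Convert a relative improvement in percent into a multiple of the baseline."""
    return 1 + improvement_percent / 100


# Percentages as reported in the abstract for put, get, and scan.
for op, pct in [("put", 460.2), ("get", 12.5), ("scan", 72.6)]:
    print(f"{op}: {throughput_multiple(pct):.3f}x baseline throughput")
```

Under that reading, the put figure corresponds to roughly 5.6 times the baseline throughput, while get and scan stay below a factor of two.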

2026-03-10T02:55:37Z Accepted to ICDE 2026 (main research track). The main paper is 12 pages excluding references Yangyang Wang Yucong Dong Ziqian Cheng Zichen Xu http://arxiv.org/abs/2603.08037v2 CEMR: An Effective Subgraph Matching Algorithm with Redundant Extension Elimination 2026-03-10T02:34:23Z

Subgraph matching is a fundamental problem in graph analysis with a wide range of applications. However, due to its inherent NP-hardness, enumerating subgraph matches efficiently on large real-world graphs remains highly challenging. Most existing works adopt a depth-first search (DFS) backtracking strategy, where a partial embedding is gradually extended in a DFS manner along a branch of the search trees until either a full embedding is found or no further extension is possible. A major limitation of this paradigm is the significant amount of duplicate computation that occurs during enumeration, which increases the overall runtime. To overcome this limitation, we propose a novel subgraph matching algorithm, CEMR. It incorporates two techniques to reduce duplicate extensions: common extension merging, which leverages a black-white vertex encoding, and common extension reusing, which employs common extension buffers. In addition, we design two pruning techniques to discard unpromising search branches. Extensive experiments on real-world datasets and diverse query workloads demonstrate that CEMR outperforms state-of-the-art subgraph matching methods.
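The DFS backtracking paradigm that the abstract describes as the common baseline can be sketched as follows. This is not CEMR itself: it is a plain, unoptimized enumerator, assuming undirected, unlabeled graphs stored as adjacency dictionaries and a fixed matching order. All names are hypothetical. It makes the source of duplicate work visible, since every partial embedding recomputes its candidate extensions from scratch.

```python
def subgraph_matches(query_adj, data_adj):
    """Enumerate injective embeddings of the query graph into the data graph."""
    order = list(query_adj)  # fixed matching order (assumption: insertion order)
    results = []

    def extend(embedding):
        # A full embedding maps every query vertex to a data vertex.
        if len(embedding) == len(order):
            results.append(dict(embedding))
            return
        u = order[len(embedding)]  # next query vertex to match
        for v in data_adj:
            if v in embedding.values():
                continue  # keep the mapping injective
            # Each already-matched query neighbour of u must map to a neighbour of v.
            if all(embedding[w] in data_adj[v]
                   for w in query_adj[u] if w in embedding):
                embedding[u] = v
                extend(embedding)   # go deeper along this branch
                del embedding[u]    # backtrack

    extend({})
    return results


# Hypothetical example: a triangle query against a triangle with one pendant vertex.
query = {"q0": {"q1", "q2"}, "q1": {"q0", "q2"}, "q2": {"q0", "q1"}}
data = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
matches = subgraph_matches(query, data)
print(len(matches))  # 6, one per ordering of the data triangle {1, 2, 3}
```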

2026-03-09T07:20:24Z Accepted to PVLDB (VLDB 2026). This arXiv version contains the full version of the paper Linglin Yang Xunbin Su Lei Zou Xiangyang Gou Yinnian Lin http://arxiv.org/abs/2503.10036v4 Modeling Concurrency Control as a Learnable Function 2026-03-10T00:45:56Z

Concurrency control (CC) algorithms are important in modern transactional databases, as they enable high performance by executing transactions concurrently while ensuring correctness. However, state-of-the-art CC algorithms struggle to perform well across diverse workloads, and most do not consider workload drifts. In this paper, we propose NeurCC, a novel learned concurrency control algorithm that achieves high performance across diverse workloads. The algorithm is quick to optimize, making it robust against dynamic workloads. It learns a function that captures a large number of design choices from existing CC algorithms. The function is implemented as an efficient in-database lookup table that maps database states to concurrency control actions. The learning process is based on a combination of Bayesian optimization and a novel graph reduction search algorithm, which converges quickly to a function that achieves high transaction throughput. We compare NeurCC against five state-of-the-art CC algorithms and show that it consistently outperforms the baselines both in transaction throughput and in optimization time.

2025-03-13T04:34:56Z Hexiang Pan Shaofeng Cai Tien Tuan Anh Dinh Yuncheng Wu Yeow Meng Chee Gang Chen Beng Chin Ooi http://arxiv.org/abs/2603.08957v1 Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation 2026-03-09T21:43:39Z

A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
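The general idea of a tensor-relational computation can be illustrated with a block-sparse matrix product, which in Einstein notation is the contraction ik,kj->ij. The sketch below is not the paper's \texttt{EinSum} notation or its rewriting procedure. It assumes each relation is a dictionary from block coordinates to a small dense block, with absent keys standing for all-zero blocks, and it uses a pure-Python function as a stand-in for an optimized dense kernel. All names are hypothetical.

```python
from collections import defaultdict


def dense_matmul(x, y):
    """Stand-in for a high-performance kernel: multiply two dense blocks."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def dense_add(x, y):
    """Element-wise sum of two dense blocks of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def tensor_relational_matmul(A, B):
    """Join A(i, k, block) with B(k, j, block) on k, then sum per (i, j)."""
    b_by_k = defaultdict(list)  # index B on the join key k
    for (k, j), block in B.items():
        b_by_k[k].append((j, block))
    out = {}
    for (i, k), a_block in A.items():
        for j, b_block in b_by_k.get(k, []):
            prod = dense_matmul(a_block, b_block)  # dense work inside a tuple
            out[(i, j)] = dense_add(out[(i, j)], prod) if (i, j) in out else prod
    return out


# Hypothetical 2x2 blocks; block row 1 of A is entirely zero and never stored.
A = {(0, 0): [[1, 2], [3, 4]], (0, 1): [[1, 0], [0, 1]]}
B = {(0, 0): [[1, 0], [0, 1]], (1, 0): [[5, 6], [7, 8]]}
print(tensor_relational_matmul(A, B))  # {(0, 0): [[6, 8], [10, 12]]}
```

The join and the grouped sum handle sparsity at the relational level, while the arithmetic inside each pair of matching tuples is left to the dense kernel.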

2026-03-09T21:43:39Z Yuxin Tang Zhiyuan Xin Zhimin Ding Xinyu Yao Daniel Bourgeois Tirthak Patel Chris Jermaine http://arxiv.org/abs/2603.08880v1 OptBench: An Interactive Workbench for AI/ML-SQL Co-Optimization [Extended Demonstration Proposal] 2026-03-09T19:45:43Z

Database workloads are increasingly nesting artificial intelligence (AI) and machine learning (ML) pipelines and AI/ML model inferences with data processing, yielding hybrid SQL+AI/ML queries that mix relational operators with expensive, opaque AI/ML operators, often expressed as UDFs. These workloads are challenging to optimize because ML operators behave like black boxes, data-dependent effects such as sparsity, selectivity, and cardinalities can dominate runtime, domain experts often rely on practical heuristics that are difficult to develop with monolithic optimizers, and AI/ML operators introduce numerous co-optimization opportunities such as factorization, pushdown, ML-to-SQL conversion, and linear-algebra-to-relational-algebra rewrites, significantly enlarging the search space of equivalent execution plans. At the same time, research prototypes for SQL+ML optimization are difficult to evaluate fairly because they are typically developed on different platforms and evaluated using different queries. We present OptBench, an interactive workbench for building and benchmarking query optimizers for hybrid SQL+AI/ML queries in a transparent, apples-to-apples manner. OptBench runs all optimizers on a unified backend using DuckDB and exposes an interactive web interface that allows users to (i) construct query optimizers by leveraging and extending abstracted logical plan rewrite actions, (ii) benchmark and compare different optimizer implementations over a suite of diverse queries while recording decision traces and latency, and (iii) visualize logical plans produced by different optimizers side-by-side. The system enables practitioners and researchers to prototype optimizer ideas, inspect plan transformations, and quantitatively compare optimizer designs on multimodal inference queries within a single workbench.

2026-03-09T19:45:43Z 12 pages. Extended version of accepted SIGMOD 2026 demonstration paper Jaykumar Tandel Douglas Oscarson Jia Zou http://arxiv.org/abs/2603.08612v1 Query-Guided Analysis and Mitigation of Data Verification Errors (Extended Version) 2026-03-09T16:57:45Z

Data verification, the process of labeling data items as correct or incorrect, is a preprocessing step that may critically affect the quality of results in data-driven pipelines. Despite recent advances, verification can still produce erroneous labels that propagate to downstream query results in complex ways. We present a framework that complements existing verification tools by assessing the impact of potential labeling errors on query outputs and guiding additional verification steps to improve result reliability. To this end, we introduce Maximal Error Score (MES), a worst-case uncertainty metric that quantifies the reliability of query output tuples independently of the underlying data distribution. As an auxiliary indicator, we identify risky tuples: input tuples for which reducing label uncertainty may counterintuitively increase the output uncertainty. We then develop efficient algorithms for computing MES and detecting risky tuples, as well as a generic algorithm, named MESReduce, that builds on both indicators and interacts with external verifiers to select effective additional verification steps. We implement our techniques in a prototype system and evaluate them on real and synthetic datasets, demonstrating that MESReduce can substantially and effectively reduce the MES and improve the accuracy of verification results.

2026-03-09T16:57:45Z This paper is the extended version of a paper accepted to the IEEE International Conference on Data Engineering (ICDE) 2026 Ran Schreiber Yael Amsterdamer http://arxiv.org/abs/2602.03278v2 A Pipeline for ADNI Resting-State Functional MRI Processing and Quality Control 2026-03-09T14:43:13Z

The Alzheimer's Disease Neuroimaging Initiative (ADNI) provides a comprehensive multimodal neuroimaging resource for studying aging and Alzheimer's disease (AD). Since its second wave, ADNI has increasingly collected resting-state functional MRI (rs-fMRI), a valuable resource for discovering brain connectivity changes predictive of cognitive decline and AD. A major barrier to its use is the considerable variability in acquisition protocols and data quality, compounded by missing imaging sessions and inconsistencies in how functional scans temporally align with clinical assessments. As a result, many studies only utilize a small subset of the total rs-fMRI data, limiting statistical power, reproducibility, and the ability to study longitudinal functional brain changes at scale. Here, we describe a pipeline for ADNI rs-fMRI data that encompasses the download of necessary imaging and clinical data, temporally aligning the clinical and imaging data, preprocessing, and quality control. We integrate data curation and preprocessing across all ADNI sites and scanner types using a combination of open-source software (Clinica, fMRIPrep, and MRIQC) and bespoke tools. Quality metrics and reports are generated for each subject and session to facilitate rigorous data screening. All scripts and configuration files are available to enable reproducibility. The pipeline, which currently supports ADNI-GO, ADNI-2, and ADNI-3 data releases, outputs high-quality rs-fMRI time series data adhering to the BIDS-derivatives specification. This protocol provides a transparent and scalable framework for curating and utilizing ADNI fMRI data, empowering large-scale functional biomarker discovery and integrative multimodal analyses in Alzheimer's disease research.

2026-02-03T09:01:50Z Saige Rutherford Zeshawn Zahid Robert C. Welsh Andrea Avena-Koenigsberger Vincent Koppelmans Amanda F. Mejia