http://arxiv.org/abs/2603.01779v1 Disk-Resident Graph ANN Search: An Experimental Evaluation 2026-03-02T12:05:09Z

As data volumes grow while memory capacity remains limited, disk-resident graph-based approximate nearest neighbor (ANN) methods have become a practical alternative to memory-resident designs, shifting the bottleneck from computation to disk I/O. However, since their technical designs diverge widely across storage, layout, and execution paradigms, a systematic understanding of their fundamental performance trade-offs remains elusive. This paper presents a comprehensive experimental study of disk-resident graph-based ANN methods. First, we decompose such systems into five key technical components, i.e., storage strategy, disk layout, cache management, query execution, and update mechanism, and build a unified taxonomy of existing designs across these components. Second, we conduct fine-grained evaluations of representative strategies for each technical component to analyze the trade-offs in throughput, recall, and resource utilization. Third, we perform comprehensive end-to-end experiments and parameter-sensitivity analyses to evaluate overall system performance under diverse configurations. Fourth, our study reveals several non-obvious findings: (1) vector dimensionality fundamentally reshapes component effectiveness, necessitating dimension-aware design; (2) existing layout strategies exhibit surprisingly low I/O utilization (less than or equal to 15%); (3) page size critically affects feasibility and efficiency, with smaller pages preferred when layouts are carefully optimized; and (4) update strategies present clear workload-dependent trade-offs between in-place and out-of-place designs. Based on these findings, we derive practical guidelines for system design and configuration, and outline promising directions for future research.

2026-03-02T12:05:09Z Xiaoyu Chen Jinxiu Qu Yitong Song Shuhang Lu Huiling Li Minghui Jiang Wei Zhou Jianliang Xu Xuanhe Zhou Fan Wu http://arxiv.org/abs/2603.01598v1 Graph-centric Cross-model Data Integration and Analytics in a Unified Multi-model Database 2026-03-02T08:27:27Z

Graph-centric cross-model data integration and analytics (GCDIA) refer to tasks that leverage the graph model as a central paradigm to integrate relevant information across heterogeneous data models, such as relational and document, and subsequently perform complex analytics such as regression and similarity computation. As modern applications generate increasingly diverse data and move beyond simple retrieval toward advanced analytical objectives (e.g., prediction and recommendation), GCDIA has become increasingly important. Existing multi-model databases (MMDBs) struggle to efficiently support both integration (GCDI) and analytics (GCDA) in GCDIA. They typically separate graph processing from other models without global optimization for GCDI, while relying on tuple-at-a-time execution for GCDA, leading to limited performance and scalability. To address these limitations, we propose GredoDB, a unified MMDB that natively supports storing graph, relational, and document models, while efficiently processing GCDIA. Specifically, we design 1) topology- and attribute-aware graph operators for efficient predicate-aware traversal, 2) a unified GCDI optimization framework to exploit cross-model correlations, and 3) a parallel GCDA architecture that materializes intermediate results for operator-level execution. Experiments on the widely adopted multi-model benchmark M2Bench demonstrate that, in terms of response time, GredoDB achieves up to 107.89 times and an average of 10.89 times speedup on GCDI, and up to 356.72 times and an average of 37.79 times on GCDA, compared to state-of-the-art (SOTA) MMDBs.

2026-03-02T08:27:27Z Zepeng Liu Sheng Wang Shixun Huang Hailang Qiu Yuwei Peng Jiale Feng Shunan Liao Yushuai Ji Zhiyong Peng http://arxiv.org/abs/2603.01570v1 Adversarial Query Synthesis via Bayesian Optimization 2026-03-02T07:50:46Z

Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks.

2026-03-02T07:50:46Z Jeffrey Tao Yimeng Zeng Haydn Thomas Jones Natalie Maus Osbert Bastani Jacob R. Gardner Ryan Marcus http://arxiv.org/abs/2510.06377v3 Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data 2026-03-02T07:22:37Z

Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) incorporates task specification via task table prompting, (ii) tokenizes cells with table/column metadata, (iii) is pretrained via masked token prediction, and (iv) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experimental analyses show that RT's zero-shot transfer leverages task context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data. Code, models, data: https://github.com/snap-stanford/relational-transformer.

2025-10-07T18:51:51Z Accepted to ICLR 2026 Rishabh Ranjan Valter Hudovernik Mark Znidar Charilaos Kanatsoulis Roshan Upendra Mahmoud Mohammadi Joe Meyer Tom Palczewski Carlos Guestrin Jure Leskovec http://arxiv.org/abs/2603.01525v1 VectorMaton: Efficient Vector Search with Pattern Constraints via an Enhanced Suffix Automaton 2026-03-02T06:56:46Z

Approximate nearest neighbor search (ANNS) has become a cornerstone in modern vector database systems. Given a query vector, ANNS retrieves the closest vectors from a set of base vectors. In real-world applications, vectors are often accompanied by additional information, such as sequences or structured attributes, motivating the need for fine-grained vector search with constraints on this auxiliary data. Existing methods support attribute-based filtering or range-based filtering on categorical and numerical attributes, but they do not support pattern predicates over sequence attributes. In relational databases, predicates such as LIKE and CONTAINS are fundamental operators for filtering records based on substring patterns. As vector databases increasingly adopt SQL-style query interfaces, enabling pattern predicates over sequence attributes (e.g., texts and biological sequences) alongside vector similarity search becomes essential. In this paper, we formulate a novel problem: given a set of vectors each associated with a sequence, retrieve the nearest vectors whose sequences contain a given query pattern. To address this challenge, we propose VectorMaton, an automaton-based index that integrates pattern filtering with efficient vector search, while maintaining an index size comparable to the dataset size. Extensive experiments on real-world datasets demonstrate that VectorMaton consistently outperforms all baselines, achieving up to 10x higher query throughput at the same accuracy and up to 18x reduction in index size.
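
As a point of reference for the problem formulated above, the following is a minimal brute-force sketch in Python: it filters by substring containment and then ranks the survivors by Euclidean distance. It is a naive baseline, not VectorMaton's automaton-based index. The function name `pattern_constrained_knn` and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def pattern_constrained_knn(
    query: Sequence[float],
    pattern: str,
    vectors: Sequence[Sequence[float]],
    sequences: Sequence[str],
    k: int,
) -> List[int]:
    """Return indices of the k nearest vectors whose sequence contains `pattern`.

    Brute force: scan every record, keep those passing the substring
    predicate, then sort the survivors by exact Euclidean distance.
    """
    candidates = [
        (math.dist(query, vec), idx)
        for idx, (vec, seq) in enumerate(zip(vectors, sequences))
        if pattern in seq  # substring test, analogous to SQL CONTAINS
    ]
    candidates.sort()
    return [idx for _, idx in candidates[:k]]


# Toy data (hypothetical): four 2-d vectors, each paired with a short sequence.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
sequences = ["GGGT", "TTAC", "GGAC", "ACAC"]
print(pattern_constrained_knn((0.0, 0.0), "AC", vectors, sequences, k=2))  # [1, 2]
```

The scan costs time linear in the dataset size for every query, which is the cost an index that combines pattern filtering with vector search aims to avoid.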

2026-03-02T06:56:46Z Haoxuan Xie Siqiang Luo http://arxiv.org/abs/2603.01448v1 SEAnet: A Deep Learning Architecture for Data Series Similarity Search 2026-03-02T04:57:06Z

A key operation for massive data series collection analysis is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance degrades on datasets with high-frequency, weakly correlated, or excessively noisy series, or with other dataset-specific properties. In this work, we propose Deep Embedding Approximation (DEA), a novel family of data series summarization techniques based on deep neural networks. Moreover, we describe SEAnet, a novel architecture specifically designed for learning DEA, which introduces the Sum of Squares preservation property into the deep network design. We further enhance SEAnet with the SEAtrans encoder. Finally, we propose novel sampling strategies, SEAsam and SEAsamE, that allow SEAnet to train effectively on massive datasets. Comprehensive experiments on 7 diverse synthetic and real datasets verify the advantages of DEA learned using SEAnet in providing high-quality data series summarizations and similarity search results.

2026-03-02T04:57:06Z This paper was published in IEEE Transactions on Knowledge and Data Engineering (Volume: 35, Issue: 12, Page(s): 12972 - 12986, 01 December 2023). Date of Publication: 25 April 2023 IEEE Trans. Knowl. Data Eng. 35(12): 12972-12986 (2023) Qitong Wang Themis Palpanas 10.1109/TKDE.2023.3270264 http://arxiv.org/abs/2602.19167v2 S$^3$GND: An Effective Learning-Based Approach for Subgraph Similarity Search Under Generalized Neighbor Difference Semantics (Technical Report) 2026-03-02T04:18:31Z

Subgraph similarity search over large-scale graphs is a fundamental task that retrieves subgraphs similar to a given query graph from a data graph, and it plays a crucial role in real applications such as protein discovery, social network analysis, and recommendation systems. While prior works on subgraph similarity search studied various graph similarity metrics, in this paper, we propose a novel graph similarity semantics, \textit{generalized neighbor difference} (GND), that accounts for both the keyword-set relationships between vertices and edge-weight differences. We formulate the problem of \textit{subgraph similarity search under the generalized neighbor difference semantics} (S$^3$GND), which retrieves those subgraphs similar to a query graph $q$ under GND semantics. To efficiently tackle the S$^3$GND problem, we propose an effective learning-based approach, which constructs a keyword hypergraph from the data graph, and trains a \textit{hypergraph neural network} (HGNN) model to obtain high-quality keyword embedding representations. We design effective pruning strategies, \textit{keyword embedding MBR}, \textit{vertex-level ND lower bound}, and \textit{graph-level GND lower bound pruning}, to rule out false alarms of candidate vertices/subgraphs, and devise a tree-based indexing mechanism to facilitate efficient S$^3$GND query answering. We develop an efficient S$^3$GND query-processing algorithm that traverses the index, applies pruning strategies, and returns actual S$^3$GND answers. Finally, we conduct extensive experiments to verify the effectiveness and efficiency of our proposed S$^3$GND approach over both real and synthetic graphs.

2026-02-22T12:55:07Z Qi Wen Xiang Lian Nan Zhang Yutong Ye Mingsong Chen http://arxiv.org/abs/2507.10070v2 Breaking the Storage-Compute Bottleneck in Billion-Scale ANNS: A GPU-Driven Asynchronous I/O Framework 2026-03-02T03:08:28Z

With the advancement of information retrieval, recommendation systems, and Retrieval-Augmented Generation (RAG), Approximate Nearest Neighbor Search (ANNS) has gained widespread application due to its high performance and accuracy. While several disk-based ANNS systems have emerged to handle exponentially growing vector datasets, they suffer from suboptimal performance due to two inherent limitations: 1) failure to overlap SSD accesses with distance computation and 2) extended I/O latency caused by a suboptimal I/O stack. To address these challenges, we present FlashANNS, a GPU-accelerated out-of-core graph-based ANNS system built on I/O-compute overlapping. Our core insight lies in the synchronized orchestration of I/O and computation through three key innovations: 1) Dependency-relaxed asynchronous pipeline: FlashANNS decouples I/O-computation dependencies to fully overlap GPU distance calculations with SSD data transfers. 2) Warp-level concurrent SSD access: FlashANNS implements a lock-free I/O stack with warp-level concurrency control to reduce latency-induced overhead. 3) Computation-I/O balanced graph degree selection: FlashANNS selects graph degrees via lightweight compute-to-I/O ratio sampling, ensuring an optimal balance between computational load and storage access latency across different I/O bandwidth configurations. We implement FlashANNS and compare it with state-of-the-art out-of-core ANNS systems (SPANN, DiskANN) and a GPU-accelerated out-of-core ANNS system (FusionANNS). Experimental results demonstrate that at $\geq$95\% recall@10 accuracy, our method achieves 2.3-5.9$\times$ higher throughput compared to existing SOTA methods with a single SSD, and further attains 2.7-12.2$\times$ throughput improvement in multi-SSD configurations.
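
To make the benefit of I/O-compute overlapping concrete, here is a toy timing model in Python. It is only an illustrative sketch, not FlashANNS's GPU pipeline, and it rests on stated assumptions: one I/O channel, one compute unit, and reads that can be issued back-to-back without waiting for the previous step's distance computation (the dependency relaxation). The function names and the cost values are hypothetical.

```python
from typing import Sequence


def sequential_time(io: Sequence[float], compute: Sequence[float]) -> float:
    """Total time when each step reads from SSD, then computes, with no overlap."""
    return sum(io) + sum(compute)


def overlapped_time(io: Sequence[float], compute: Sequence[float]) -> float:
    """Total time of a two-stage pipeline where step i+1's read overlaps step i's compute.

    Assumes reads are issued back-to-back on a single I/O channel and that a
    step's compute can start once its own read and the previous compute finish.
    """
    if len(io) != len(compute):
        raise ValueError("io and compute must have the same number of steps")
    io_done = 0.0       # time at which the current step's read completes
    compute_done = 0.0  # time at which the current step's compute completes
    for io_cost, compute_cost in zip(io, compute):
        io_done += io_cost
        compute_done = max(compute_done, io_done) + compute_cost
    return compute_done


# Hypothetical per-step costs (arbitrary time units) for an I/O-bound search.
io_costs = [2.0, 2.0, 2.0]
compute_costs = [1.0, 1.0, 1.0]
print(sequential_time(io_costs, compute_costs))  # 9.0
print(overlapped_time(io_costs, compute_costs))  # 7.0
```

In this model the overlapped time approaches the larger of total I/O and total compute, which suggests why balancing the two (for example through the choice of graph degree) matters.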

2025-07-14T08:55:51Z Yang Xiao Mo Sun Ziyu Song Bing Tian Jie Zhang Jie Sun Zeke Wang http://arxiv.org/abs/2602.01701v2 Beyond Single-Modal Analytics: A Framework for Integrating Heterogeneous LLM-Based Query Systems for Multi-Modal Data 2026-03-02T01:05:12Z

With the increasing use of multi-modal data, semantic querying has become increasingly important in data management systems as a way to access and analyze such data. Because multi-modal data (text, image, video, etc.) is unstructured, most of its information lies in its semantics, which traditional database queries such as SQL cannot access. Given the power of Large Language Models (LLMs) in understanding semantics and processing natural language, several LLM-based semantic query systems have been proposed in recent years to support semantic querying over unstructured data. However, this rapid growth has produced a fragmented ecosystem. Applications face significant integration challenges due to (1) disparate APIs of different semantic query systems and (2) a fundamental trade-off between specialization and generality. Many semantic query systems are highly specialized, offering state-of-the-art performance within a single modality but struggling with multi-modal data. Conversely, some "all-in-one" systems handle multiple modalities but often exhibit suboptimal performance compared to their specialized counterparts in specific modalities. This paper introduces Meta Engine, a novel "query system on query systems" designed to resolve these challenges. Meta Engine is a unified semantic query engine that integrates heterogeneous, specialized LLM-based query systems. Its architecture comprises five key components: (1) a Natural Language (NL) Query Parser, (2) an Operator Generator, (3) a Query Router, (4) a set of Adapters, and (5) a Result Aggregator. In the evaluation, Meta Engine consistently outperforms all baselines, yielding 3--6x higher F1 in most cases and up to ~24x on specific datasets.

2026-02-02T06:16:04Z Ruyu Li Tinghui Zhang Haodi Ma Daisy Zhe Wang Yifan Wang http://arxiv.org/abs/2601.16409v2 Gen-DBA: Generative Database Agents 2026-03-02T00:49:00Z

Leveraging Machine Learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, spanning indexing techniques, selectivity estimation, and query optimization. However, the idea gained mainstream traction following the introduction of learned indexes in 2018, triggering a surge of research ranging from learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB optimization landscape is dominated by narrow specialist ML models that are small and trained on limited data. Each specialist ML model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, they fall short in real-world settings, where these factors can vary significantly and evolve over time. This leads to an exponential number of ML models with limited portability and generalization capability, thus limiting the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model for optimizing databases with agentic capabilities. This paper presents the vision for Gen-DBA, provides a sketch design of how to realize it, and highlights several research challenges that need to be addressed to fully realize Gen-DBA.

2026-01-23T02:55:42Z Yeasir Rayhan Walid G. Aref http://arxiv.org/abs/2505.20274v3 Probabilistic Kernel Function for Fast Angle Testing 2026-03-01T22:18:53Z

In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and adopts a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves 2.5x--3x higher queries-per-second (QPS) throughput compared to the widely used graph-based search algorithm HNSW.
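
For background on the Gaussian-projection approaches the abstract contrasts itself with, here is a minimal Python sketch of one well-known technique of that kind, sign random projections: for a Gaussian vector r, the signs of r·u and r·v disagree with probability θ/π, where θ is the angle between u and v. This is an assumed illustrative baseline, not the kernel functions proposed in the paper; the function name `estimate_angle` and its defaults are hypothetical.

```python
import math
import random
from typing import Sequence


def estimate_angle(
    u: Sequence[float],
    v: Sequence[float],
    num_projections: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate the angle (radians) between u and v with Gaussian sign projections.

    Each projection vector has i.i.d. standard normal entries; the fraction of
    projections on which u and v land on opposite sides estimates angle / pi.
    """
    rng = random.Random(seed)
    dim = len(u)
    disagreements = 0
    for _ in range(num_projections):
        r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proj_u = sum(a * b for a, b in zip(r, u))
        proj_v = sum(a * b for a, b in zip(r, v))
        if (proj_u >= 0) != (proj_v >= 0):
            disagreements += 1
    return math.pi * disagreements / num_projections


# Orthogonal vectors: the true angle is pi / 2 (about 1.571).
print(estimate_angle((1.0, 0.0), (0.0, 1.0)))
```

The estimate is noisy for small numbers of projections and sharpens only as more are used, which is the kind of dependence on many projection vectors that the abstract says its kernels avoid.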

2025-05-26T17:53:28Z ICLR 2026 Oral, source code available at https://github.com/KejingLu-810/KS Kejing Lu Chuan Xiao Yoshiharu Ishikawa http://arxiv.org/abs/2505.19025v2 SQUiD: Synthesizing Relational Databases from Unstructured Text 2026-03-01T22:00:50Z

Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets. Our code and datasets are publicly available at: https://github.com/Mushtari-Sadia/SQUiD.

2025-05-25T08:20:49Z Mushtari Sadia Zhenning Yang Yunming Xiao Ang Chen Amrita Roy Chowdhury http://arxiv.org/abs/2603.06660v1 Approximate Nearest Neighbor Search for Modern AI: A Projection-Augmented Graph Approach 2026-03-01T21:57:42Z

Approximate Nearest Neighbor Search (ANNS) is fundamental to modern AI applications. Most existing solutions optimize query efficiency but fail to align with the practical requirements of modern workloads. In this paper, we outline six critical demands of modern AI applications: high query efficiency, fast indexing, low memory footprint, scalability to high dimensionality, robustness across varying retrieval sizes, and support for online insertions. To satisfy all these demands, we introduce Projection-Augmented Graph (PAG), a new ANNS framework that integrates projection techniques into a graph index. PAG reduces unnecessary exact distance computations through asymmetric comparisons between exact and approximate distances, guided by projection-based statistical tests. Three key components are designed and integrated into the graph index to optimize indexing and searching. Experiments on six modern datasets demonstrate that PAG consistently achieves superior queries-per-second (QPS)-recall performance -- up to 5x faster than HNSW -- while offering fast indexing and a moderate memory footprint. PAG remains robust as dimensionality and retrieval size increase and naturally supports online insertions.

2026-03-01T21:57:42Z Source code is available at https://github.com/KejingLu-810/PAG/ Kejing Lu Zhenpeng Pan Jianbin Qin Yoshiharu Ishikawa Chuan Xiao http://arxiv.org/abs/2603.00921v1 A Framework for Transparent Reporting of Data Quality Analysis Across the Clinical Electronic Health Record Data Lifecycle 2026-03-01T04:48:46Z

Data quality (DQ) and transparency of secondary data are critical factors that delay the adoption of clinical AI models and affect clinician trust in them. Many DQ studies fail to clarify where, along the lifecycle, quality checks occur, leading to uncertainty about provenance and fitness for reuse. This study develops a framework for transparent reporting of DQ assessments across the clinical electronic health record (EHR) data lifecycle. The reporting framework was developed through iterative analysis to identify actors and phases of the clinical data lifecycle. The framework distinguishes between data-generating organizations and data-receiving organizations to allow users to map DQ parameters to stages across the data lifecycle. The framework defines 5 key lifecycle phases and multiple actors. When applied to a real-world dataset, the framework demonstrated applicability in revealing where DQ issues may originate. The framework provides a structured approach for reporting DQ assessments, which can enhance transparency regarding data fitness for reuse, supporting reliable clinical research, AI model development, and internal organisational governance. This work provides practical guidance for researchers to understand data provenance and for organisations to target DQ improvement efforts across the data lifecycle.

2026-03-01T04:48:46Z 6 pages, 1 figure. Submitted to IoS Press, Studies in Health Technology and Informatics as conference proceedings for AIDH Health Innovation Community Conference Ethics Approval: Royal Melbourne Institute of Technology #26603 Melinda Wassell Kerryn Butler-Henderson Karin Verspoor http://arxiv.org/abs/2603.00866v1 A Tree-Structured Two-Phase Commit Framework for OceanBase: Optimizing Scalability and Consistency 2026-03-01T02:02:27Z

Modern distributed databases face challenges in achieving transactional consistency across distributed partitions. Traditional two-phase commit (2PC) protocols incur high coordination overhead and latency, and require complex recovery for dynamic partition transfers. This paper introduces a novel tree-shaped 2PC framework for OceanBase that leverages single-machine log streams to address these challenges through three innovations. First, we propose log streams as atomic participants, replacing partition-level coordination. By treating each log stream as the commit unit, a transaction spanning $N$ co-located partitions interacts with one participant, reducing coordination overhead by orders of magnitude (e.g., 99 percent reduction for $N=100$). Second, we design a tree-shaped 2PC protocol with a coordinator-rooted DAG topology that dynamically handles partition transfers by recursively constructing commit trees. When a partition migrates during a transaction, the protocol embeds migration contexts as leaf nodes, eliminating explicit participant list updates, resolving circular dependencies, and ensuring linearizable commits under topology changes. Third, we introduce prepare-unknown and trans-unknown states to prevent consistency violations when participants lose context. These states signal uncertainty during retries, avoiding erroneous aborts from so-called lying participants while isolating users from ambiguity. Experimental evaluation demonstrates performance approaching that of single-machine transactions, with reduced latency and bandwidth consumption, validating the framework's effectiveness for modern distributed databases.
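
The 99 percent figure follows from counting 2PC participants: with partition-level coordination a transaction over $N$ co-located partitions has $N$ participants, while with log-stream-level coordination it has one, a reduction of $(N-1)/N$. A minimal Python sketch of that arithmetic follows, assuming overhead is measured simply as the number of participants; the function name `participant_reduction` is hypothetical.

```python
def participant_reduction(n_partitions: int, n_log_streams: int = 1) -> float:
    """Fractional drop in 2PC participants when log streams replace partitions.

    Assumes coordination overhead is proportional to the participant count:
    n_partitions participants before, n_log_streams participants after.
    """
    if n_partitions <= 0 or not 0 < n_log_streams <= n_partitions:
        raise ValueError("require 0 < n_log_streams <= n_partitions")
    return (n_partitions - n_log_streams) / n_partitions


# The abstract's example: 100 co-located partitions on one log stream.
print(f"{participant_reduction(100):.0%}")  # 99%
```

When the partitions are spread over several log streams the reduction is smaller, for example 100 partitions on 4 log streams gives 96 percent under the same counting assumption.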

2026-03-01T02:02:27Z Quanqing Xu Chen Qian Chuanhui Yang Fanyu Kong Guixiang Liu Fusheng Han Zixiang Zhai