https://arxiv.org/api/Z7qoTWwgT12+2E5Hp1ukP+1L8XE
Feed updated: 2026-03-21T05:24:19Z

http://arxiv.org/abs/2603.01779v1
Disk-Resident Graph ANN Search: An Experimental Evaluation
Updated: 2026-03-02T12:05:09Z
As data volumes grow while memory capacity remains limited, disk-resident graph-based approximate nearest neighbor (ANN) methods have become a practical alternative to memory-resident designs, shifting the bottleneck from computation to disk I/O. However, since their technical designs diverge widely across storage, layout, and execution paradigms, a systematic understanding of their fundamental performance trade-offs remains elusive. This paper presents a comprehensive experimental study of disk-resident graph-based ANN methods. First, we decompose such systems into five key technical components, i.e., storage strategy, disk layout, cache management, query execution, and update mechanism, and build a unified taxonomy of existing designs across these components. Second, we conduct fine-grained evaluations of representative strategies for each technical component to analyze the trade-offs in throughput, recall, and resource utilization. Third, we perform comprehensive end-to-end experiments and parameter-sensitivity analyses to evaluate overall system performance under diverse configurations. Fourth, our study reveals several non-obvious findings: (1) vector dimensionality fundamentally reshapes component effectiveness, necessitating dimension-aware design; (2) existing layout strategies exhibit surprisingly low I/O utilization (less than or equal to 15%); (3) page size critically affects feasibility and efficiency, with smaller pages preferred when layouts are carefully optimized; and (4) update strategies present clear workload-dependent trade-offs between in-place and out-of-place designs.
Based on these findings, we derive practical guidelines for system design and configuration, and outline promising directions for future research.
Published: 2026-03-02T12:05:09Z
Authors: Xiaoyu Chen, Jinxiu Qu, Yitong Song, Shuhang Lu, Huiling Li, Minghui Jiang, Wei Zhou, Jianliang Xu, Xuanhe Zhou, Fan Wu

http://arxiv.org/abs/2603.01598v1
Graph-centric Cross-model Data Integration and Analytics in a Unified Multi-model Database
Updated: 2026-03-02T08:27:27Z
Graph-centric cross-model data integration and analytics (GCDIA) refer to tasks that leverage the graph model as a central paradigm to integrate relevant information across heterogeneous data models, such as relational and document, and subsequently perform complex analytics such as regression and similarity computation. As modern applications generate increasingly diverse data and move beyond simple retrieval toward advanced analytical objectives (e.g., prediction and recommendation), GCDIA has become increasingly important. Existing multi-model databases (MMDBs) struggle to efficiently support both integration (GCDI) and analytics (GCDA) in GCDIA. They typically separate graph processing from other models without global optimization for GCDI, while relying on tuple-at-a-time execution for GCDA, leading to limited performance and scalability. To address these limitations, we propose GredoDB, a unified MMDB that natively supports storing graph, relational, and document models, while efficiently processing GCDIA. Specifically, we design 1) topology- and attribute-aware graph operators for efficient predicate-aware traversal, 2) a unified GCDI optimization framework to exploit cross-model correlations, and 3) a parallel GCDA architecture that materializes intermediate results for operator-level execution.
Experiments on the widely adopted multi-model benchmark M2Bench demonstrate that, in terms of response time, GredoDB achieves up to 107.89 times and an average of 10.89 times speedup on GCDI, and up to 356.72 times and an average of 37.79 times on GCDA, compared to state-of-the-art (SOTA) MMDBs.
Published: 2026-03-02T08:27:27Z
Authors: Zepeng Liu, Sheng Wang, Shixun Huang, Hailang Qiu, Yuwei Peng, Jiale Feng, Shunan Liao, Yushuai Ji, Zhiyong Peng

http://arxiv.org/abs/2603.01570v1
Adversarial Query Synthesis via Bayesian Optimization
Updated: 2026-03-02T07:50:46Z
Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks.
Published: 2026-03-02T07:50:46Z
Authors: Jeffrey Tao, Yimeng Zeng, Haydn Thomas Jones, Natalie Maus, Osbert Bastani, Jacob R. Gardner, Ryan Marcus

http://arxiv.org/abs/2510.06377v3
Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data
Updated: 2026-03-02T07:22:37Z
Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples.
RT (i) incorporates task specification via task table prompting, (ii) tokenizes cells with table/column metadata, (iii) is pretrained via masked token prediction, and (iv) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experimental analyses show that RT's zero-shot transfer leverages task context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data. Code, models, data: https://github.com/snap-stanford/relational-transformer.
Published: 2025-10-07T18:51:51Z
Comment: Accepted to ICLR 2026
Authors: Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec

http://arxiv.org/abs/2603.01525v1
VectorMaton: Efficient Vector Search with Pattern Constraints via an Enhanced Suffix Automaton
Updated: 2026-03-02T06:56:46Z
Approximate nearest neighbor search (ANNS) has become a cornerstone in modern vector database systems. Given a query vector, ANNS retrieves the closest vectors from a set of base vectors. In real-world applications, vectors are often accompanied by additional information, such as sequences or structured attributes, motivating the need for fine-grained vector search with constraints on this auxiliary data. Existing methods support attribute-based filtering or range-based filtering on categorical and numerical attributes, but they do not support pattern predicates over sequence attributes. In relational databases, predicates such as LIKE and CONTAINS are fundamental operators for filtering records based on substring patterns.
As vector databases increasingly adopt SQL-style query interfaces, enabling pattern predicates over sequence attributes (e.g., texts and biological sequences) alongside vector similarity search becomes essential. In this paper, we formulate a novel problem: given a set of vectors each associated with a sequence, retrieve the nearest vectors whose sequences contain a given query pattern. To address this challenge, we propose VectorMaton, an automaton-based index that integrates pattern filtering with efficient vector search, while maintaining an index size comparable to the dataset size. Extensive experiments on real-world datasets demonstrate that VectorMaton consistently outperforms all baselines, achieving up to 10x higher query throughput at the same accuracy and up to 18x reduction in index size.
Published: 2026-03-02T06:56:46Z
Authors: Haoxuan Xie, Siqiang Luo

http://arxiv.org/abs/2603.01448v1
SEAnet: A Deep Learning Architecture for Data Series Similarity Search
Updated: 2026-03-02T04:57:06Z
A key operation for massive data series collection analysis is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance lags under high-frequency, weakly correlated, excessively noisy, or other dataset-specific properties. In this work, we propose Deep Embedding Approximation (DEA), a novel family of data series summarization techniques based on deep neural networks. Moreover, we describe SEAnet, a novel architecture especially designed for learning DEA, that introduces the Sum of Squares preservation property into the deep network design. We further enhance SEAnet with the SEAtrans encoder. Finally, we propose novel sampling strategies, SEAsam and SEAsamE, that allow SEAnet to effectively train on massive datasets.
Comprehensive experiments on 7 diverse synthetic and real datasets verify the advantages of DEA learned using SEAnet in providing high-quality data series summarizations and similarity search results.
Published: 2026-03-02T04:57:06Z
Comment: This paper was published in IEEE Transactions on Knowledge and Data Engineering (Volume: 35, Issue: 12, Page(s): 12972 - 12986, 01 December 2023). Date of Publication: 25 April 2023
Journal reference: IEEE Trans. Knowl. Data Eng. 35(12): 12972-12986 (2023)
Authors: Qitong Wang, Themis Palpanas
DOI: 10.1109/TKDE.2023.3270264

http://arxiv.org/abs/2602.19167v2
S$^3$GND: An Effective Learning-Based Approach for Subgraph Similarity Search Under Generalized Neighbor Difference Semantics (Technical Report)
Updated: 2026-03-02T04:18:31Z
Subgraph similarity search over large-scale graphs is a fundamental task that retrieves subgraphs similar to a given query graph from a data graph, and it plays a crucial role in real applications such as protein discovery, social network analysis, and recommendation systems. While prior works on subgraph similarity search studied various graph similarity metrics, in this paper, we propose a novel graph similarity semantics, \textit{generalized neighbor difference} (GND), that accounts for both the keyword-set relationships between vertices and edge-weight differences. We formulate the problem of \textit{subgraph similarity search under the generalized neighbor difference semantics} (S$^3$GND), which retrieves those subgraphs similar to a query graph $q$ under GND semantics. To efficiently tackle the S$^3$GND problem, we propose an effective learning-based approach, which constructs a keyword hypergraph from the data graph, and trains a \textit{hypergraph neural network} (HGNN) model to obtain high-quality keyword embedding representations.
We design effective pruning strategies, \textit{keyword embedding MBR}, \textit{vertex-level ND lower bound}, and \textit{graph-level GND lower bound pruning}, to rule out false alarms of candidate vertices/subgraphs, and devise a tree-based indexing mechanism to facilitate efficient S$^3$GND query answering. We develop an efficient S$^3$GND query-processing algorithm that traverses the index, applies pruning strategies, and returns actual S$^3$GND answers. Finally, we conduct extensive experiments to verify the effectiveness and efficiency of our proposed S$^3$GND approach over both real and synthetic graphs.
Published: 2026-02-22T12:55:07Z
Authors: Qi Wen, Xiang Lian, Nan Zhang, Yutong Ye, Mingsong Chen

http://arxiv.org/abs/2507.10070v2
Breaking the Storage-Compute Bottleneck in Billion-Scale ANNS: A GPU-Driven Asynchronous I/O Framework
Updated: 2026-03-02T03:08:28Z
With the advancement of information retrieval, recommendation systems, and Retrieval-Augmented Generation (RAG), Approximate Nearest Neighbor Search (ANNS) has gained widespread application due to its high performance and accuracy. While several disk-based ANNS systems have emerged to handle exponentially growing vector datasets, they suffer from suboptimal performance due to two inherent limitations: 1) failing to overlap SSD accesses with distance computation processes and 2) extended I/O latency caused by a suboptimal I/O stack. To address these challenges, we present FlashANNS, a GPU-accelerated out-of-core graph-based ANNS system through I/O-compute overlapping. Our core insight lies in the synchronized orchestration of I/O and computation through three key innovations: 1) Dependency-relaxed asynchronous pipeline: FlashANNS decouples I/O-computation dependencies to fully overlap GPU distance calculations with SSD data transfers. 2) Warp-level concurrent SSD access: FlashANNS implements a lock-free I/O stack with warp-level concurrency control, to reduce the latency-induced time overhead.
3) Computation-I/O balanced graph degree selection: FlashANNS selects graph degrees via lightweight compute-to-I/O ratio sampling, ensuring optimal balance between computational load and storage access latency across different I/O bandwidth configurations. We implement FlashANNS and compare it with state-of-the-art out-of-core ANNS systems (SPANN, DiskANN) and a GPU-accelerated out-of-core ANNS system (FusionANNS). Experimental results demonstrate that at $\geq$95\% recall@10 accuracy, our method achieves 2.3-5.9$\times$ higher throughput compared to existing SOTA methods with a single SSD, and further attains 2.7-12.2$\times$ throughput improvement in multi-SSD configurations.
Published: 2025-07-14T08:55:51Z
Authors: Yang Xiao, Mo Sun, Ziyu Song, Bing Tian, Jie Zhang, Jie Sun, Zeke Wang

http://arxiv.org/abs/2602.01701v2
Beyond Single-Modal Analytics: A Framework for Integrating Heterogeneous LLM-Based Query Systems for Multi-Modal Data
Updated: 2026-03-02T01:05:12Z
With the increasing use of multi-modal data, semantic querying, an important way to access and analyze such data, is increasingly in demand in data management systems. Because multi-modal data (text, image, video, etc.) is unstructured, most of its information is hidden in its semantics, which cannot be accessed by traditional database queries like SQL. Given the power of Large Language Models (LLMs) in understanding semantics and processing natural language, in recent years several LLM-based semantic query systems have been proposed to support semantic querying over unstructured data. However, this rapid growth has produced a fragmented ecosystem. Applications face significant integration challenges due to (1) disparate APIs of different semantic query systems and (2) a fundamental trade-off between specialization and generality. Many semantic query systems are highly specialized, offering state-of-the-art performance within a single modality but struggling with multi-modal data.
Conversely, some "all-in-one" systems handle multiple modalities but often exhibit suboptimal performance compared to their specialized counterparts in specific modalities. This paper introduces Meta Engine, a novel "query system on query systems", designed to resolve these challenges. Meta Engine is a unified semantic query engine that integrates heterogeneous, specialized LLM-based query systems. Its architecture comprises five key components: (1) a Natural Language (NL) Query Parser, (2) an Operator Generator, (3) a Query Router, (4) a set of Adapters, and (5) a Result Aggregator. In the evaluation, Meta Engine consistently outperforms all baselines, yielding 3--6x higher F1 in most cases and up to ~24x on specific datasets.
Published: 2026-02-02T06:16:04Z
Authors: Ruyu Li, Tinghui Zhang, Haodi Ma, Daisy Zhe Wang, Yifan Wang

http://arxiv.org/abs/2601.16409v2
Gen-DBA: Generative Database Agents
Updated: 2026-03-02T00:49:00Z
Leveraging Machine Learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, spanning indexing techniques, selectivity estimation, and query optimization. However, the idea has gained mainstream traction following the introduction of learned indexes in 2018, triggering a surge of research spanning learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB optimization landscape is dominated by narrow specialist ML models that are small and are trained on limited training data. Each specialist ML model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, they fall short in real-world settings, where these factors can vary significantly and evolve over time.
This leads to an exponential number of ML models with limited portability and generalization capability, thus limiting the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model for optimizing databases with agentic capabilities. This paper presents the vision for Gen-DBA, provides a sketch design of how to realize it, and highlights several research challenges that need to be addressed to fully realize Gen-DBA.
Published: 2026-01-23T02:55:42Z
Authors: Yeasir Rayhan, Walid G. Aref

http://arxiv.org/abs/2505.20274v3
Probabilistic Kernel Function for Fast Angle Testing
Updated: 2026-03-01T22:18:53Z
In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and adopts a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5x--3x higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.
Published: 2025-05-26T17:53:28Z
Comment: ICLR 2026 Oral, source code available at https://github.com/KejingLu-810/KS
Authors: Kejing Lu, Chuan Xiao, Yoshiharu Ishikawa

http://arxiv.org/abs/2505.19025v2
SQUiD: Synthesizing Relational Databases from Unstructured Text
Updated: 2026-03-01T22:00:50Z
Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents.
To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets. Our code and datasets are publicly available at: https://github.com/Mushtari-Sadia/SQUiD.
Published: 2025-05-25T08:20:49Z
Authors: Mushtari Sadia, Zhenning Yang, Yunming Xiao, Ang Chen, Amrita Roy Chowdhury

http://arxiv.org/abs/2603.06660v1
Approximate Nearest Neighbor Search for Modern AI: A Projection-Augmented Graph Approach
Updated: 2026-03-01T21:57:42Z
Approximate Nearest Neighbor Search (ANNS) is fundamental to modern AI applications. Most existing solutions optimize query efficiency but fail to align with the practical requirements of modern workloads. In this paper, we outline six critical demands of modern AI applications: high query efficiency, fast indexing, low memory footprint, scalability to high dimensionality, robustness across varying retrieval sizes, and support for online insertions. To satisfy all these demands, we introduce Projection-Augmented Graph (PAG), a new ANNS framework that integrates projection techniques into a graph index. PAG reduces unnecessary exact distance computations through asymmetric comparisons between exact and approximate distances as guided by projection-based statistical tests. Three key components are designed and unified to the graph index to optimize indexing and searching. Experiments on six modern datasets demonstrate that PAG consistently achieves superior query per second (QPS)-recall performance -- up to 5x faster than HNSW -- while offering fast indexing speed and moderate memory footprint.
PAG remains robust as dimensionality and retrieval size increase and naturally supports online insertions.
Published: 2026-03-01T21:57:42Z
Comment: Source code is available at https://github.com/KejingLu-810/PAG/
Authors: Kejing Lu, Zhenpeng Pan, Jianbin Qin, Yoshiharu Ishikawa, Chuan Xiao

http://arxiv.org/abs/2603.00921v1
A Framework for Transparent Reporting of Data Quality Analysis Across the Clinical Electronic Health Record Data Lifecycle
Updated: 2026-03-01T04:48:46Z
Data quality (DQ) and transparency of secondary data are critical factors that delay the adoption of clinical AI models and affect clinician trust in them. Many DQ studies fail to clarify where, along the lifecycle, quality checks occur, leading to uncertainty about provenance and fitness for reuse. This study develops a framework for transparent reporting of DQ assessments across the clinical electronic health record (EHR) data lifecycle. The reporting framework was developed through iterative analysis to identify actors and phases of the clinical data lifecycle. The framework distinguishes between data-generating organizations and data-receiving organizations to allow users to map DQ parameters to stages across the data lifecycle. The framework defines 5 key lifecycle phases and multiple actors. When applied to the real-world dataset, the framework demonstrated applicability in revealing where DQ issues may originate. The framework provides a structured approach for reporting DQ assessments, which can enhance transparency regarding data fitness for reuse, supporting reliable clinical research, AI model development, and internal organisational governance. This work provides practical guidance for researchers to understand data provenance and for organisations to target DQ improvement efforts across the data lifecycle.
Published: 2026-03-01T04:48:46Z
Comment: 6 pages, 1 figure.
Submitted to IoS Press, Studies in Health Technology and Informatics as conference proceedings for AIDH Health Innovation Community Conference. Ethics Approval: Royal Melbourne Institute of Technology #26603
Authors: Melinda Wassell, Kerryn Butler-Henderson, Karin Verspoor

http://arxiv.org/abs/2603.00866v1
A Tree-Structured Two-Phase Commit Framework for OceanBase: Optimizing Scalability and Consistency
Updated: 2026-03-01T02:02:27Z
Modern distributed databases face challenges in achieving transactional consistency across distributed partitions. Traditional two-phase commit (2PC) protocols incur high coordination overhead and latency, and require complex recovery for dynamic partition transfers. This paper introduces a novel tree-shaped 2PC framework for OceanBase that leverages single-machine log streams to address these challenges through three innovations. First, we propose log streams as atomic participants, replacing partition-level coordination. By treating each log stream as the commit unit, a transaction spanning $N$ co-located partitions interacts with one participant, reducing coordination overhead by orders of magnitude (e.g., 99 percent reduction for $N=100$). Second, we design a tree-shaped 2PC protocol with coordinator-rooted DAG topology that dynamically handles partition transfers by recursively constructing commit trees. When a partition migrates during a transaction, the protocol embeds migration contexts as leaf nodes, eliminating explicit participant list updates, resolving circular dependencies, and ensuring linearizable commits under topology changes. Third, we introduce prepare-unknown and trans-unknown states to prevent consistency violations when participants lose context. These states signal uncertainty during retries, avoiding erroneous aborts from so-called lying participants while isolating users from ambiguity.
Experimental evaluation demonstrates performance approaching that of single-machine transactions, with reduced latency and bandwidth consumption, validating the framework's effectiveness for modern distributed databases.
Published: 2026-03-01T02:02:27Z
Authors: Quanqing Xu, Chen Qian, Chuanhui Yang, Fanyu Kong, Guixiang Liu, Fusheng Han, Zixiang Zhai
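
The "99 percent reduction for $N=100$" quoted in the OceanBase abstract above can be reproduced with simple arithmetic: if a transaction touching $N$ co-located partitions talks to one log-stream participant instead of $N$ partition participants, the participant count drops by $1 - 1/N$. The sketch below is only an illustration. The function name `participant_reduction` is hypothetical, and the assumption that coordination overhead scales linearly with the number of participants is ours, not a claim taken from the paper.

```python
def participant_reduction(n_partitions: int, n_log_streams: int = 1) -> float:
    """Fractional drop in 2PC participants when log streams replace partitions.

    Assumption (not stated in the abstract): coordination overhead is
    proportional to the number of participants in the commit.
    """
    if n_partitions <= 0 or not 0 < n_log_streams <= n_partitions:
        raise ValueError("need 0 < n_log_streams <= n_partitions")
    return 1.0 - n_log_streams / n_partitions


# Reduction for a few partition counts, all hosted on a single log stream.
for n in (10, 100, 1000):
    print(f"N={n}: {participant_reduction(n):.1%} fewer participants")
```

Under this linear assumption, the benefit grows with $N$ but shrinks as the partitions spread over more log streams, for example `participant_reduction(100, n_log_streams=4)` gives a 96 percent drop rather than 99.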