http://arxiv.org/abs/2603.00537v1 Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions 2026-02-28T08:21:44Z Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack's impact and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes. 
2026-02-28T08:21:44Z SIGMOD 2026 Atsuki Sato Martin Aumüller Yusuke Matsui http://arxiv.org/abs/2509.03226v2 BAMG: A Block-Aware Monotonic Graph Index for Disk-Based Approximate Nearest Neighbor Search 2026-02-28T07:38:21Z Approximate Nearest Neighbor Search (ANNS) over high-dimensional vectors is a foundational problem in databases, where disk I/O often emerges as the dominant performance bottleneck at scale. To accelerate search, graph-based indexes rely on a proximity graph, where nodes represent vectors and edges guide the traversal toward the target. However, existing graph indexing solutions for disk-based ANNS typically either optimize the storage layout for a given graph or construct the graph independently of the storage layout, thus overlooking their interaction. In this paper, we bridge this gap by proposing the Block-aware Monotonic Relative Neighborhood Graph (BMRNG), theoretically guaranteeing the existence of I/O monotonic search paths. The core idea is to align the graph topology with the data placement by jointly considering both geometric distance and storage layout for edge selection. To address the scalability challenge of BMRNG construction, we further develop a practical and efficient variant, the Block-Aware Monotonic Graph (BAMG), which can be constructed in linear time from a monotonic graph considering the storage layout. BAMG integrates block-aware edge pruning with a decoupled storage design that separates raw vectors from the graph index, thereby maximizing block utilization and minimizing redundant disk reads. Additionally, we design a multi-layer navigation graph for adaptive and efficient query entry, along with a block-first search algorithm that prioritizes intra-block traversal to fully exploit each disk I/O operation. Extensive experiments on real-world datasets show that BAMG can outperform state-of-the-art methods in search performance. 
2025-09-03T11:33:31Z Huiling Li Xin Huang Byron Choi Jianliang Xu http://arxiv.org/abs/2603.00509v1 COLE$^+$: Towards Practical Column-based Learned Storage for Blockchain Systems 2026-02-28T07:13:00Z Blockchain provides a decentralized and tamper-resistant ledger for securely recording transactions across a network of untrusted nodes. While its transparency and integrity are beneficial, the substantial storage requirements for maintaining a complete transaction history present significant challenges. For example, Ethereum nodes require around 23TB of storage, with an annual growth rate of 4TB. Prior studies have employed various strategies to mitigate the storage challenges. Notably, COLE significantly reduces storage size and improves throughput by adopting a column-based design that incorporates a learned index, effectively eliminating data duplication in the storage layer. However, this approach has limitations in supporting chain reorganization during blockchain forks and state pruning to minimize storage overhead. In this paper, we propose COLE$^+$, an enhanced storage solution designed to address these limitations. COLE$^+$ incorporates a novel rewind-supported in-memory tree structure for handling chain reorganization, leveraging content-defined chunking (CDC) to maintain a consistent hash digest for each block. For on-disk storage, a new two-level Merkle Hash Tree (MHT) structure, called prunable version tree, is developed to facilitate efficient state pruning. Both theoretical and empirical analyses show the effectiveness of COLE$^+$ and its potential for practical application in real-world blockchain systems. 2026-02-28T07:13:00Z Ce Zhang Cheng Xu Haibo Hu Jianliang Xu http://arxiv.org/abs/2603.00448v1 Semijoins of Annotated Relations 2026-02-28T04:05:43Z The semijoin operation is a fundamental operation of relational algebra that has been extensively used in query processing. 
Furthermore, semijoins have been used to formulate desirable properties of acyclic schemas; in particular, a schema is acyclic if and only if it has a full reducer, i.e., a sequence of semijoins that converts a given collection of relations to a globally consistent collection of relations. In recent years, the study of acyclicity has been extended to annotated relations, where the annotations are values from some positive commutative monoid. So far, however, it has not been known if the characterization of acyclicity in terms of full reducers extends to annotated relations. Here, we develop a theory of semijoins of annotated relations. To this effect, we first introduce the notion of a semijoin function on a monoid and then characterize the positive commutative monoids for which a semijoin function exists. After this, we introduce the notion of a full reducer for a schema on a monoid and show that the following is true for every positive commutative monoid that has the inner consistency property: a schema is acyclic if and only if it has a full reducer on that monoid. 2026-02-28T04:05:43Z 21 pages Phokion G. Kolaitis http://arxiv.org/abs/2504.21291v2 Efficiency of Analysis of Transitive Relations using Query-Driven, Ground-and-Solve, and Fact-Driven Inference 2026-02-28T03:06:52Z Logic rules allow analysis of complex relationships to be expressed easily, especially for transitive relations in critical applications. However, understanding and predicting the efficiency of different inference methods remain challenging, even for the simplest rules given different kinds of input data. This paper analyzes the efficiency of all three types of well-known inference methods -- query-driven, ground-and-solve, and fact-driven -- along with their respective optimizations, and compares them with optimal complexities for the first time, for analyzing transitive graph relations. We also experiment with rule systems widely considered to have the best performance. 
We analyze all well-known rule variants and widely varying input graphs. The results include precisely calculated optimal time complexities; comparative analysis across different inference methods, rule variants, and graph types; confirmation with performance experiments; as well as discovery of a performance bug. 2025-04-30T03:55:48Z Yanhong A. Liu John Idogun Scott D. Stoller Yi Tong http://arxiv.org/abs/2602.24271v1 NSHEDB: Noise-Sensitive Homomorphic Encrypted Database Query Engine 2026-02-27T18:41:10Z Homomorphic encryption (HE) enables computations directly on encrypted data, offering strong cryptographic guarantees for secure and privacy-preserving data storage and query execution. However, despite its theoretical power, practical adoption of HE in database systems remains limited due to extreme ciphertext expansion, memory overhead, and the computational cost of bootstrapping, which resets noise levels for correctness. This paper presents NSHEDB, a secure query processing engine designed to address these challenges at the system architecture level. NSHEDB uses word-level leveled HE (LHE) based on the BFV scheme to minimize ciphertext expansion and avoid costly bootstrapping. It introduces novel techniques for executing equality, range, and aggregation operations using purely homomorphic computation, without transciphering between different HE schemes (e.g., CKKS/BFV/TFHE) or relying on trusted hardware. Additionally, it incorporates a noise-aware query planner to extend computation depth while preserving security guarantees. We implement and evaluate NSHEDB on real-world database workloads (TPC-H) and show that it achieves 20x-1370x speedup and a 73x storage reduction compared to state-of-the-art HE-based systems, while upholding 128-bit security in a semi-honest model with no key release or trusted components. 
2026-02-27T18:41:10Z Boram Jung Yuliang Li Hung-Wei Tseng http://arxiv.org/abs/2601.06940v2 VISTA: Knowledge-Driven Vessel Trajectory Imputation with Repair Provenance 2026-02-27T14:17:03Z Repairing incomplete trajectory data is essential for downstream spatio-temporal applications. Yet, existing repair methods focus solely on reconstruction without documenting the reasoning behind repair decisions, undermining trust in safety-critical applications where repaired trajectories affect operational decisions, such as in maritime anomaly detection and route planning. We introduce repair provenance - structured, queryable metadata that documents the full reasoning chain behind each repair - which transforms imputation from pure data recovery into a task that supports downstream decision-making. We propose VISTA (knowledge-driven interpretable vessel trajectory imputation), a framework that reliably equips repaired trajectories with repair provenance by grounding LLM reasoning in data-verified knowledge. Specifically, we formalize Structured Data-derived Knowledge (SDK), a knowledge model whose data-verifiable components can be validated against real data and used to anchor and constrain LLM-generated explanations. We organize SDK in a Structured Data-derived Knowledge Graph (SD-KG) and establish a data-knowledge-data loop for extraction, validation, and incremental maintenance over large-scale AIS data. A workflow management layer with parallel scheduling, fault tolerance, and redundancy control ensures consistent and efficient end-to-end processing. Experiments on two large-scale AIS datasets show that VISTA achieves state-of-the-art accuracy, improving over baselines by 5-91% and reducing inference time by 51-93%, while producing repair provenance, whose interpretability is further validated through a case study and an interactive demo system. 2026-01-11T15:02:28Z 24 pages, 14 figures, 4 algorithms, 8 tables. 
Code available at https://github.com/hyLiu1994/VISTA Hengyu Liu Tianyi Li Haoyu Wang Kristian Torp Tiancheng Zhang Yushuai Li Christian S. Jensen http://arxiv.org/abs/2602.23999v1 GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search 2026-02-27T13:23:30Z Approximate nearest neighbor search (ANNS) on GPUs is gaining increasing popularity for modern retrieval and recommendation workloads that operate over massive high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently yet often need many probes for high recall, straining memory bandwidth and compute. Aiming to simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirement for GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel to achieve high throughput with high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS Library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint. For Recall approximately equal to 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA, while also constructing indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers on average over 2.7x higher throughput while avoiding accessing the raw vectors for reranking. 
2026-02-27T13:23:30Z Jifan Shi Jianyang Gao James Xia Tamás Béla Fehér Cheng Long http://arxiv.org/abs/2602.15909v3 Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis 2026-02-27T12:24:11Z Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a flow matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for this work, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent. 2026-02-16T14:48:24Z 24 pages, 3 figures. 
Published as a conference paper at ICLR 2026 The Fourteenth International Conference on Learning Representations (ICLR 2026) Pengfei Zhang Tianxin Xie Minghao Yang Li Liu http://arxiv.org/abs/2603.05529v1 Towards Neural Graph Data Management 2026-02-27T08:59:25Z While AI systems have made remarkable progress in processing unstructured text, structured data, such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to utilize effectively. We introduce NGDBench, a unified benchmark for evaluating neural graph database capabilities across five diverse domains, including finance, medicine, and AI agent tooling. Unlike prior benchmarks limited to elementary logical operations, NGDBench supports the full Cypher query language, enabling complex pattern matching, variable-length paths, and numerical aggregations, while incorporating realistic noise injection and dynamic data management operations. Our evaluation of state-of-the-art LLMs and RAG methods reveals significant limitations in structured reasoning, noise robustness, and analytical precision, establishing NGDBench as a critical testbed for advancing neural graph data management. Our code and data are available at https://github.com/HKUST-KnowComp/NGDBench. 2026-02-27T08:59:25Z https://github.com/HKUST-KnowComp/NGDBench Yufei Li Yisen Gao Jiaxin Bai Jiaxuan Xiong Haoyu Huang Zhongwei Xie Hong Ting Tsang Yangqiu Song http://arxiv.org/abs/2603.02253v1 Cross-Layer Decision Timing Orchestration in Cost-Based Database Systems: Resolving Structural Temporal Misalignment 2026-02-27T08:12:13Z This paper analyzes execution instability in traditional cost-based database management systems (DBMS) and identifies a structural timing misalignment between optimization and execution stages that contributes to tail-latency amplification. 
Beyond estimation accuracy and raw execution throughput, we argue that decision timing and the availability of runtime signals materially affect robustness under uncertainty. In conventional DBMS architectures, the optimizer relies on historical statistics, the executor observes runtime data distributions and resource states, and accelerators impose up-front transfer costs and amortization constraints. This temporal asynchrony can lead to rigid early-bound decisions that fail under input-scale shifts or stale statistics. We propose a cross-layer decision timing orchestration framework that shifts final decision authority from the compile-time optimizer to the runtime executor via selective late binding of operator-level choices. A Unified Risk Signal (URS) integrates optimizer uncertainty, execution-time observations, and accelerator cost signals without collapsing them into a single static cost model. Experiments on a modified PostgreSQL prototype evaluate (i) input-scale shift, (ii) stale-statistics drift, and (iii) GPU offload break-even regimes using controlled microbenchmarks. The proposed orchestration improves execution stability, reducing P99 latency by up to 20x under severe estimation drift while maintaining comparable median latency. 2026-02-27T08:12:13Z 10 pages, 7 figures. Experimental evaluation on a modified PostgreSQL prototype Ilsun Chang http://arxiv.org/abs/2602.23061v2 MoDora: Tree-Based Semi-Structured Document Analysis System 2026-02-27T08:09:20Z Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. 
However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora. 2026-02-26T14:48:49Z Extension of our SIGMOD 2026 paper. 
Please refer to source code available at https://github.com/weAIDB/MoDora Bangrui Xu Qihang Yao Zirui Tang Xuanhe Zhou Yeye He Shihan Yu Qianqian Xu Bin Wang Guoliang Li Conghui He Fan Wu http://arxiv.org/abs/2602.23571v1 OceanBase Bacchus: a High-Performance and Scalable Cloud-Native Shared Storage Architecture for Multi-Cloud 2026-02-27T00:46:24Z Although an increasing number of databases now embrace shared-storage architectures, current storage-disaggregated systems have yet to strike an optimal balance between cost and performance. In high-concurrency read/write scenarios, B+-tree-based shared storage struggles to efficiently absorb frequent in-place updates. Existing LSM-tree-backed disaggregated storage designs are hindered by the intricate implementation of cross-node shared-log mechanisms, for which no satisfactory solution yet exists. This paper presents OceanBase Bacchus, an LSM-tree architecture tailored for object storage provided by cloud vendors. The system sustains high-performance reads and writes while rendering compute nodes stateless through shared service-oriented PALF (Paxos-backed Append-only Log File system) logging and asynchronous background services. We employ a Shared Block Cache Service to flexibly utilize cache resources. Our design places log synchronization into a shared service, providing a novel solution for log sharing in storage-compute-separated databases. The architecture decouples functionality across modules, enabling elastic scaling where compute, cache, and storage resources can be resized rapidly and independently. Through experimental evaluation using multiple benchmark tests, including SysBench and TPC-H, we confirm that OceanBase Bacchus achieves performance comparable to or superior to that of HBase in OLTP scenarios and significantly outperforms StarRocks in OLAP workloads. 
Leveraging Bacchus's support for multi-cloud deployment and consistent performance, we not only retain high availability and competitive performance but also achieve substantial reductions in storage costs by 59% in OLTP scenarios and 89% in OLAP scenarios. 2026-02-27T00:46:24Z Quanqing Xu Mingqiang Zhuang Chuanhui Yang Quanwei Wan Fusheng Han Fanyu Kong Hao Liu Hu Xu Junyu Ye http://arxiv.org/abs/2602.23469v1 CACTUSDB: Unlock Co-Optimization Opportunities for SQL and AI/ML Inferences 2026-02-26T19:58:54Z There is a growing demand for supporting inference queries that combine Structured Query Language (SQL) and Artificial Intelligence / Machine Learning (AI/ML) model inferences in database systems, to avoid data denormalization and transfer, facilitate management, and alleviate privacy concerns. Co-optimization techniques for executing inference queries in database systems without accuracy loss fall into four categories: (O1) Relational algebra optimization treating AI/ML models as black-box user-defined functions (UDFs); (O2) Factorized AI/ML inferences; (O3) Tensor-relational transformation; and (O4) General cross-optimization techniques. However, we found that none of the existing database systems support all these techniques simultaneously, resulting in suboptimal performance. In this work, we identify two key challenges to address the above problem: (1) the difficulty of unifying all co-optimization techniques that involve disparate data and computation abstractions in one system; and (2) the lack of an optimizer that can effectively explore the exponential search space. To address these challenges, we present CactusDB, a novel system built atop Velox - a high-performance, UDF-centric database engine, open-sourced by Meta. CactusDB features a three-level Intermediate Representation (IR) that supports relational operators, expression operators, and ML functions to enable flexible optimization of arbitrary sub-computations. 
Additionally, we propose a novel Monte-Carlo Tree Search (MCTS)-based optimizer with query embedding, co-designed with our unique three-level IR, enabling shared and reusable optimization knowledge across different queries. Evaluation of 12 representative inference workloads and 2,000 randomly generated inference queries on well-known datasets, such as MovieLens and TPCx-AI, shows that CactusDB achieves up to 441 times speedup compared to alternative systems. 2026-02-26T19:58:54Z Accepted to ICDE 2026 as a full research paper Lixi Zhou Kanchan Chowdhury Lulu Xie Jaykumar Tandel Hong Guan Zhiwei Fan Xinwei Fu Jia Zou http://arxiv.org/abs/2602.23342v1 AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search 2026-02-26T18:48:29Z On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely recognized to be limited by the prohibitive I/O costs. Interestingly, we observed that on-disk graph-based index systems become compute-bound, not I/O-bound, as vector dimensionality rises (e.g., to hundreds or thousands of dimensions). This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. In particular, we first conduct performance analysis on existing on-disk graph-based index systems via the adapted roofline model, then devise a novel on-disk data layout in AlayaLaser that exploits SIMD instructions on modern CPUs to alleviate the compute bottleneck revealed by that analysis. 
Next, we design a suite of optimization techniques (e.g., degree-based node cache, cluster-based entry point selection, and early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experimental studies on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems. 2026-02-26T18:48:29Z The paper has been accepted by SIGMOD 2026 Weijian Chen Haotian Liu Yangshen Deng Long Xiang Liang Huang Gezi Li Bo Tang