https://arxiv.org/api/Ga+MCzJObV2xL2Si8uD84YHm+LM
2026-03-21T06:50:33Z
1136921015

http://arxiv.org/abs/2603.00537v1
Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions
2026-02-28T08:21:44Z
Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack's impact and empirically demonstrate that the loss under the greedy approach is often close to this bound.
Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
2026-02-28T08:21:44Z
SIGMOD 2026
Atsuki Sato, Martin Aumüller, Yusuke Matsui

http://arxiv.org/abs/2509.03226v2
BAMG: A Block-Aware Monotonic Graph Index for Disk-Based Approximate Nearest Neighbor Search
2026-02-28T07:38:21Z
Approximate Nearest Neighbor Search (ANNS) over high-dimensional vectors is a foundational problem in databases, where disk I/O often emerges as the dominant performance bottleneck at scale. To accelerate search, graph-based indexes rely on a proximity graph, where nodes represent vectors and edges guide the traversal toward the target. However, existing graph indexing solutions for disk-based ANNS typically either optimize the storage layout for a given graph or construct the graph independently of the storage layout, thus overlooking their interaction. In this paper, we bridge this gap by proposing the Block-aware Monotonic Relative Neighborhood Graph (BMRNG), theoretically guaranteeing the existence of I/O monotonic search paths. The core idea is to align the graph topology with the data placement by jointly considering both geometric distance and storage layout for edge selection. To address the scalability challenge of BMRNG construction, we further develop a practical and efficient variant, the Block-Aware Monotonic Graph (BAMG), which can be constructed in linear time from a monotonic graph considering the storage layout. BAMG integrates block-aware edge pruning with a decoupled storage design that separates raw vectors from the graph index, thereby maximizing block utilization and minimizing redundant disk reads. Additionally, we design a multi-layer navigation graph for adaptive and efficient query entry, along with a block-first search algorithm that prioritizes intra-block traversal to fully exploit each disk I/O operation.
Extensive experiments on real-world datasets show that BAMG can outperform state-of-the-art methods in search performance.
2025-09-03T11:33:31Z
Huiling Li, Xin Huang, Byron Choi, Jianliang Xu

http://arxiv.org/abs/2603.00509v1
COLE$^+$: Towards Practical Column-based Learned Storage for Blockchain Systems
2026-02-28T07:13:00Z
Blockchain provides a decentralized and tamper-resistant ledger for securely recording transactions across a network of untrusted nodes. While its transparency and integrity are beneficial, the substantial storage requirements for maintaining a complete transaction history present significant challenges. For example, Ethereum nodes require around 23TB of storage, with an annual growth rate of 4TB. Prior studies have employed various strategies to mitigate the storage challenges. Notably, COLE significantly reduces storage size and improves throughput by adopting a column-based design that incorporates a learned index, effectively eliminating data duplication in the storage layer. However, this approach has limitations in supporting chain reorganization during blockchain forks and state pruning to minimize storage overhead. In this paper, we propose COLE$^+$, an enhanced storage solution designed to address these limitations. COLE$^+$ incorporates a novel rewind-supported in-memory tree structure for handling chain reorganization, leveraging content-defined chunking (CDC) to maintain a consistent hash digest for each block. For on-disk storage, a new two-level Merkle Hash Tree (MHT) structure, called the prunable version tree, is developed to facilitate efficient state pruning.
Both theoretical and empirical analyses show the effectiveness of COLE$^+$ and its potential for practical application in real-world blockchain systems.
2026-02-28T07:13:00Z
Ce Zhang, Cheng Xu, Haibo Hu, Jianliang Xu

http://arxiv.org/abs/2603.00448v1
Semijoins of Annotated Relations
2026-02-28T04:05:43Z
The semijoin operation is a fundamental operation of relational algebra that has been extensively used in query processing. Furthermore, semijoins have been used to formulate desirable properties of acyclic schemas; in particular, a schema is acyclic if and only if it has a full reducer, i.e., a sequence of semijoins that converts a given collection of relations to a globally consistent collection of relations. In recent years, the study of acyclicity has been extended to annotated relations, where the annotations are values from some positive commutative monoid. So far, however, it has not been known if the characterization of acyclicity in terms of full reducers extends to annotated relations. Here, we develop a theory of semijoins of annotated relations. To this effect, we first introduce the notion of a semijoin function on a monoid and then characterize the positive commutative monoids for which a semijoin function exists. After this, we introduce the notion of a full reducer for a schema on a monoid and show that the following is true for every positive commutative monoid that has the inner consistency property: a schema is acyclic if and only if it has a full reducer on that monoid.
2026-02-28T04:05:43Z
21 pages
Phokion G. Kolaitis

http://arxiv.org/abs/2504.21291v2
Efficiency of Analysis of Transitive Relations using Query-Driven, Ground-and-Solve, and Fact-Driven Inference
2026-02-28T03:06:52Z
Logic rules allow analysis of complex relationships to be expressed easily, especially for transitive relations in critical applications.
However, understanding and predicting the efficiency of different inference methods remains challenging, even for the simplest rules, given different kinds of input data.
This paper analyzes the efficiency of all three types of well-known inference methods -- query-driven, ground-and-solve, and fact-driven -- along with their respective optimizations, and compares them with optimal complexities for the first time, for analyzing transitive graph relations. We also experiment with rule systems widely considered to have the best performance. We analyze all well-known rule variants and widely varying input graphs. The results include precisely calculated optimal time complexities; comparative analysis across different inference methods, rule variants, and graph types; confirmation with performance experiments; as well as discovery of a performance bug.
2025-04-30T03:55:48Z
Yanhong A. Liu, John Idogun, Scott D. Stoller, Yi Tong

http://arxiv.org/abs/2602.24271v1
NSHEDB: Noise-Sensitive Homomorphic Encrypted Database Query Engine
2026-02-27T18:41:10Z
Homomorphic encryption (HE) enables computations directly on encrypted data, offering strong cryptographic guarantees for secure and privacy-preserving data storage and query execution. However, despite its theoretical power, practical adoption of HE in database systems remains limited due to extreme ciphertext expansion, memory overhead, and the computational cost of bootstrapping, which resets noise levels for correctness.
This paper presents NSHEDB, a secure query processing engine designed to address these challenges at the system architecture level. NSHEDB uses word-level leveled HE (LHE) based on the BFV scheme to minimize ciphertext expansion and avoid costly bootstrapping. It introduces novel techniques for executing equality, range, and aggregation operations using purely homomorphic computation, without transciphering between different HE schemes (e.g., CKKS/BFV/TFHE) or relying on trusted hardware. Additionally, it incorporates a noise-aware query planner to extend computation depth while preserving security guarantees.
We implement and evaluate NSHEDB on real-world database workloads (TPC-H) and show that it achieves 20x-1370x speedup and a 73x storage reduction compared to state-of-the-art HE-based systems, while upholding 128-bit security in a semi-honest model with no key release or trusted components.
2026-02-27T18:41:10Z
Boram Jung, Yuliang Li, Hung-Wei Tseng

http://arxiv.org/abs/2601.06940v2
VISTA: Knowledge-Driven Vessel Trajectory Imputation with Repair Provenance
2026-02-27T14:17:03Z
Repairing incomplete trajectory data is essential for downstream spatio-temporal applications. Yet, existing repair methods focus solely on reconstruction without documenting the reasoning behind repair decisions, undermining trust in safety-critical applications where repaired trajectories affect operational decisions, such as in maritime anomaly detection and route planning. We introduce repair provenance - structured, queryable metadata that documents the full reasoning chain behind each repair - which transforms imputation from pure data recovery into a task that supports downstream decision-making. We propose VISTA (knowledge-driven interpretable vessel trajectory imputation), a framework that reliably equips repaired trajectories with repair provenance by grounding LLM reasoning in data-verified knowledge. Specifically, we formalize Structured Data-derived Knowledge (SDK), a knowledge model whose data-verifiable components can be validated against real data and used to anchor and constrain LLM-generated explanations. We organize SDK in a Structured Data-derived Knowledge Graph (SD-KG) and establish a data-knowledge-data loop for extraction, validation, and incremental maintenance over large-scale AIS data. A workflow management layer with parallel scheduling, fault tolerance, and redundancy control ensures consistent and efficient end-to-end processing.
Experiments on two large-scale AIS datasets show that VISTA achieves state-of-the-art accuracy, improving over baselines by 5-91% and reducing inference time by 51-93%, while producing repair provenance, whose interpretability is further validated through a case study and an interactive demo system.
2026-01-11T15:02:28Z
24 pages, 14 figures, 4 algorithms, 8 tables. Code available at https://github.com/hyLiu1994/VISTA
Hengyu Liu, Tianyi Li, Haoyu Wang, Kristian Torp, Tiancheng Zhang, Yushuai Li, Christian S. Jensen

http://arxiv.org/abs/2602.23999v1
GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search
2026-02-27T13:23:30Z
Approximate nearest neighbor search (ANNS) on GPUs is gaining increasing popularity for modern retrieval and recommendation workloads that operate over massive high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently yet often need many probes for high recall, straining memory bandwidth and compute. Aiming to simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirement for GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel to achieve high throughput with high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS Library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint.
For Recall approximately equal to 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA, while also constructing indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers on average over 2.7x higher throughput while avoiding accessing the raw vectors for reranking.
2026-02-27T13:23:30Z
Jifan Shi, Jianyang Gao, James Xia, Tamás Béla Fehér, Cheng Long

http://arxiv.org/abs/2602.15909v3
Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
2026-02-27T12:24:11Z
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a flow matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for this work, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives.
Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
2026-02-16T14:48:24Z
24 pages, 3 figures. Published as a conference paper at ICLR 2026
The Fourteenth International Conference on Learning Representations (ICLR 2026)
Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

http://arxiv.org/abs/2603.05529v1
Towards Neural Graph Data Management
2026-02-27T08:59:25Z
While AI systems have made remarkable progress in processing unstructured text, structured data, such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to use effectively. We introduce NGDBench, a unified benchmark for evaluating neural graph database capabilities across five diverse domains, including finance, medicine, and AI agent tooling. Unlike prior benchmarks limited to elementary logical operations, NGDBench supports the full Cypher query language, enabling complex pattern matching, variable-length paths, and numerical aggregations, while incorporating realistic noise injection and dynamic data management operations. Our evaluation of state-of-the-art LLMs and RAG methods reveals significant limitations in structured reasoning, noise robustness, and analytical precision, establishing NGDBench as a critical testbed for advancing neural graph data management.
Our code and data are available at https://github.com/HKUST-KnowComp/NGDBench.
2026-02-27T08:59:25Z
https://github.com/HKUST-KnowComp/NGDBench
Yufei Li, Yisen Gao, Jiaxin Bai, Jiaxuan Xiong, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Yangqiu Song

http://arxiv.org/abs/2603.02253v1
Cross-Layer Decision Timing Orchestration in Cost-Based Database Systems: Resolving Structural Temporal Misalignment
2026-02-27T08:12:13Z
This paper analyzes execution instability in traditional cost-based database management systems (DBMS) and identifies a structural timing misalignment between optimization and execution stages that contributes to tail-latency amplification. Beyond estimation accuracy and raw execution throughput, we argue that decision timing and the availability of runtime signals materially affect robustness under uncertainty.
In conventional DBMS architectures, the optimizer relies on historical statistics, the executor observes runtime data distributions and resource states, and accelerators impose up-front transfer costs and amortization constraints. This temporal asynchrony can lead to rigid early-bound decisions that fail under input-scale shifts or stale statistics.
We propose a cross-layer decision timing orchestration framework that shifts final decision authority from the compile-time optimizer to the runtime executor via selective late binding of operator-level choices. A Unified Risk Signal (URS) integrates optimizer uncertainty, execution-time observations, and accelerator cost signals without collapsing them into a single static cost model.
Experiments on a modified PostgreSQL prototype evaluate (i) input-scale shift, (ii) stale-statistics drift, and (iii) GPU offload break-even regimes using controlled microbenchmarks. The proposed orchestration improves execution stability, reducing P99 latency by up to 20x under severe estimation drift while maintaining comparable median latency.
2026-02-27T08:12:13Z
10 pages, 7 figures. Experimental evaluation on a modified PostgreSQL prototype
Ilsun Chang

http://arxiv.org/abs/2602.23061v2
MoDora: Tree-Based Semi-Structured Document Analysis System
2026-02-27T08:09:20Z
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document.
To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora.
2026-02-26T14:48:49Z
Extension of our SIGMOD 2026 paper. Please refer to source code available at https://github.com/weAIDB/MoDora
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu

http://arxiv.org/abs/2602.23571v1
OceanBase Bacchus: a High-Performance and Scalable Cloud-Native Shared Storage Architecture for Multi-Cloud
2026-02-27T00:46:24Z
Although an increasing number of databases now embrace shared-storage architectures, current storage-disaggregated systems have yet to strike an optimal balance between cost and performance. In high-concurrency read/write scenarios, B+-tree-based shared storage struggles to efficiently absorb frequent in-place updates. Existing LSM-tree-backed disaggregated storage designs are hindered by the intricate implementation of cross-node shared-log mechanisms, for which no satisfactory solution yet exists.
This paper presents OceanBase Bacchus, an LSM-tree architecture tailored for object storage provided by cloud vendors. The system sustains high-performance reads and writes while rendering compute nodes stateless through shared service-oriented PALF (Paxos-backed Append-only Log File system) logging and asynchronous background services. We employ a Shared Block Cache Service to flexibly utilize cache resources. Our design places log synchronization into a shared service, providing a novel solution for log sharing in storage-compute-separated databases. The architecture decouples functionality across modules, enabling elastic scaling where compute, cache, and storage resources can be resized rapidly and independently. Through experimental evaluation using multiple benchmark tests, including SysBench and TPC-H, we confirm that OceanBase Bacchus achieves performance comparable to or superior to that of HBase in OLTP scenarios and significantly outperforms StarRocks in OLAP workloads. Leveraging Bacchus's support for multi-cloud deployment and consistent performance, we not only retain high availability and competitive performance but also achieve substantial reductions in storage costs by 59% in OLTP scenarios and 89% in OLAP scenarios.
2026-02-27T00:46:24Z
Quanqing Xu, Mingqiang Zhuang, Chuanhui Yang, Quanwei Wan, Fusheng Han, Fanyu Kong, Hao Liu, Hu Xu, Junyu Ye

http://arxiv.org/abs/2602.23469v1
CACTUSDB: Unlock Co-Optimization Opportunities for SQL and AI/ML Inferences
2026-02-26T19:58:54Z
There is a growing demand for supporting inference queries that combine Structured Query Language (SQL) and Artificial Intelligence / Machine Learning (AI/ML) model inferences in database systems, to avoid data denormalization and transfer, facilitate management, and alleviate privacy concerns.
Co-optimization techniques for executing inference queries in database systems without accuracy loss fall into four categories: (O1) Relational algebra optimization treating AI/ML models as black-box user-defined functions (UDFs); (O2) Factorized AI/ML inferences; (O3) Tensor-relational transformation; and (O4) General cross-optimization techniques. However, we found that none of the existing database systems supports all these techniques simultaneously, resulting in suboptimal performance. In this work, we identify two key challenges to address the above problem: (1) the difficulty of unifying all co-optimization techniques that involve disparate data and computation abstractions in one system; and (2) the lack of an optimizer that can effectively explore the exponential search space. To address these challenges, we present CactusDB, a novel system built atop Velox - a high-performance, UDF-centric database engine, open-sourced by Meta. CactusDB features a three-level Intermediate Representation (IR) that supports relational operators, expression operators, and ML functions to enable flexible optimization of arbitrary sub-computations. Additionally, we propose a novel Monte-Carlo Tree Search (MCTS)-based optimizer with query embedding, co-designed with our unique three-level IR, enabling shared and reusable optimization knowledge across different queries.
Evaluation of 12 representative inference workloads and 2,000 randomly generated inference queries on well-known datasets, such as MovieLens and TPCx-AI, shows that CactusDB achieves up to 441 times speedup compared to alternative systems.
2026-02-26T19:58:54Z
Accepted to ICDE 2026 as a full research paper
Lixi Zhou, Kanchan Chowdhury, Lulu Xie, Jaykumar Tandel, Hong Guan, Zhiwei Fan, Xinwei Fu, Jia Zou

http://arxiv.org/abs/2602.23342v1
AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search
2026-02-26T18:48:29Z
On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely recognized to be limited by the prohibitive I/O costs. Interestingly, we observed that the performance of on-disk graph-based index systems becomes compute-bound, not I/O-bound, as vector dimensionality rises (e.g., to hundreds or thousands of dimensions). This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves substantial room for performance improvement.
In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. In particular, we first analyze the performance of existing on-disk graph-based index systems via an adapted roofline model. We then devise a novel on-disk data layout in AlayaLaser that exploits SIMD instructions on modern CPUs to alleviate the compute bottleneck revealed by this roofline analysis. Next, we design a suite of optimization techniques (e.g., degree-based node cache, cluster-based entry point selection, and early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experimental studies on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems.
2026-02-26T18:48:29Z
The paper has been accepted by SIGMOD 2026
Weijian Chen, Haotian Liu, Yangshen Deng, Long Xiang, Liang Huang, Gezi Li, Bo Tang