https://arxiv.org/api/7dd/Kife46i2Z/IXnwYSRTU6phc 2026-03-20T08:53:07Z 11369 0 15 http://arxiv.org/abs/2603.18836v1 Confidential Databases Without Cryptographic Mappings 2026-03-19T12:37:35Z Confidential databases (CDBs) are essential for enabling secure queries over sensitive data in untrusted cloud environments using confidential computing hardware. While adoption is growing, widespread deployment is hindered by high performance overhead from frequent synchronous cryptographic operations, which causes significant computational and memory bottlenecks. We present FEDB, a novel CDB design that removes cryptographic operations from the critical path. FEDB leverages crypto-free mappings, which maintain data-independent identifiers within the database while securely mapping them to plaintext secrets in a trusted domain. This paradigm shift reduces the runtime overhead by up to 78.0 times on industry-standard benchmarks including TPC-C and TPC-H. 2026-03-19T12:37:35Z Wenxuan Huang Zhanbo Wang Mingyu Li http://arxiv.org/abs/2603.18835v1 Tursio Database Search: How far are we from ChatGPT? 2026-03-19T12:36:48Z Business users need to search enterprise databases using natural language, just as they now search the web using ChatGPT or Perplexity. However, existing benchmarks -- designed for open-domain QA or text-to-SQL -- do not evaluate the end-to-end quality of such a search experience. We present an evaluation framework for structured database search that generates realistic banking queries across varying difficulty levels and assesses answer quality using relevance, safety, and conversational metrics via an LLM-as-judge approach. We apply this framework to compare Tursio, a database search platform, against ChatGPT and Perplexity on a credit union banking schema. Our results show that Tursio achieves answer relevancy statistically comparable to both baselines (97.8% vs. 98.1% on simple, 90.0% vs. 100.0% on medium, 89.5% vs. 
100.0% on hard questions), even though Tursio answers from a structured database while the baselines generate responses from the open web. We analyze the failure modes, identify database completeness as the primary bottleneck, and outline directions for improving both the evaluation methodology and the systems under evaluation. 2026-03-19T12:36:48Z Sulbha Jain Shivani Tripathi Shi Qiao Alekh Jindal http://arxiv.org/abs/2603.15023v3 SIMD-PAC-DB: Pretty Performant PAC Privacy 2026-03-19T12:17:44Z This work presents a highly optimized implementation of PAC-DB, a recent and promising database privacy model. We prove that our SIMD-PAC-DB can compute the same privatized answer with just a single query, instead of the 128 stochastic executions against different 50% database sub-samples needed by the original PAC-DB. Our key insight is that every bit of a hashed primary key can be seen to represent membership of such a sub-sample. We present new algorithms for approximate computation of stochastic aggregates based on these hashes, which, thanks to their SIMD-friendliness, run up to 40x faster than scalar equivalents. We release an open-source DuckDB community extension which includes a rewriter that PAC-privatizes arbitrary SQL queries. Our experiments on TPC-H, Clickbench, and SQLStorm evaluate thousands of queries in terms of performance and utility, significantly advancing the ease of use and functionality of privacy-aware data systems in practice. 2026-03-16T09:24:08Z Ilaria Battiston Dandan Yuan Xiaochen Zhu Peter Boncz http://arxiv.org/abs/2603.18709v1 Let's Play Tag: Linear Time Evaluation of Conjunctive Queries under TGD Constraints 2026-03-19T10:09:17Z We study the limits of linear time evaluation of conjunctive queries under constraints expressed as tuple-generating dependencies (TGDs), across several modes of query evaluation: single-testing, all-testing, counting, lexicographic direct access, and enumeration. 
While full classifications seem far beyond reach, we propose an approach that, for some evaluation modes and classes of TGDs, makes it possible to lift known dichotomies from the unconstrained setting. In particular, our approach applies to all mentioned evaluation modes except enumeration, when the constraints fall into one of two classes: non-recursive sets of TGDs in which every TGD uses at most binary relation symbols in the head or has at most two frontier variables; and frontier-guarded full TGDs. We further provide a collection of examples showcasing the challenges that arise for enumeration and for less restrictive classes of TGDs. 2026-03-19T10:09:17Z Nofar Carmeli Carsten Lutz Marcin Przybyłko http://arxiv.org/abs/2603.14994v2 DP-S4S: Accurate and Scalable Select-Join-Aggregate Query Processing with User-Level Differential Privacy 2026-03-19T09:22:10Z Answering Select-Join-Aggregate queries with DP is a fundamental problem with important applications in various domains. The current SOTA methods ensure user-level DP (i.e., the adversary cannot infer the presence or absence of any given individual user with high confidence) and achieve instance-optimal accuracy on the query results. However, these solutions involve solving expensive optimization programs, which may incur prohibitive computational overhead for large databases. One promising direction to achieve scalability is through sampling, which provides a tunable trade-off between result utility and computational costs. However, applying sampling to differentially private SJA processing is a challenge for two reasons. First, it is unclear what to sample, in order to achieve the best accuracy within a given computational budget. Second, prior solutions were not designed with sampling in mind, and their mathematical tool chains are not sampling-friendly. 
To our knowledge, the only known solution that applies sampling to private SJA processing is S&E, a recent proposal that (i) samples users and (ii) combines sampling directly with existing solutions to enforce DP. We show that both are suboptimal designs; consequently, even with a relatively high sample rate, the error incurred by S&E can be 10x higher than the underlying DP mechanism without sampling. Motivated by this, we propose Differentially Private Sampling for Scale (DP-S4S), a novel mechanism that addresses the above challenges by (i) sampling aggregation units instead of users, and (ii) laying the mathematical foundation for SJA processing under RDP, which composes more easily with sampling. Further, DP-S4S can answer both scalar and vector SJA queries. Extensive experiments on real data demonstrate that DP-S4S enables scalable SJA processing on large datasets under user-level DP, while maintaining high result utility. 2026-03-16T08:58:38Z Yuan Qiu Xiaokui Xiao Yin Yang 10.1145/3802042 http://arxiv.org/abs/2603.18654v1 QuaQue: Design and SQL Implementation of Condensed Algebra for Concurrent Versioning of Knowledge Graphs 2026-03-19T09:18:45Z The management of versioned knowledge graphs presents significant challenges, particularly in querying data across multiple versions efficiently. This paper introduces QuaQue, a key component of the ConVer-G system, which addresses this challenge by translating SPARQL (SPARQL Protocol and RDF Query Language) queries into SQL (Structured Query Language). QuaQue leverages a novel condensed algebra to operate on a relational model where versioning information is compactly stored using bitstrings. This approach allows for efficient querying of concurrent versions of knowledge graphs within a standard relational database system. 
We present the key concepts of our condensed algebra, detail the translation process from SPARQL algebra to SQL, and provide a comparative benchmark against a native RDF (Resource Description Framework) triple store, demonstrating the viability and performance benefits of our approach. 2026-03-19T09:18:45Z 11 pages, 6 figures, DBKDA conference DBKDA 2026, The Eighteenth International Conference on Advances in Databases, Knowledge, and Data Applications Jey Puget Gil Emmanuel Coquery John Samuel Gilles Gesquière http://arxiv.org/abs/2603.18650v1 DeePAW: A universal machine learning model for orbital-free ab initio calculations 2026-03-19T09:16:30Z Developing universal machine learning models for ab initio calculations is at the frontier of cutting-edge materials research in the new era of artificial intelligence. Here, we present the Deep Augment Way model (DeePAW), a universal machine learning (ML) model for orbital-free (OF) ab initio calculations based on density functional theory (DFT). DeePAW is currently the best OFDFT ML model according to three criteria: 1) covering the largest number of elements, 2) having the widest applicability to diverse crystal structures, and 3) achieving the highest prediction accuracy without further fine-tuning. These merits and innovations of DeePAW stem from its novel SE(3)-equivariant double message-passing neural networks. Besides predicting electron density distributions, DeePAW also predicts formation energies of crystals and therefore paves an efficient avenue for multiscale materials modeling beyond conventional electronic structure calculation methods. 2026-03-19T09:16:30Z Tianhao Su Shunbo Hu Yue Wu Runhai Oyang Xitao Wang Musen Li Jeffrey Reimers Tong-Yi Zhang http://arxiv.org/abs/2503.07884v2 LLMIA: An Out-of-the-Box Index Advisor via In-Context Learning with LLMs 2026-03-19T04:07:10Z Index recommendation is crucial for optimizing database performance.
However, existing heuristic- and learning-based methods often rely on inefficient exhaustive search and estimated costs, leading to low efficiency (due to the vast search space) and unsatisfactory actual latency (due to inaccurate estimations). Inspired by the refinement strategies of experienced DBAs, who efficiently identify and iteratively refine indexes with database feedback, we present LLMIA, an out-of-the-box, tuning-free index advisor leveraging large language models (LLMs) through in-context learning for index recommendation. LLMIA injects database expertise into the LLM using a high-quality demonstration pool and comprehensive workload feature extraction, while iteratively incorporating database feedback to guide the index refinement. This design enables LLMIA to emulate the decision-making process of expert DBAs: efficiently recommending and refining indexes for various workloads within just a few interactions with the DBMS. We validate LLMIA with extensive experiments on five standard OLAP benchmarks (TPC-H with different scales, JOB, TPC-DS, SSB), where it consistently outperforms or matches 12 baselines by producing superior index recommendations with minimal database interactions. Additionally, LLMIA demonstrates robust generalization on two real-world commercial workloads, delivering high-quality recommendations without the need for additional adaptation or retraining, highlighting its out-of-the-box capability. 2025-03-10T22:01:24Z Xinxin Zhao Xinmei Huang Haoyang Li Jing Zhang Shuai Wang Tieying Zhang Jianjun Chen Rui Shi Cuiping Li Hong Chen http://arxiv.org/abs/2603.18447v1 SODIUM: From Open Web Data to Queryable Databases 2026-03-19T03:17:56Z During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis can begin.
We formalize this process as the SODIUM task, where we conceptualize open domains such as the web as latent databases that must be systematically instantiated to support downstream querying. Solving SODIUM requires (1) conducting in-depth and specialized exploration of the open web, which is further strengthened by (2) exploiting structural correlations for systematic information extraction and (3) integrating collected information into coherent, queryable database instances. To quantify the challenges in automating SODIUM, we construct SODIUM-Bench, a benchmark of 105 tasks derived from published academic papers across 6 domains, where systems are tasked with exploring the open web to collect and aggregate data from diverse sources into structured tables. Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy. To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager. Powered by our proposed ATP-BFS algorithm and optimized through principled management of cached sources and navigation paths, SODIUM-Agent conducts deep and comprehensive web exploration and performs structurally coherent information extraction. SODIUM-Agent achieves 91.1% accuracy on SODIUM-Bench, outperforming the strongest baseline by approximately 2 times and the weakest by up to 73 times. 2026-03-19T03:17:56Z Chuxuan Hu Philip Li Maxwell Yang Daniel Kang http://arxiv.org/abs/2603.14899v2 A New Lower Bounding Paradigm and Tighter Lower Bounds for Elastic Similarity Measures 2026-03-19T03:17:44Z Elastic similarity measures are fundamental to time series similarity search because of their ability to handle temporal misalignments. These measures are inherently computationally expensive, therefore necessitating the use of lower bounds to prune unnecessary comparisons. 
This paper proposes a new Bipartite Graph Edge-Cover Paradigm for deriving lower bounds, which applies to a broad class of elastic similarity measures. This paradigm formulates lower bounding as a vertex-weighting problem on a weighted bipartite graph induced from the input time series. Under this paradigm, most of the existing lower bounds of elastic similarity measures can be viewed as simple instantiations. We further propose BGLB, an instantiation of the proposed paradigm that incorporates an additional augmentation term, yielding lower bounds that are provably tighter. Theoretical analysis and extensive experiments on 128 real-world datasets demonstrate that BGLB achieves the tightest known lower bounds for six elastic measures (ERP, MSM, TWED, LCSS, EDR, and SWALE). Moreover, BGLB remains highly competitive for DTW with a favorable trade-off between tightness and computational efficiency. In nearest neighbor search, integrating BGLB into filter pipelines consistently outperforms state-of-the-art methods, achieving speedups ranging from 24.6% to 84.9% across various elastic similarity measures. BGLB also delivers a significant acceleration in density-based clustering applications, validating its practical potential in time series similarity search tasks based on elastic similarity measures. 2026-03-16T07:02:18Z Zemin Chao Boyu Xiao Zitong Li Zhixin Qi Xianglong Liu Hongzhi Wang http://arxiv.org/abs/2601.09735v4 Multiverse: Transactional Memory with Dynamic Multiversioning 2026-03-18T21:12:49Z Software transactional memory (STM) allows programmers to easily implement concurrent data structures. STMs simplify atomicity. Recent STMs can achieve good performance for some workloads, but they have some limitations. In particular, STMs typically cannot support long-running reads which access a large number of addresses that are frequently updated.
Multiversioning is a common approach used to support this type of workload. However, multiversioning is often expensive and can reduce the performance of transactions where versioning is not necessary. In this work we present Multiverse, a new STM that combines the best of both unversioned TM and multiversioning. Multiverse features versioned and unversioned transactions which can execute concurrently. A main goal of Multiverse is to ensure that unversioned transactions achieve performance comparable to the state-of-the-art unversioned STM while still supporting the fast versioned transactions needed to enable long-running reads. We implement Multiverse and compare it against several STMs. Our experiments demonstrate that Multiverse achieves comparable or better performance for common-case workloads where there are no long-running reads. For workloads with long-running reads and frequent updates, Multiverse significantly outperforms existing STMs. In several cases for these workloads, the throughput of Multiverse is several orders of magnitude higher than that of other STMs. 2026-01-03T01:58:08Z Gaetano Coccimiglio Trevor Brown Srivatsan Ravi http://arxiv.org/abs/2603.15970v2 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models 2026-03-18T17:17:29Z Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby significantly broadening the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times.
This paper provides an extensive evaluation of a recent AI query approximation approach that enables low-cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter (AI.IF) operator and also important gains for semantic ranking (AI.RANK). The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve it across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training. 2026-03-16T22:42:45Z Yeounoh Chung Rushabh Desai Jian He Yu Xiao Thibaud Hottelier Yves-Laurent Kom Samo Pushkar Kadilkar Xianshun Chen Sam Idicula Fatma Özcan Alon Halevy Yannis Papakonstantinou 10.1145/3802002 http://arxiv.org/abs/2603.15722v2 A Framework and Prototype for a Navigable Map of Datasets in Engineering Design and Systems Engineering 2026-03-18T15:32:25Z The proliferation of data across the system lifecycle presents both a significant opportunity and a challenge for Engineering Design and Systems Engineering (EDSE). While this "digital thread" has the potential to drive innovation, the fragmented and inaccessible nature of existing datasets hinders method validation, limits reproducibility, and slows research progress. Unlike fields such as computer vision and natural language processing, which benefit from established benchmark ecosystems, engineering design research often relies on small, proprietary, or ad-hoc datasets.
This paper addresses this challenge by proposing a systematic framework for a "Map of Datasets in EDSE." The framework is built upon a multi-dimensional taxonomy designed to classify engineering datasets by domain, lifecycle stage, data type, and format, enabling faceted discovery. An architecture for an interactive discovery tool is detailed and demonstrated through a working prototype, employing a knowledge graph data model to capture rich semantic relationships between datasets, tools, and publications. An analysis of the current data landscape reveals underrepresented areas ("data deserts") in early-stage design and system architecture, as well as relatively well-represented areas ("data oases") in predictive maintenance and autonomous systems. The paper identifies key challenges in curation and sustainability and proposes mitigation strategies, laying the groundwork for a dynamic, community-driven resource to accelerate data-centric engineering research. 2026-03-16T17:08:20Z 10 pages, 3 figures, Submitted to ASME IDETC 2026-DAC22 H. Sinan Bank Daniel R. Herber http://arxiv.org/abs/2603.09927v3 How to Write to SSDs 2026-03-18T14:47:04Z This paper demonstrates that adopting out-of-place writes is essential for database systems to fully leverage SSD performance and extend SSD lifespan. We propose a set of out-of-place optimizations that collectively reduce write amplification across both the DBMS and SSD layers. We redesign the in-place, B-tree-based LeanStore to write out-of-place and support these optimizations, and evaluate it on diverse OLTP benchmarks, dataset sizes, and SSDs. The final design improves throughput by 1.65-2.24x and reduces flash writes per operation by 6.2-9.8x on YCSB-A. On TPC-C with 15,000 warehouses, throughput improves by 2.45x while flash writes decrease by 7.2x. Finally, we show that the architecture can seamlessly support novel SSD interfaces such as ZNS and FDP. 2026-03-10T17:21:58Z Accepted to PVLDB 2026. 
This arXiv version contains an additional section Bohyun Lee Tobias Ziegler Viktor Leis http://arxiv.org/abs/2603.17668v1 Halo: Domain-Aware Query Optimization for Long-Context Question Answering 2026-03-18T12:34:02Z Long-context question answering (QA) over lengthy documents is critical for applications such as financial analysis, legal review, and scientific research. Current approaches, such as processing entire documents via a single LLM call or retrieving relevant chunks via RAG, have two drawbacks. First, as context size increases, response quality can degrade, impacting accuracy. Second, iteratively processing hundreds of input documents can incur prohibitively high costs in API calls. To improve response quality and reduce the number of iterations needed to get the desired response, users tend to add domain knowledge to their prompts. However, existing systems fail to systematically capture and use this knowledge to guide query processing. Domain knowledge is treated as prompt tokens alongside the document: the LLM may or may not follow it, there is no reduction in computational cost, and when outputs are incorrect, users must manually iterate. We present Halo, a long-context QA framework that automatically extracts domain knowledge from user prompts and applies it as executable operators across a multi-stage query execution pipeline. Halo identifies three common forms of domain knowledge (where in the document to look, what content to ignore, and how to verify the answer) and applies each at the pipeline stage where it is most effective: pruning the document before chunk selection, filtering irrelevant chunks before inference, and ranking candidate responses after generation. To handle imprecise or invalid domain knowledge, Halo includes a fallback mechanism that detects low-quality operators at runtime and selectively disables them.
Our evaluation across finance, literature, and scientific datasets shows that Halo achieves up to 13% higher accuracy and 4.8x lower cost compared to baselines, and enables a lightweight open-source model to approach frontier LLM accuracy at 78x lower cost. 2026-03-18T12:34:02Z Pramod Chunduri Francisco Romero Ali Payani Kexin Rong Joy Arulraj