https://arxiv.org/api/knN/rZBlP3viycnz5jknWvlnSpA
2026-03-21T04:09:36Z
1136918015

http://arxiv.org/abs/2603.03126v1
The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment
2026-03-03T15:58:18Z
Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ($\geq 0.65$ threshold) with $F1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ - $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
2026-03-03T15:58:18Z
18 pages, 8 figures, 7 tables. Dataset DOI: 10.57967/hf/7850. Code: https://github.com/J0nasW/science-datalake
Jonas Wilinski

http://arxiv.org/abs/2603.03097v1
Odin: Multi-Signal Graph Intelligence for Autonomous Discovery in Knowledge Graphs
2026-03-03T15:34:02Z
We present Odin, the first production-deployed graph intelligence engine for autonomous discovery of meaningful patterns in knowledge graphs without prior specification. Unlike retrieval-based systems that answer predefined queries, Odin guides exploration through the COMPASS (Composite Oriented Multi-signal Path Assessment) score, a novel metric that combines (1) structural importance via Personalized PageRank, (2) semantic plausibility through Neural Probabilistic Logic Learning (NPLL) used as a discriminative filter rather than generative model, (3) temporal relevance with configurable decay, and (4) community-aware guidance through GNN-identified bridge entities and inter-community affinity scores. This multi-signal integration, particularly the bridge scoring mechanism, addresses the "echo chamber" problem where graph exploration becomes trapped in dense local communities. We formalize the autonomous discovery problem, prove theoretical properties of our scoring function, and demonstrate that beam search with multi-signal guidance achieves $O(b \cdot h)$ complexity while maintaining high recall compared to exhaustive exploration. To our knowledge, Odin represents the first autonomous discovery system deployed in regulated production environments (healthcare and insurance), demonstrating significant improvements in pattern discovery quality and analyst efficiency.
Our approach maintains complete provenance traceability -- a critical requirement for regulated industries where hallucination is unacceptable.
2026-03-03T15:34:02Z
Muyukani Kizito, Elizabeth Nyambere

http://arxiv.org/abs/2603.02995v1
A Graph-Native Approach to Normalization
2026-03-03T13:46:50Z
In recent years, knowledge graphs (KGs) - in particular in the form of labeled property graphs (LPGs) - have become essential components in a broad range of applications. Although the absence of strict schemas for KGs facilitates structural issues that lead to redundancies and subsequently to inconsistencies and anomalies, the problem of KG quality has so far received only little attention. Inspired by normalization using functional dependencies for relational data, a first approach exploiting dependencies within nodes has been proposed. However, real-world KGs also expose functional dependencies involving edges. In this paper, we therefore propose graph-native normalization, which considers dependencies within nodes, edges, and their combination. We define a range of graph-native normal forms and graph object functional dependencies and propose algorithms for transforming graphs accordingly. We evaluate our contributions using a broad range of synthetic and native graph datasets.
2026-03-03T13:46:50Z
Johannes Schrott, Maxime Jakubowski, Katja Hose

http://arxiv.org/abs/2603.02941v1
Timehash: Hierarchical Time Indexing for Efficient Business Hours Search
2026-03-03T12:49:41Z
Temporal range filtering is a critical operation in large-scale search systems, particularly for location-based services that need to filter businesses by operating hours. Traditional approaches either suffer from poor query performance (scope filtering) or index size explosion (minute-level indexing).
We present Timehash, a novel hierarchical time indexing algorithm that achieves over 99% reduction in index size compared to minute-level indexing while maintaining 100% precision. Timehash employs a flexible multi-resolution strategy with customizable hierarchical levels. Through empirical analysis on distributions from 12.6 million business records of a production location search service, we demonstrate a data-driven methodology for selecting optimal hierarchies tailored to specific data distributions.
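The idea of multi-resolution time indexing described above can be illustrated with a minimal sketch. This is not the Timehash algorithm from the paper; it is a generic greedy covering of an opening interval with aligned blocks, and the level sizes in `LEVELS`, the term format `"size:index"`, and the function names `cover_interval` and `is_open` are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical level sizes in minutes, coarsest first (assumption, not from the paper).
# The finest level must be 1 so that any minute-aligned interval can be covered exactly.
LEVELS: Tuple[int, ...] = (240, 60, 15, 5, 1)
MINUTES_PER_DAY = 24 * 60


def cover_interval(start: int, end: int, levels: Tuple[int, ...] = LEVELS) -> List[str]:
    """Cover [start, end) (minutes of day) with aligned blocks, preferring coarse levels."""
    if levels[-1] != 1:
        raise ValueError("finest level must be 1 minute")
    terms: List[str] = []
    pos = start
    while pos < end:
        for size in levels:
            # Use the coarsest block that starts at pos and fits inside the interval.
            if pos % size == 0 and pos + size <= end:
                terms.append(f"{size}:{pos // size}")
                pos += size
                break
    return terms


def is_open(doc_terms: Iterable[str], minute: int, levels: Tuple[int, ...] = LEVELS) -> bool:
    """A query time matches if any block containing it, at any level, was indexed."""
    indexed: Set[str] = set(doc_terms)
    return any(f"{size}:{minute // size}" in indexed for size in levels)


if __name__ == "__main__":
    # 09:00 to 18:00 is 540 minute-level terms but only a handful of hierarchical terms.
    terms = cover_interval(9 * 60, 18 * 60)
    print(len(terms), terms)
    print(is_open(terms, 12 * 60 + 30), is_open(terms, 18 * 60))
```

Because the indexed blocks are aligned and lie entirely inside the opening interval, a lookup in this sketch produces neither false positives nor false negatives. Break times or irregular schedules could be handled by taking the union of the covers of several intervals.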
We evaluated Timehash on up to 12.6 million synthetic POIs generated from production distributions. Experimental results show that a five-level hierarchy reduces index terms to 5.6 per document (99.1% reduction versus minute-level indexing), with zero false positives and zero false negatives. Scalability benchmarks confirm constant per-document cost from 100K to 12.6M POIs, while supporting complex scenarios such as break times and irregular schedules. Our approach is generalizable to various temporal filtering problems in search systems, e-commerce, and reservation platforms.
2026-03-03T12:49:41Z
12 pages, 2 figures, 8 tables. Submitted to VLDB 2026 Industry Track
Jinoh Kim, Jaewon Son

http://arxiv.org/abs/2511.04584v4
Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
2026-03-03T09:31:14Z
Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction where users are intentional about the degree to which they specify queries. We develop a principled framework based on a shared responsibility of query specification between user and system, distinguishing unambiguous and ambiguous cooperative queries, which systems can resolve through reasonable inference, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze queries in 15 datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system's accuracy nor for evaluating interpretation capabilities.
This conceptualization around cooperation in resolving queries informs how to design and evaluate natural language interfaces for tabular data analysis, for which we distill concrete directions for future research and broader implications.
2025-11-06T17:39:18Z
Accepted to the AI for Tabular Data workshop at EurIPS 2025
Daniel Gomm, Cornelius Wolff, Madelon Hulsebos

http://arxiv.org/abs/2509.12610v2
ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
2026-03-03T06:03:52Z
Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous documents and ad-hoc queries, while Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead. Therefore, we introduce \textsc{ScaleDoc}, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \textsc{ScaleDoc} leverages a LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, \textsc{ScaleDoc} proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets.
Our evaluations across three datasets demonstrate that \textsc{ScaleDoc} achieves over a 2$\times$ end-to-end speedup and reduces expensive LLM invocations by up to 85\%, making large-scale semantic analysis practical and efficient.
2025-09-16T03:18:06Z
Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

http://arxiv.org/abs/2603.02537v1
Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis
2026-03-03T02:51:26Z
With the development of large language models (LLMs), numerous studies integrate LLMs through operator-like components to enhance relational data processing tasks, e.g., filters with semantic predicates, knowledge-augmented table imputation, reasoning-driven entity matching and more challenging semantic query processing. These components invoke LLMs while preserving a relational input/output interface, which we refer to as LLM-Enhanced Relational Operators (LROs). From an operator perspective, unfortunately, these existing LROs suffer from fragmented definition, various implementation strategies and inadequate evaluation benchmarks. To this end, in this paper, we first establish a unified LRO taxonomy to align existing LROs, and categorize them into: Select, Match, Impute, Cluster and Order, along with their operands and implementation variants. Second, we design LROBench, a comprehensive benchmark featuring 290 single-LRO queries and 60 multi-LRO queries, spanning 27 databases across more than 10 domains. LROBench covers all operating logics and operand granularities in its single-LRO workload, and provides challenging multi-LRO queries stratified by query complexity. Based on these, we evaluate individual LROs under various implementations, deriving practical insights into LRO design choices and summarizing our empirical best practices.
We further compare the end-to-end performance of existing multi-LRO systems against an LRO suite instantiated with these best practices, in order to investigate how to design an effective LRO set for multi-LRO systems targeting complex semantic queries. Last, to facilitate future work, we outline promising future directions and open-source all benchmark data and evaluation code, available at https://github.com/LROBench/LROBench/.
2026-03-03T02:51:26Z
Yunxiang Su, Tianjing Zeng, Zhongjun Ding, Yin Lin, Rong Zhu, Zhewei Wei, Bolin Ding, Jingren Zhou

http://arxiv.org/abs/2511.16935v4
LinkML: An Open Data Modeling Framework
2026-03-02T23:31:16Z
Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult. LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics.
LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
2025-11-21T04:04:28Z
Fixed Table 3
Gigascience. Oxford University Press (OUP); 2025 Dec 12;(giaf152):giaf152
Sierra A. T. Moxon, Harold Solbrig, Nomi L. Harris, Patrick Kalita, Mark A. Miller, Sujay Patil, Kevin Schaper, Chris Bizon, J. Harry Caufield, Silvano Cirujano Cuesta, Corey Cox, Frank Dekervel, Damion M. Dooley, William D. Duncan, Tim Fliss, Sarah Gehrke, Adam S. L. Graefe, Harshad Hegde, AJ Ireland, Julius O. B. Jacobsen, Madan Krishnamurthy, Carlo Kroll, David Linke, Ryan Ly, Nicolas Matentzoglu, James A. Overton, Jonny L. Saunders, Deepak R. Unni, Gaurav Vaidya, Wouter-Michiel A. M. Vierdag, LinkML Community Contributors, Oliver Ruebel, Christopher G. Chute, Matthew H. Brush, Melissa A. Haendel, Christopher J. Mungall
10.1093/gigascience/giaf152

http://arxiv.org/abs/2603.02164v1
Catapults to the Rescue: Accelerating Vector Search by Exploiting Query Locality
2026-03-02T18:27:56Z
Graph-based indexing is the dominant approach for approximate nearest neighbor search in vector databases, offering high recall with low latency across billions of vectors. However, in such indices, the edge set of the proximity graph is only modified to reflect changes in the indexed data, never to adapt to the query workload. This is wasteful: real-world query streams exhibit strong spatial and temporal locality, yet every query must re-traverse the same intermediate hops from fixed or random entry points.
We present CatapultDB, a lightweight mechanism that, for the first time, dynamically determines where to begin the search in an ANN index on the fly, therefore exploiting query locality. CatapultDB injects shortcut edges called catapults that connect query regions to frequently visited destination nodes. Catapults are maintained as an additional layer on top of the graph, so the standard vector search algorithm remains unchanged: queries are simply routed to a better starting point when an appropriate catapult exists. This transparent design preserves the full feature set of the underlying system, including filtered search, dynamic insertions, and disk-resident indices. We implement CatapultDB and evaluate it using four workloads with varying amounts of bias. Our experiments show that CatapultDB increases throughput by up to 2.51x compared to DiskANN at equivalent or better recall, matches the efficiency of LSH-based approaches without sacrificing filtering or requiring index reconstruction, and adapts gracefully to workload shifts, unlike cache-based alternatives.
2026-03-02T18:27:56Z
Sami Abuzakuk, Anne-Marie Kermarrec, Rafael Pires, Mathis Randl, Martijn de Vos

http://arxiv.org/abs/2603.02150v1
Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
2026-03-02T18:12:02Z
The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task in extracting information about the crime, the criminal, or law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case-study of Crime-related zero- and Few-Shot NER, and a general Crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We address the quality of the case-study and the annotated data with experiments on Zero and Few-Shot settings with State-of-the-Art NER models as well as generalist and commonly used Large Language Models.
2026-03-02T18:12:02Z
Sent for review at the main conference of the International Conference of Document Analysis and Recognition (ICDAR) 2026
Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Daniel DeAlcala, Gonzalo Mancera, Javier Irigoyen, Ruben Tolosana, Oscar Delgado, Francisco Jurado, Alvaro Ortigosa

http://arxiv.org/abs/2603.02108v1
Milliscale: Fast Commit on Low-Latency Object Storage
2026-03-02T17:25:39Z
With millisecond-level latency and support for mutable objects, recent low-latency object storage services as represented by Amazon S3 Express One Zone have become an attractive option for OLTP engines to directly commit transactions and persist operational data with transparent strong consistency, high durability and high availability. But a naïve adoption can still lead to high commit latency due to idiosyncrasies of S3 Express One Zone and modern decentralized logging.
This paper presents Milliscale, a memory-optimized OLTP engine for low-latency object storage. Milliscale optimizes commit latency with new techniques that lower commit delays and reduce the number of object access requests. Our evaluation using representative benchmarks shows that Milliscale delivers much lower commit latency than baselines while sustaining high throughput.
2026-03-02T17:25:39Z
Jiatang Zhou, Kaisong Huang, Tianzheng Wang

http://arxiv.org/abs/2603.02081v1
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
2026-03-02T17:03:43Z
Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems.
We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads. We use queries from the well-known TPC-H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state-of-the-art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.
2026-03-02T17:03:43Z
Jiale Lao, Immanuel Trummer

http://arxiv.org/abs/2601.14176v2
ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery
2026-03-02T16:58:19Z
The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce \textbf{ReSearch}, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals.
These results demonstrate the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.
2026-01-20T17:27:12Z
Youran Sun, Yixin Wen, Haizhao Yang

http://arxiv.org/abs/2603.02001v1
Bespoke OLAP: Synthesizing Workload-Specific One-size-fits-one Database Engines
2026-03-02T15:51:45Z
Modern OLAP engines are designed to support arbitrary analytical workloads, but this generality incurs structural overhead, including runtime schema interpretation, indirection layers, and abstraction boundaries, even in highly optimized systems. An engine specialized to a fixed workload can eliminate these costs and exploit workload-specific data structures and execution algorithms for substantially higher performance. Historically, constructing such bespoke engines has been economically impractical due to the high manual engineering effort. Recent advances in LLM-based code synthesis challenge this tradeoff by enabling automated system generation. However, naively prompting an LLM to produce a database engine does not yield a correct or efficient design, as effective synthesis requires systematic performance feedback, structured refinement, and careful management of deep architectural interdependencies. We present Bespoke OLAP, a fully autonomous synthesis pipeline for constructing high-performance database engines tightly tailored to a given workload. Our approach integrates iterative performance evaluation and automated validation to guide synthesis from storage to query execution. We demonstrate that Bespoke OLAP can generate a workload-specific engine from scratch within minutes to hours, achieving order-of-magnitude speedups over modern general-purpose systems such as DuckDB.
2026-03-02T15:51:45Z
Johannes Wehrstein, Timo Eckmann, Matthias Jasny, Carsten Binnig

http://arxiv.org/abs/2501.16759v3
Are Joins over LSM-Trees Ready? Take RocksDB as an Example
2026-03-02T12:37:35Z
LSM-tree-based data stores are widely adopted in industries for their excellent performance. As data scales increase, disk-based join operations become indispensable yet costly for the database, making the selection of suitable join methods crucial for system optimization. Current LSM-based stores generally adhere to conventional relational database practices and support only a limited number of join methods. However, the LSM-tree delivers distinct read and write efficiency compared to the relational databases, which could accordingly impact the performance of various join methods. Therefore, it is necessary to reconsider the selection of join methods in this context to fully explore the potential of various join algorithms and index designs. In this work, we present a systematic study and an exhaustive benchmark for joins over LSM-trees. We define a configuration space for join methods, encompassing various join algorithms, secondary index types, and consistency strategies. We also summarize a theoretical analysis to evaluate the overhead of each join method for an in-depth understanding. Furthermore, we implement all join methods in the configuration space on a unified platform and compare their performance through extensive experiments. Our theoretical and experimental results yield several insights and takeaways tailored to joins in LSM-based stores that aid developers in choosing proper join methods based on their working conditions.
2025-01-28T07:30:35Z
Accepted by VLDB 2025
Proc. VLDB Endow. 18, 4 (2025), 1077-1090
Weiping Yu, Fan Wang, Xuwei Zhang, Siqiang Luo
10.14778/3717755.3717767