http://arxiv.org/abs/2603.03126v1 The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment 2026-03-03T15:58:18Z

Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ($\geq 0.65$ threshold) with $F1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ - $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
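
As an illustration of the DOI-normalization step mentioned above, the following is a minimal sketch, not the resource's actual implementation. The function names `normalize_doi` and `link_by_doi` and the record layout (dictionaries with a `doi` key) are assumptions made for this example; the only fact relied on is that DOIs are case-insensitive and often appear with a resolver or `doi:` prefix.

```python
import re

# Common prefixes that wrap a bare DOI (assumed list, not exhaustive).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if the value does not look like a DOI."""
    if raw is None:
        return None
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    return doi if doi.startswith("10.") and "/" in doi else None


def link_by_doi(left, right):
    """Pair up records from two sources whose normalized DOIs match."""
    index = {}
    for rec in right:
        key = normalize_doi(rec.get("doi"))
        if key:
            index[key] = rec
    pairs = []
    for rec in left:
        key = normalize_doi(rec.get("doi"))
        if key in index:
            pairs.append((rec, index[key]))
    return pairs
```

Each source keeps its own fields in this sketch; only the join key is normalized, which mirrors the idea of linking sources while preserving source-level schemas.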

2026-03-03T15:58:18Z 18 pages, 8 figures, 7 tables. Dataset DOI: 10.57967/hf/7850. Code: https://github.com/J0nasW/science-datalake Jonas Wilinski http://arxiv.org/abs/2603.03097v1 Odin: Multi-Signal Graph Intelligence for Autonomous Discovery in Knowledge Graphs 2026-03-03T15:34:02Z

We present Odin, the first production-deployed graph intelligence engine for autonomous discovery of meaningful patterns in knowledge graphs without prior specification. Unlike retrieval-based systems that answer predefined queries, Odin guides exploration through the COMPASS (Composite Oriented Multi-signal Path Assessment) score, a novel metric that combines (1) structural importance via Personalized PageRank, (2) semantic plausibility through Neural Probabilistic Logic Learning (NPLL) used as a discriminative filter rather than generative model, (3) temporal relevance with configurable decay, and (4) community-aware guidance through GNN-identified bridge entities and inter-community affinity scores. This multi-signal integration, particularly the bridge scoring mechanism, addresses the "echo chamber" problem where graph exploration becomes trapped in dense local communities. We formalize the autonomous discovery problem, prove theoretical properties of our scoring function, and demonstrate that beam search with multi-signal guidance achieves $O(b \cdot h)$ complexity while maintaining high recall compared to exhaustive exploration. To our knowledge, Odin represents the first autonomous discovery system deployed in regulated production environments (healthcare and insurance), demonstrating significant improvements in pattern discovery quality and analyst efficiency. Our approach maintains complete provenance traceability -- a critical requirement for regulated industries where hallucination is unacceptable.

2026-03-03T15:34:02Z Muyukani Kizito Elizabeth Nyambere http://arxiv.org/abs/2603.02995v1 A Graph-Native Approach to Normalization 2026-03-03T13:46:50Z

In recent years, knowledge graphs (KGs) - in particular in the form of labeled property graphs (LPGs) - have become essential components in a broad range of applications. Although the absence of strict schemas for KGs facilitates structural issues that lead to redundancies and subsequently to inconsistencies and anomalies, the problem of KG quality has so far received little attention. Inspired by normalization using functional dependencies for relational data, a first approach exploiting dependencies within nodes has been proposed. However, real-world KGs also expose functional dependencies involving edges. In this paper, we therefore propose graph-native normalization, which considers dependencies within nodes, edges, and their combination. We define a range of graph-native normal forms and graph object functional dependencies and propose algorithms for transforming graphs accordingly. We evaluate our contributions using a broad range of synthetic and native graph datasets.
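
To make the notion of a functional dependency within nodes concrete, here is a minimal sketch of checking whether a dependency X → Y holds over node properties. It covers only the node-level case and is not the paper's algorithm; the function name `fd_violations` and the representation of nodes as plain dictionaries are assumptions for illustration.

```python
def fd_violations(nodes, lhs, rhs):
    """Return the lhs value combinations that map to more than one rhs value.

    nodes: iterable of property dictionaries (one per node).
    lhs, rhs: sequences of property names; the dependency tested is lhs -> rhs.
    Nodes missing any of the involved properties are skipped.
    """
    seen = {}
    for node in nodes:
        if any(prop not in node for prop in (*lhs, *rhs)):
            continue
        key = tuple(node[prop] for prop in lhs)
        value = tuple(node[prop] for prop in rhs)
        seen.setdefault(key, set()).add(value)
    # The dependency holds exactly when no key has two different values.
    return {key: values for key, values in seen.items() if len(values) > 1}
```

An empty result means the dependency holds on the given nodes; a non-empty result points at the redundancy that a normalization step would need to resolve.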

2026-03-03T13:46:50Z Johannes Schrott Maxime Jakubowski Katja Hose http://arxiv.org/abs/2603.02941v1 Timehash: Hierarchical Time Indexing for Efficient Business Hours Search 2026-03-03T12:49:41Z

Temporal range filtering is a critical operation in large-scale search systems, particularly for location-based services that need to filter businesses by operating hours. Traditional approaches either suffer from poor query performance (scope filtering) or index size explosion (minute-level indexing). We present Timehash, a novel hierarchical time indexing algorithm that achieves over 99% reduction in index size compared to minute-level indexing while maintaining 100% precision. Timehash employs a flexible multi-resolution strategy with customizable hierarchical levels. Through empirical analysis on distributions from 12.6 million business records of a production location search service, we demonstrate a data-driven methodology for selecting optimal hierarchies tailored to specific data distributions. We evaluated Timehash on up to 12.6 million synthetic POIs generated from production distributions. Experimental results show that a five-level hierarchy reduces index terms to 5.6 per document (99.1% reduction versus minute-level indexing), with zero false positives and zero false negatives. Scalability benchmarks confirm constant per-document cost from 100K to 12.6M POIs, while supporting complex scenarios such as break times and irregular schedules. Our approach is generalizable to various temporal filtering problems in search systems, e-commerce, and reservation platforms.
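
The multi-resolution idea can be sketched as follows: an opening interval is covered greedily with the largest aligned time blocks that fit, and a query minute is matched by looking up the one block per level that contains it. This is an illustrative sketch only, not the Timehash algorithm itself. The level sizes in `LEVELS` (four levels, in minutes) and the function names are assumptions; the abstract reports a five-level hierarchy but does not give its block sizes.

```python
# Block sizes in minutes, largest first (assumed values for illustration).
LEVELS = (240, 60, 15, 5)


def index_terms(start, end, levels=LEVELS):
    """Cover [start, end) in minutes-of-day with aligned blocks, largest first."""
    terms = []
    t = start
    while t < end:
        for size in levels:
            # A block is usable if it starts exactly at t and fits before end.
            if t % size == 0 and t + size <= end:
                terms.append((size, t // size))
                t += size
                break
        else:
            raise ValueError("interval boundaries must align to the finest level")
    return terms


def query_terms(minute, levels=LEVELS):
    """The single block per level that contains the given minute."""
    return [(size, minute // size) for size in levels]


def is_open(terms, minute, levels=LEVELS):
    """True if any block containing the minute was indexed for the document."""
    indexed = set(terms)
    return any(term in indexed for term in query_terms(minute, levels))
```

With these assumed levels, a 09:00 to 17:00 interval (`index_terms(540, 1020)`) is covered by 5 terms instead of 480 minute-level terms. Because the blocks exactly tile the interval, the lookup yields neither false positives nor false negatives as long as the boundaries align to the finest level.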

2026-03-03T12:49:41Z 12 pages, 2 figures, 8 tables. Submitted to VLDB 2026 Industry Track Jinoh Kim Jaewon Son http://arxiv.org/abs/2511.04584v4 Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis 2026-03-03T09:31:14Z

Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction where users are intentional about the degree to which they specify queries. We develop a principled framework based on a shared responsibility of query specification between user and system, distinguishing unambiguous and ambiguous cooperative queries, which systems can resolve through reasonable inference, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze queries in 15 datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system's accuracy nor for evaluating interpretation capabilities. This conceptualization around cooperation in resolving queries informs how to design and evaluate natural language interfaces for tabular data analysis, for which we distill concrete directions for future research and broader implications.

2025-11-06T17:39:18Z Accepted to the AI for Tabular Data workshop at EurIPS 2025 Daniel Gomm Cornelius Wolff Madelon Hulsebos http://arxiv.org/abs/2509.12610v2 ScaleDoc: Scaling LLM-based Predicates over Large Document Collections 2026-03-03T06:03:52Z

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Although Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead given very large document collections and ad-hoc queries. We therefore introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicate decision scores; (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2$\times$ end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
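
The cascade described above can be sketched generically: a cheap proxy score decides the clear cases, and only documents whose score falls between two thresholds are forwarded to the expensive model. This is a minimal sketch under stated assumptions, not ScaleDoc's implementation. The names `cascade_filter`, `proxy_score`, and `llm_predicate` are hypothetical, and the fixed thresholds `low` and `high` stand in for the adaptive policy the abstract mentions.

```python
def cascade_filter(docs, proxy_score, llm_predicate, low, high):
    """Filter docs with a proxy model, deferring ambiguous cases to an LLM.

    proxy_score: callable returning a score in [0, 1] for a document.
    llm_predicate: callable returning True/False (the expensive check).
    Scores >= high are accepted and scores <= low are rejected without
    calling the LLM; everything in between is forwarded.
    """
    accepted = []
    llm_calls = 0
    for doc in docs:
        score = proxy_score(doc)
        if score >= high:
            accepted.append(doc)
        elif score > low:
            llm_calls += 1
            if llm_predicate(doc):
                accepted.append(doc)
    return accepted, llm_calls
```

Widening the gap between `low` and `high` trades more LLM calls for fewer proxy mistakes; choosing that gap to meet an accuracy target is the part this sketch leaves out.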

2025-09-16T03:18:06Z Hengrui Zhang Yulong Hui Yihao Liu Huanchen Zhang http://arxiv.org/abs/2603.02537v1 Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis 2026-03-03T02:51:26Z

With the development of large language models (LLMs), numerous studies integrate LLMs through operator-like components to enhance relational data processing tasks, e.g., filters with semantic predicates, knowledge-augmented table imputation, reasoning-driven entity matching, and more challenging semantic query processing. These components invoke LLMs while preserving a relational input/output interface, which we refer to as LLM-Enhanced Relational Operators (LROs). From an operator perspective, however, existing LROs suffer from fragmented definitions, divergent implementation strategies, and inadequate evaluation benchmarks. To this end, we first establish a unified LRO taxonomy to align existing LROs, categorizing them into Select, Match, Impute, Cluster, and Order, along with their operands and implementation variants. Second, we design LROBench, a comprehensive benchmark featuring 290 single-LRO queries and 60 multi-LRO queries, spanning 27 databases across more than 10 domains. LROBench covers all operating logics and operand granularities in its single-LRO workload, and provides challenging multi-LRO queries stratified by query complexity. Based on these, we evaluate individual LROs under various implementations, deriving practical insights into LRO design choices and summarizing our empirical best practices. We further compare the end-to-end performance of existing multi-LRO systems against an LRO suite instantiated with these best practices, in order to investigate how to design an effective LRO set for multi-LRO systems targeting complex semantic queries. Last, to facilitate future work, we outline promising future directions and open-source all benchmark data and evaluation code, available at https://github.com/LROBench/LROBench/.

2026-03-03T02:51:26Z Yunxiang Su Tianjing Zeng Zhongjun Ding Yin Lin Rong Zhu Zhewei Wei Bolin Ding Jingren Zhou http://arxiv.org/abs/2511.16935v4 LinkML: An Open Data Modeling Framework 2026-03-02T23:31:16Z

Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult. LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics. LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.

2025-11-21T04:04:28Z Fixed Table 3 Gigascience. Oxford University Press (OUP); 2025 Dec 12;(giaf152):giaf152 Sierra A. T. Moxon Harold Solbrig Nomi L. Harris Patrick Kalita Mark A. Miller Sujay Patil Kevin Schaper Chris Bizon J. Harry Caufield Silvano Cirujano Cuesta Corey Cox Frank Dekervel Damion M. Dooley William D. Duncan Tim Fliss Sarah Gehrke Adam S. L. Graefe Harshad Hegde AJ Ireland Julius O. B. Jacobsen Madan Krishnamurthy Carlo Kroll David Linke Ryan Ly Nicolas Matentzoglu James A. Overton Jonny L. Saunders Deepak R. Unni Gaurav Vaidya Wouter-Michiel A. M. Vierdag LinkML Community Contributors Oliver Ruebel Christopher G. Chute Matthew H. Brush Melissa A. Haendel Christopher J. Mungall 10.1093/gigascience/giaf152 http://arxiv.org/abs/2603.02164v1 Catapults to the Rescue: Accelerating Vector Search by Exploiting Query Locality 2026-03-02T18:27:56Z

Graph-based indexing is the dominant approach for approximate nearest neighbor search in vector databases, offering high recall with low latency across billions of vectors. However, in such indices, the edge set of the proximity graph is only modified to reflect changes in the indexed data, never to adapt to the query workload. This is wasteful: real-world query streams exhibit strong spatial and temporal locality, yet every query must re-traverse the same intermediate hops from fixed or random entry points. We present CatapultDB, a lightweight mechanism that, for the first time, dynamically determines where to begin the search in an ANN index on the fly, therefore exploiting query locality. CatapultDB injects shortcut edges called catapults that connect query regions to frequently visited destination nodes. Catapults are maintained as an additional layer on top of the graph, so the standard vector search algorithm remains unchanged: queries are simply routed to a better starting point when an appropriate catapult exists. This transparent design preserves the full feature set of the underlying system, including filtered search, dynamic insertions, and disk-resident indices. We implement CatapultDB and evaluate it using four workloads with varying amounts of bias. Our experiments show that CatapultDB increases throughput by up to 2.51x compared to DiskANN at equivalent or better recall, matches the efficiency of LSH-based approaches without sacrificing filtering or requiring index reconstruction, and adapts gracefully to workload shifts, unlike cache-based alternatives.

2026-03-02T18:27:56Z Sami Abuzakuk Anne-Marie Kermarrec Rafael Pires Mathis Randl Martijn de Vos http://arxiv.org/abs/2603.02150v1 Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER) 2026-03-02T18:12:02Z

The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task by extracting information about the crime, the criminal, or the law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case study of crime-related zero- and few-shot NER, and a general crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task, extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 coarse crime entity types and a total of 22 fine-grained entity types. We assess the quality of the case study and the annotated data with experiments in zero- and few-shot settings with state-of-the-art NER models as well as generalist and commonly used Large Language Models.

2026-03-02T18:12:02Z Sent for review at the main conference of the International Conference of Document Analysis and Recognition (ICDAR) 2026 Miguel Lopez-Duran Julian Fierrez Aythami Morales Daniel DeAlcala Gonzalo Mancera Javier Irigoyen Ruben Tolosana Oscar Delgado Francisco Jurado Alvaro Ortigosa http://arxiv.org/abs/2603.02108v1 Milliscale: Fast Commit on Low-Latency Object Storage 2026-03-02T17:25:39Z

With millisecond-level latency and support for mutable objects, recent low-latency object storage services as represented by Amazon S3 Express One Zone have become an attractive option for OLTP engines to directly commit transactions and persist operational data with transparent strong consistency, high durability and high availability. But a naïve adoption can still lead to high commit latency due to idiosyncrasies of S3 Express One Zone and modern decentralized logging. This paper presents Milliscale, a memory-optimized OLTP engine for low-latency object storage. Milliscale optimizes commit latency with new techniques that lower commit delays and reduce the number of object access requests. Our evaluation using representative benchmarks shows that Milliscale delivers much lower commit latency than baselines while sustaining high throughput.

2026-03-02T17:25:39Z Jiatang Zhou Kaisong Huang Tianzheng Wang http://arxiv.org/abs/2603.02081v1 GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered 2026-03-02T17:03:43Z

Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems. We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources. We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads. We use queries from the well-known TPC-H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state-of-the-art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.

2026-03-02T17:03:43Z Jiale Lao Immanuel Trummer http://arxiv.org/abs/2601.14176v2 ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery 2026-03-02T16:58:19Z

The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results demonstrate the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.

2026-01-20T17:27:12Z Youran Sun Yixin Wen Haizhao Yang http://arxiv.org/abs/2603.02001v1 Bespoke OLAP: Synthesizing Workload-Specific One-size-fits-one Database Engines 2026-03-02T15:51:45Z

Modern OLAP engines are designed to support arbitrary analytical workloads, but this generality incurs structural overhead, including runtime schema interpretation, indirection layers, and abstraction boundaries, even in highly optimized systems. An engine specialized to a fixed workload can eliminate these costs and exploit workload-specific data structures and execution algorithms for substantially higher performance. Historically, constructing such bespoke engines has been economically impractical due to the high manual engineering effort. Recent advances in LLM-based code synthesis challenge this tradeoff by enabling automated system generation. However, naively prompting an LLM to produce a database engine does not yield a correct or efficient design, as effective synthesis requires systematic performance feedback, structured refinement, and careful management of deep architectural interdependencies. We present Bespoke OLAP, a fully autonomous synthesis pipeline for constructing high-performance database engines tightly tailored to a given workload. Our approach integrates iterative performance evaluation and automated validation to guide synthesis from storage to query execution. We demonstrate that Bespoke OLAP can generate a workload-specific engine from scratch within minutes to hours, achieving order-of-magnitude speedups over modern general-purpose systems such as DuckDB.

2026-03-02T15:51:45Z Johannes Wehrstein Timo Eckmann Matthias Jasny Carsten Binnig http://arxiv.org/abs/2501.16759v3 Are Joins over LSM-Trees Ready? Take RocksDB as an Example 2026-03-02T12:37:35Z

LSM-tree-based data stores are widely adopted in industry for their excellent performance. As data scales increase, disk-based join operations become indispensable yet costly for the database, making the selection of suitable join methods crucial for system optimization. Current LSM-based stores generally adhere to conventional relational database practices and support only a limited number of join methods. However, the LSM-tree delivers distinct read and write efficiency compared to relational databases, which could in turn affect the performance of various join methods. Therefore, it is necessary to reconsider the selection of join methods in this context to fully explore the potential of various join algorithms and index designs. In this work, we present a systematic study and an exhaustive benchmark for joins over LSM-trees. We define a configuration space for join methods, encompassing various join algorithms, secondary index types, and consistency strategies. We also present a theoretical analysis that evaluates the overhead of each join method for an in-depth understanding. Furthermore, we implement all join methods in the configuration space on a unified platform and compare their performance through extensive experiments. Our theoretical and experimental results yield several insights and takeaways tailored to joins in LSM-based stores that aid developers in choosing proper join methods based on their working conditions.

2025-01-28T07:30:35Z Accepted by VLDB 2025 Proc. VLDB Endow. 18, 4 (2025), 1077-1090 Weiping Yu Fan Wang Xuwei Zhang Siqiang Luo 10.14778/3717755.3717767