https://arxiv.org/api/Oxbj0pZRrG6y3Zgu60K6z5vSzAc 2026-03-20T23:34:43Z 11369 135 15 http://arxiv.org/abs/2512.22364v2 Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL 2026-03-07T21:01:20Z

While Text-to-SQL systems achieve high accuracy, existing efficiency metrics like the Valid Efficiency Score prioritize execution time, a metric we show is fundamentally decoupled from consumption-based cloud billing. This paper evaluates cloud query execution cost trade-offs between reasoning and non-reasoning Large Language Models by performing 180 Text-to-SQL query executions across six LLMs on Google BigQuery using the 230 GB StackOverflow dataset. Our analysis reveals that reasoning models process 44.5% fewer bytes than non-reasoning counterparts while maintaining equivalent correctness at 96.7% to 100%, and that execution time correlates weakly with query cost at $r=0.16$, indicating that speed optimization does not imply cost efficiency. Non-reasoning models also exhibit extreme cost variance of up to 3.4$\times$, producing outliers exceeding 36 GB per query, over 20$\times$ the best model's 1.8 GB average, due to missing partition filters and inefficient joins. We identify these prevalent inefficiency patterns and provide deployment guidelines to mitigate financial risks in cost-sensitive enterprise environments.
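
The decoupling of execution time from bytes-scanned billing can be illustrated with a small sketch. Everything below is an assumption for illustration only: the per-TiB rate is a hypothetical placeholder rather than a quoted price, the runtimes are made up, and `query_cost_usd` and `pearson` are hypothetical helper names, not part of the paper.

```python
import math

# Hypothetical on-demand rate in USD per TiB scanned (assumption, not a quoted price).
USD_PER_TIB = 6.25
BYTES_PER_TIB = 2 ** 40


def query_cost_usd(bytes_processed, usd_per_tib=USD_PER_TIB):
    """Cost under a bytes-scanned billing model; execution time plays no role."""
    return bytes_processed / BYTES_PER_TIB * usd_per_tib


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up runs: (execution seconds, bytes processed). Only the 1.8 GB and 36 GB
# magnitudes echo the abstract; the runtimes are invented for illustration.
runs = [(2.1, 1_800_000_000), (1.9, 36_000_000_000),
        (4.0, 2_500_000_000), (3.2, 9_000_000_000)]
seconds = [s for s, _ in runs]
costs = [query_cost_usd(b) for _, b in runs]

print("cost per run (USD):", [round(c, 4) for c in costs])
print("Pearson r between time and cost:", pearson(seconds, costs))
```

In this toy data the fastest run is also the most expensive one, which is the kind of mismatch a time-based efficiency score would not surface.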

2025-12-26T19:51:35Z Saurabh Deochake Debajyoti Mukhopadhyay http://arxiv.org/abs/2603.07278v1 LLM-FK: Multi-Agent LLM Reasoning for Foreign Key Detection in Large-Scale Complex Databases 2026-03-07T16:41:02Z

Detecting missing foreign keys (FKs) requires accurately modeling semantic dependencies across database schemas, which conventional heuristic-based methods are fundamentally limited in capturing. We propose LLM-FK, the first fully automated multi-agent framework for FK detection, designed to address three core challenges that hinder naive LLM-based solutions in large-scale complex databases: combinatorial search space explosion, ambiguous inference under limited context, and global inconsistency arising from isolated local predictions. LLM-FK coordinates four specialized agents: a Profiler that decomposes the FK detection problem into the task of validating FK candidate column pairs and prunes the search space via a unique-key-driven schema decomposition strategy; an Interpreter that injects self-augmented domain knowledge; a Refiner that constructs compact structural representations and performs multi-perspective chain-of-thought reasoning; and a Verifier that enforces schema-wide consistency through a holistic conflict resolution strategy. Experiments on five benchmark datasets demonstrate that LLM-FK consistently achieves F1-scores above 93%, surpassing existing baselines by 15% on the large-scale MusicBrainz database, while reducing the candidate search space by two to three orders of magnitude without losing true FKs and maintaining robustness under challenging conditions like missing data. These results demonstrate the effectiveness and scalability of LLM-FK in real-world databases.
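
To make the idea of pruning FK candidate column pairs with unique keys concrete, here is a minimal sketch of a generic inclusion-dependency filter. It is not LLM-FK's Profiler, whose details the abstract does not give: the rule "the referenced column must be unique and must contain every non-null value of the referencing column" is an assumption, and `is_unique` and `fk_candidates` are hypothetical names.

```python
def is_unique(values):
    """True if the non-null values are non-empty and contain no duplicates."""
    non_null = [v for v in values if v is not None]
    return len(non_null) > 0 and len(non_null) == len(set(non_null))


def fk_candidates(tables):
    """Return ((table, column), (ref_table, ref_column)) candidate pairs.

    tables maps table name -> {column name -> list of values}.
    A pair survives only if the referenced column is unique and the
    referencing column's non-null values are a subset of it.
    """
    unique_keys = [(t, c) for t, cols in tables.items()
                   for c, vals in cols.items() if is_unique(vals)]
    candidates = []
    for t, cols in tables.items():
        for c, vals in cols.items():
            child = {v for v in vals if v is not None}
            if not child:
                continue  # an all-null column carries no evidence
            for ref_t, ref_c in unique_keys:
                if (ref_t, ref_c) == (t, c):
                    continue  # a column does not reference itself
                if child <= set(tables[ref_t][ref_c]):
                    candidates.append(((t, c), (ref_t, ref_c)))
    return candidates


# Toy schema (made-up data).
toy = {
    "artist": {"id": [1, 2, 3], "name": ["a", "b", "c"]},
    "album": {"id": [10, 11], "artist_id": [1, 1]},
}
print(fk_candidates(toy))
```

A filter like this only narrows the candidate set; deciding which surviving pairs are true FKs still needs semantic judgment, which is the part the abstract assigns to the later agents.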

2026-03-07T16:41:02Z 28 pages, 13 figures Zijian Tang Ying Zhang Sibo Cai Ruoxuan Wang http://arxiv.org/abs/2603.07268v1 Sketch-Oriented Databases 2026-03-07T15:52:28Z

This paper introduces sketch-oriented databases, a categorical framework that encodes database paradigms as finite-limit sketches and individual databases and schemas as set-valued models. It illustrates the formalism through graph-oriented paradigms such as quivers, RDF triplestores and property graphs. It also shows how common graph features such as labels, attributes, typing, and paths are uniformly captured by sketch constructions. Because paths play an important role in queries, we propose inference rules formalized via localizers to compute useful paths lazily; such localizers are also useful for tasks like database type conformance. Finally, the paper introduces stuttering sketches, whose aim is to facilitate modular composition and scalable model growth: stuttering sketches are finite-limit sketches in which relations are specified by a single limit instead of two nested limits, and the paper proves that finite unions of models of a stuttering sketch are pointwise colimits.

2026-03-07T15:52:28Z 20 pages, 1 appendix, Dominique Duval Rachid Echahed http://arxiv.org/abs/2603.07235v1 Novel Table Search [Technical Report] 2026-03-07T14:36:41Z

Avoiding redundancy in query results has been extensively studied in relational databases and information retrieval, yet its implications for data lakes remain largely unexplored. We bridge this gap by investigating how to discover unionable tables that contribute new information for a given query table in large-scale data lakes. We formally define Novel Table Search (NTS) as the problem of finding tables that are novel with respect to a given query table and identify two desirable properties that any scoring function for NTS should satisfy. We introduce a concrete scoring mechanism designed to maximize syntactic novelty, prove that it satisfies the proposed properties, and show that the associated optimization problem is NP-hard. To address this challenge, we develop an efficient approximation technique based on penalization, i.e., Attribute-Based Novel Table Search (ANTs). We propose three additional NTS variants to achieve syntactic novelty and introduce two evaluation metrics for syntactic novelty. Through extensive experiments, we demonstrate that ANTs outperforms other methods in capturing syntactic novelty across evaluation metrics and various benchmarks, while also achieving the lowest execution time.

2026-03-07T14:36:41Z 20 pages, 2026 IEEE 42nd International Conference on Data Engineering (ICDE) Besat Kassaie Renée J. Miller http://arxiv.org/abs/2603.07146v1 Fine-Grained Table Retrieval Through the Lens of Complex Queries 2026-03-07T10:57:32Z

Enabling question answering over tables and databases in natural language has become a key capability in the democratization of insights from tabular data sources. These systems first require retrieval of data that is relevant to a given natural language query, for which several methods have been introduced. In this work we present and study a table retrieval mechanism devising fine-grained typed query decomposition and global connectivity-awareness (DCTR), to handle the challenges induced by open-domain question answering over relational databases in complex usage contexts. We evaluate the effectiveness of the two mechanisms through the lens of retrieval complexity which we measure along the axes of query- and data complexity. Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.

2026-03-07T10:57:32Z Wojciech Kosiuk Xingyu Ji Yeounoh Chung Fatma Özcan Madelon Hulsebos http://arxiv.org/abs/2603.06952v1 Not All Neighbors Matter: Understanding the Impact of Graph Sparsification on GNN Pipelines 2026-03-07T00:02:33Z

As graphs scale to billions of nodes and edges, graph Machine Learning workloads are constrained by the cost of multi-hop traversals over exponentially growing neighborhoods. While various system-level and algorithmic optimizations have been proposed to accelerate Graph Neural Network (GNN) pipelines, data management and movement remain the primary bottlenecks at scale. In this paper, we explore whether graph sparsification, a well-established technique that reduces edges to create sparser neighborhoods, can serve as a lightweight pre-processing step to address these bottlenecks while preserving accuracy on node classification tasks. We develop an extensible experimental framework that enables systematic evaluation of how different sparsification methods affect the performance and accuracy of GNN models. We conduct the first comprehensive study of GNN training and inference on sparsified graphs, revealing several key findings. First, sparsification often preserves or even improves predictive performance. As an example, random sparsification raises the accuracy of the GAT model by 6.8% on the PubMed graph. Second, benefits increase with scale, substantially accelerating both training and inference. Our results show that the K-Neighbor sparsifier improves model serving performance on the Products graph by 11.7x with only a 0.7% accuracy drop. Importantly, we find that the computational overhead of sparsification is quickly amortized, making it practical for very large graphs.
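
The two sparsifiers named above can be sketched on a plain edge list. The abstract does not define them, so the readings here are assumptions: "random" keeps each edge independently with a fixed probability, and "K-Neighbor" keeps at most K incoming edges per node, sampled uniformly. The function names are hypothetical.

```python
import random
from collections import defaultdict


def random_sparsify(edges, keep_prob, seed=0):
    """Keep each (src, dst) edge independently with probability keep_prob."""
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() < keep_prob]


def k_neighbor_sparsify(edges, k, seed=0):
    """Keep at most k incoming edges per destination node, chosen uniformly."""
    rng = random.Random(seed)
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(src)
    kept = []
    for dst, srcs in incoming.items():
        chosen = srcs if len(srcs) <= k else rng.sample(srcs, k)
        kept.extend((src, dst) for src in chosen)
    return kept


# Toy graph (made-up): node 0 has five in-neighbors, node 1 has one.
edges = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (0, 1)]
print(len(random_sparsify(edges, keep_prob=0.5)), "edges after random sparsification")
print(len(k_neighbor_sparsify(edges, k=2)), "edges after K-Neighbor sparsification")
```

Both functions run once as a pre-processing pass over the edge list, which is consistent with the abstract's point that the overhead is paid up front and then amortized over training and inference.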

2026-03-07T00:02:33Z Yuhang Song Naima Abrar Shami Romaric Duvignau Vasiliki Kalavri http://arxiv.org/abs/2603.06405v1 Tag-specific Regret Minimization Problem in Outdoor Advertising 2026-03-06T15:47:34Z

Recently, out-of-home advertising has become a popular marketing technique, due to its higher return on investment. E-commerce houses approach the influence provider to achieve effective advertising through their tags (advertising content), influence demand, and budgets. The influence provider's goal will be to make proper tag allocations, meet the required influence demand within the budget constraint, and minimize total regret. We formalize this as a combinatorial optimization problem and refer to it as \textsc{Tag-specific Regret Minimization in Outdoor Advertising (TRMOA)}. We show that TRMOA is NP-hard and inapproximable within a constant factor. The regret model we consider is non-monotone and non-submodular, and the simple greedy approach is ineffective. We introduce a fairness-aware greedy round-robin approach that reduces regret with balanced allocation across advertisers. To improve on this, we also introduce randomized greedy and local search algorithms. We have experimented with all the methodologies using real-world trajectory and billboard datasets to show the effectiveness and efficiency of the solution methodologies.

2026-03-06T15:47:34Z 11 Pages Dildar Ali Abishek Salaria Ansh Jasrotia Suman Banerjee http://arxiv.org/abs/2602.21480v3 Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"? 2026-03-06T14:09:39Z

Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost.

2026-02-25T01:12:35Z 16 pages, 7 figures Germán T. Eizaguirre Lars Tissen Marc Sánchez-Artigas http://arxiv.org/abs/2603.06159v1 Efficient Vector Search in the Wild: One Model for Multi-K Queries 2026-03-06T11:09:32Z

Learned top-K search is a promising approach for serving vector queries with both high accuracy and performance. However, current models trained for a specific K value fail to generalize to real-world multi-K queries: they suffer from accuracy degradation (for larger Ks) and performance loss (for smaller Ks). Training the model to generalize on different Ks requires orders of magnitude more preprocessing time and is not suitable for serving vector queries in the wild. We present OMEGA, a K-generalizable learned top-K search method that simultaneously achieves high accuracy, high performance, and low preprocessing cost for multi-K vector queries. The key idea is that a base model properly trained on K=1 with our trajectory-based features can be used to accurately predict larger Ks with a dynamic refinement procedure and smaller Ks with minimal performance loss. To make our refinements efficient, we further leverage the statistical properties of top-K searches to reduce excessive model invocations. Extensive evaluations on multiple public and production datasets show that, under the same preprocessing budgets, OMEGA achieves 6-33% lower average latency compared to state-of-the-art learned search methods, while all systems achieve the same recall target. With only 16-30% of the preprocessing time, OMEGA attains 1.01-1.28x of the optimal average latency of these baselines.

2026-03-06T11:09:32Z Yifan Peng Jiafei Fan Xingda Wei Sijie Shen Rong Chen Jianning Wang Xiaojian Luo Wenyuan Yu Jingren Zhou Haibo Chen http://arxiv.org/abs/2603.04169v2 Efficient Query Rewrite Rule Discovery via Standardized Enumeration and Learning-to-Rank(extend) 2026-03-06T03:11:26Z

Query rewriting is essential for database performance optimization, but existing automated rule enumeration methods suffer from exponential search spaces, severe redundancy, and poor scalability, especially when handling complex query plans with five or more nodes, where a node represents an operator in the plan tree. We present SLER, a scalable system that enables efficient and effective rewrite rule discovery by combining standardized template enumeration with a learning-to-rank approach. SLER uses standardized templates, abstractions of query plans with operator structures preserved but data-specific details removed, to eliminate structural redundancies and drastically reduce the search space. A learning-to-rank model guides enumeration by pre-filtering the most promising template pairs, enabling scalable rule generation for large node templates. Evaluated on over 11,000 real-world SQL queries from both open-source and commercial workloads, SLER has automatically constructed a rewrite rule repository exceeding 1 million rules, the largest empirically validated rewrite rule library to date. Notably, at the scale of one million rules, SLER supports query plan templates with complexity up to channel level depth. This unprecedented scale opens the door to discovering highly intricate transformations across diverse query patterns. Critically, SLER's template-driven design and learned ranking mechanism are inherently extensible, allowing seamless integration of new and complex operators, paving the way for next-generation optimizers powered by comprehensive, adaptive rule spaces.

2026-03-04T15:25:20Z Yuan Zhang Yuxing Chen Yuekun Yu Jinbin Huang Rui Mao Anqun Pan Lixiong Zheng Jianbin Qin http://arxiv.org/abs/2603.05704v1 Querying with Conflicts of Interest 2026-03-05T21:57:29Z

Conflicts of interest often arise between data sources and their users regarding how the users' information needs should be interpreted by the data source. For example, an online product search might be biased towards presenting certain products higher in its list of results to improve its revenue, which may not follow the user's desired ranking expressed in their query. The research community has proposed schemes for data systems to implement to ensure unbiased results. However, data systems and services usually have little or no incentive to implement these measures, e.g., these biases often increase their profits. In this paper, we propose a novel formal framework for querying in settings where the data source has incentives to return biased answers intentionally due to the conflict of interest between the user and the data source. We propose efficient algorithms to detect whether it is possible for users to extract relevant information from biased data sources. We propose methods to detect biased information in the results of a query efficiently. We also propose algorithms to reformulate input queries to increase the amount of relevant information in the returned results over biased data sources. Using experiments on real-world datasets, we show that our algorithms are efficient and return relevant information over large data.

2026-03-05T21:57:29Z Nischal Aryal Arash Termehchy Marianne Winslett http://arxiv.org/abs/2603.04184v2 Publication and Maintenance of Relational Data in Enterprise Knowledge Graphs (Revised Version) 2026-03-05T20:53:48Z

Enterprise knowledge graphs (EKGs) are a novel paradigm for consolidating and semantically integrating large numbers of heterogeneous data sources into a comprehensive dataspace. The main goal of an EKG is to provide a data layer that is semantically connected to enterprise data, so that applications can have integrated access to enterprise data sources through that semantic layer. To make legacy relational data sources accessible through the organization's knowledge graph, it is necessary to create an RDF view of the underlying relational data (RDB2RDF view). An RDB2RDF view can be materialized to improve query performance and data availability. However, a materialized RDB2RDF view must be continuously maintained to reflect updates over the relational database. This article proposes a formal framework for constructing the materialized data graph for an RDB2RDF view and for incrementally maintaining the view's data graph. The article also presents an architecture and algorithms for implementing the proposed framework.
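
The difference between rematerializing an RDB2RDF view and maintaining it incrementally can be sketched as follows. This is a simplified illustration, not the article's framework: the one-triple-per-non-key-column mapping, the `http://example.org/` URIs, and the names `row_to_triples` and `MaterializedView` are all assumptions.

```python
BASE = "http://example.org/"  # hypothetical namespace for illustration


def row_to_triples(table, pk, row):
    """Map one relational row to a set of (subject, predicate, object) triples."""
    subject = f"{BASE}{table}/{row[pk]}"
    return {(subject, f"{BASE}{table}#{col}", str(val))
            for col, val in row.items()
            if col != pk and val is not None}


class MaterializedView:
    """Holds the materialized triples and applies per-row deltas."""

    def __init__(self, table, pk):
        self.table, self.pk = table, pk
        self.triples = set()

    def insert(self, row):
        self.triples |= row_to_triples(self.table, self.pk, row)

    def delete(self, row):
        # Safe as a set difference because each triple's subject embeds the
        # row's primary key, so no other row can derive the same triple.
        self.triples -= row_to_triples(self.table, self.pk, row)

    def update(self, old_row, new_row):
        old = row_to_triples(self.table, self.pk, old_row)
        new = row_to_triples(self.table, self.pk, new_row)
        self.triples -= old - new  # drop triples that no longer hold
        self.triples |= new - old  # add triples that are newly derived


view = MaterializedView("customer", "id")
before = {"id": 1, "name": "Ann", "city": "Lisboa"}
after = {"id": 1, "name": "Ann", "city": "Fortaleza"}
view.insert(before)
view.update(before, after)
print(sorted(view.triples))
```

Only the triples for the changed column are touched on update; the unchanged `name` triple is never removed or re-added. Mappings in which several rows can derive the same triple would need derivation counts instead of plain set operations.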

2026-03-04T15:37:45Z Vânia Maria Ponte Vidal Departamento de Computação, UFC, Fortaleza, Brazil Valéria Magalhães Pequeno TechLab, Departamento de Ciências e Tecnologias, UAL, Lisboa, Portugal Marco Antonio Casanova Instituto Tecgraf, Puc-Rio, Rio de Janeiro, Brazil Narciso Arruda Departamento de Computação, UFC, Fortaleza, Brazil Carlos Brito Departamento de Computação, UFC, Fortaleza, Brazil http://arxiv.org/abs/2504.20047v3 HCT-QA: A Benchmark for Question Answering on Human-Centric Tables 2026-03-05T20:10:34Z

Tabular data embedded in PDF files, web pages, and other types of documents is prevalent in various domains. These tables, which we call human-centric tables (HCTs for short), are dense in information but often exhibit complex structural and semantic layouts. To query these HCTs, some existing solutions focus on transforming them into relational formats. However, they fail to handle the diverse and complex layouts of HCTs, making them not amenable to easy querying with SQL-based approaches. Another emerging option is to use Large Language Models (LLMs) and Vision Language Models (VLMs). However, there is a lack of standard evaluation benchmarks to measure and compare the performance of models to query HCTs using natural language. To address this gap, we propose the HumanCentric Tables Question-Answering extensive benchmark (HCTQA) consisting of thousands of HCTs with several thousands of natural language questions with their respective answers. More specifically, HCT-QA includes 1,880 real-world HCTs with 9,835 QA pairs in addition to 4,679 synthetic HCTs with 67.7K QA pairs. Also, we show through extensive experiments the performance of 25 and 9 different LLMs and VLMs, respectively, in answering HCT-QA's questions. In addition, we show how finetuning an LLM on HCT-QA improves F1 scores by up to 25 percentage points compared to the off-the-shelf model. Compared to existing benchmarks, HCT-QA stands out for its broad complexity and diversity of covered HCTs and generated questions, its comprehensive metadata enabling deeper insight and analysis, and its novel synthetic data and QA generator.

2025-03-09T11:02:11Z Mohammad S. Ahmad Zan A. Naeem Michaël Aupetit Ahmed Elmagarmid Mohamed Eltabakh Xiaosong Ma Mourad Ouzzani Chaoyi Ruan Hani Al-Sayeh http://arxiv.org/abs/2603.05632v1 Space-efficient B-tree Implementation for Memory-Constrained Flash Embedded Devices 2026-03-05T19:42:50Z

Small devices collecting data for agricultural, environmental, and industrial monitoring enable Internet of Things (IoT) applications. Given their critical role in data collection, there is a need for optimizations to improve on-device data processing. Edge device computing allows processing of the data closer to where it is collected and reduces the amount of network transmissions. The B-tree has been optimized for flash storage on servers and solid-state drives, but these optimizations often require hardware and memory resources not available on embedded devices. The contribution of this work is the development and experimental evaluation of multiple variants for B-trees on memory-constrained embedded devices. Experimental results demonstrate that even the smallest devices can perform efficient B-tree indexing, and there is a significant performance advantage for using storage-specific optimizations.

2026-03-05T19:42:50Z Extended version of CoopIS 2024 paper. 19 pages Nadir Ould-Khessal Scott Fazackerley Ramon Lawrence http://arxiv.org/abs/2506.06541v3 KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes 2026-03-05T19:25:53Z

Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their abilities to design and execute complex pipelines that solve these data-lake-to-insight challenges remain unclear. We introduce KramaBench which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capabilities of AI systems to solve challenges which require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework assessing the pipeline design and individual data task implementation abilities of AI systems. We evaluate 8 LLMs using our single-agent reference framework DS-Guru, alongside both open- and closed-source single- and multi-agent systems, and find that while current agentic systems may handle isolated data-science tasks and generate plausible draft pipelines, they struggle with producing working end-to-end pipelines. On KramaBench, the best system reaches only 55% end-to-end accuracy in the full data-lake setting. Even with perfect retrieval, the accuracy tops out at 62%. Leading LLMs can identify up to 42% of important data tasks but can only fully implement 20% of individual data tasks. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.

2025-06-06T21:18:45Z Eugenie Lai Gerardo Vitagliano Ziyu Zhang Om Chabra Sivaprasad Sudhir Anna Zeng Anton A. Zabreyko Chenning Li Ferdi Kossmann Jialin Ding Jun Chen Markos Markakis Matthew Russo Weiyang Wang Ziniu Wu Michael J. Cafarella Lei Cao Samuel Madden Tim Kraska