http://arxiv.org/abs/2512.22364v2
Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL
2026-03-07T21:01:20Z
While Text-to-SQL systems achieve high accuracy, existing efficiency metrics like the Valid Efficiency Score prioritize execution time, a metric we show is fundamentally decoupled from consumption-based cloud billing. This paper evaluates cloud query execution cost trade-offs between reasoning and non-reasoning Large Language Models by performing 180 Text-to-SQL query executions across six LLMs on Google BigQuery using the 230 GB StackOverflow dataset. Our analysis reveals that reasoning models process 44.5% fewer bytes than non-reasoning counterparts while maintaining equivalent correctness at 96.7% to 100%, and that execution time correlates weakly with query cost at $r=0.16$, indicating that speed optimization does not imply cost efficiency. Non-reasoning models also exhibit extreme cost variance of up to 3.4$\times$, producing outliers exceeding 36 GB per query, over 20$\times$ the best model's 1.8 GB average, due to missing partition filters and inefficient joins. We identify these prevalent inefficiency patterns and provide deployment guidelines to mitigate financial risks in cost-sensitive enterprise environments.
2025-12-26T19:51:35Z
Saurabh Deochake, Debajyoti Mukhopadhyay

http://arxiv.org/abs/2603.07278v1
LLM-FK: Multi-Agent LLM Reasoning for Foreign Key Detection in Large-Scale Complex Databases
2026-03-07T16:41:02Z
Detecting missing foreign keys (FKs) requires accurately modeling semantic dependencies across database schemas, which conventional heuristic-based methods are fundamentally limited in capturing.
We propose LLM-FK, the first fully automated multi-agent framework for FK detection, designed to address three core challenges that hinder naive LLM-based solutions in large-scale complex databases: combinatorial search space explosion, ambiguous inference under limited context, and global inconsistency arising from isolated local predictions. LLM-FK coordinates four specialized agents: a Profiler that decomposes the FK detection problem into the task of validating FK candidate column pairs and prunes the search space via a unique-key-driven schema decomposition strategy; an Interpreter that injects self-augmented domain knowledge; a Refiner that constructs compact structural representations and performs multi-perspective chain-of-thought reasoning; and a Verifier that enforces schema-wide consistency through a holistic conflict resolution strategy. Experiments on five benchmark datasets demonstrate that LLM-FK consistently achieves F1-scores above 93%, surpassing existing baselines by 15% on the large-scale MusicBrainz database, while reducing the candidate search space by two to three orders of magnitude without losing true FKs and maintaining robustness under challenging conditions like missing data. These results demonstrate the effectiveness and scalability of LLM-FK in real-world databases.
2026-03-07T16:41:02Z
28 pages, 13 figures
Zijian Tang, Ying Zhang, Sibo Cai, Ruoxuan Wang

http://arxiv.org/abs/2603.07268v1
Sketch-Oriented Databases
2026-03-07T15:52:28Z
This paper introduces sketch-oriented databases, a categorical
framework that encodes database paradigms as finite-limit sketches
and individual databases and schemas as set-valued models. It
illustrates the formalism through graph-oriented paradigms such as
quivers, RDF triplestores and property graphs. It also
shows how common graph features such as labels, attributes, typing, and
paths, are uniformly captured by sketch constructions. Because paths
play an important role in queries, we propose inference rules
formalized via localizers to compute useful paths lazily; such
localizers are also useful for tasks like database type conformance.
Finally, the paper introduces stuttering sketches, whose aim is to
facilitate modular composition and scalable model growth: stuttering
sketches are finite-limit sketches in which relations are specified
by a single limit instead of two nested limits, and the paper proves
that finite unions of models of a stuttering sketch are pointwise
colimits.
2026-03-07T15:52:28Z
20 pages, 1 appendix
Dominique Duval, Rachid Echahed

http://arxiv.org/abs/2603.07235v1
Novel Table Search [Technical Report]
2026-03-07T14:36:41Z
Avoiding redundancy in query results has been extensively studied in relational databases and information retrieval, yet its implications for data lakes remain largely unexplored. We bridge this gap by investigating how to discover unionable tables that contribute new information for a given query table in large-scale data lakes. We formally define Novel Table Search (NTS) as the problem of finding tables that are novel with respect to a given query table and identify two desirable properties that any scoring function for NTS should satisfy. We introduce a concrete scoring mechanism designed to maximize syntactic novelty, prove that it satisfies the proposed properties, and show that the associated optimization problem is NP-hard. To address this challenge, we develop an efficient approximation technique based on penalization, i.e., Attribute-Based Novel Table Search (ANTs). We propose three additional NTS variants to achieve syntactic novelty and introduce two evaluation metrics for syntactic novelty. Through extensive experiments, we demonstrate that ANTs outperforms other methods in capturing syntactic novelty across evaluation metrics and various benchmarks, while also achieving the lowest execution time.
2026-03-07T14:36:41Z
20 pages, 2026 IEEE 42nd International Conference on Data Engineering (ICDE)
Besat Kassaie, Renée J. Miller

http://arxiv.org/abs/2603.07146v1
Fine-Grained Table Retrieval Through the Lens of Complex Queries
2026-03-07T10:57:32Z
Enabling question answering over tables and databases in natural language has become a key capability in the democratization of insights from tabular data sources. These systems first require retrieval of data that is relevant to a given natural language query, for which several methods have been introduced.
In this work we present and study a table retrieval mechanism devising fine-grained typed query decomposition and global connectivity-awareness (DCTR), to handle the challenges induced by open-domain question answering over relational databases in complex usage contexts. We evaluate the effectiveness of the two mechanisms through the lens of retrieval complexity, which we measure along the axes of query and data complexity. Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
2026-03-07T10:57:32Z
Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos

http://arxiv.org/abs/2603.06952v1
Not All Neighbors Matter: Understanding the Impact of Graph Sparsification on GNN Pipelines
2026-03-07T00:02:33Z
As graphs scale to billions of nodes and edges, graph Machine Learning workloads are constrained by the cost of multi-hop traversals over exponentially growing neighborhoods. While various system-level and algorithmic optimizations have been proposed to accelerate Graph Neural Network (GNN) pipelines, data management and movement remain the primary bottlenecks at scale. In this paper, we explore whether graph sparsification, a well-established technique that reduces edges to create sparser neighborhoods, can serve as a lightweight pre-processing step to address these bottlenecks while preserving accuracy on node classification tasks.
We develop an extensible experimental framework that enables systematic evaluation of how different sparsification methods affect the performance and accuracy of GNN models. We conduct the first comprehensive study of GNN training and inference on sparsified graphs, revealing several key findings. First, sparsification often preserves or even improves predictive performance. As an example, random sparsification raises the accuracy of the GAT model by 6.8% on the PubMed graph. Second, benefits increase with scale, substantially accelerating both training and inference. Our results show that the K-Neighbor sparsifier improves model serving performance on the Products graph by 11.7x with only a 0.7% accuracy drop. Importantly, we find that the computational overhead of sparsification is quickly amortized, making it practical for very large graphs.
2026-03-07T00:02:33Z
Yuhang Song, Naima Abrar Shami, Romaric Duvignau, Vasiliki Kalavri

http://arxiv.org/abs/2603.06405v1
Tag-specific Regret Minimization Problem in Outdoor Advertising
2026-03-06T15:47:34Z
Recently, out-of-home advertising has become a popular marketing technique, due to its higher return on investment. E-commerce houses approach the influence provider to achieve effective advertising through their tags (advertising content), influence demand, and budgets. The influence provider's goal will be to make proper tag allocations, meet the required influence demand within the budget constraint, and minimize total regret. We formalize this as a combinatorial optimization problem and refer to it as Tag-specific Regret Minimization in Outdoor Advertising (TRMOA). We show that TRMOA is NP-hard and inapproximable within a constant factor. The regret model we consider is non-monotone and non-submodular, and the simple greedy approach is ineffective. We introduce a fairness-aware greedy round-robin approach that reduces regret with balanced allocation across advertisers.
To improve on this approach, we also introduce randomized greedy and local search algorithms. We have experimented with all the methodologies using real-world trajectory and billboard datasets to show the effectiveness and efficiency of the solution methodologies.
2026-03-06T15:47:34Z
11 Pages
Dildar Ali, Abishek Salaria, Ansh Jasrotia, Suman Banerjee

http://arxiv.org/abs/2602.21480v3
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
2026-03-06T14:09:39Z
Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost.
2026-02-25T01:12:35Z
16 pages, 7 figures
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

http://arxiv.org/abs/2603.06159v1
Efficient Vector Search in the Wild: One Model for Multi-K Queries
2026-03-06T11:09:32Z
Learned top-K search is a promising approach for serving vector queries with both high accuracy and performance. However, current models trained for a specific K value fail to generalize to real-world multi-K queries: they suffer from accuracy degradation (for larger Ks) and performance loss (for smaller Ks). Training the model to generalize on different Ks requires orders of magnitude more preprocessing time and is not suitable for serving vector queries in the wild. We present OMEGA, a K-generalizable learned top-K search method that simultaneously achieves high accuracy, high performance, and low preprocessing cost for multi-K vector queries. The key idea is that a base model properly trained on K=1 with our trajectory-based features can be used to accurately predict larger Ks with a dynamic refinement procedure and smaller Ks with minimal performance loss. To make our refinements efficient, we further leverage the statistical properties of top-K searches to reduce excessive model invocations. Extensive evaluations on multiple public and production datasets show that, under the same preprocessing budgets, OMEGA achieves 6-33% lower average latency compared to state-of-the-art learned search methods, while all systems achieve the same recall target.
With only 16-30% of the preprocessing time, OMEGA attains 1.01-1.28x of the optimal average latency of these baselines.
2026-03-06T11:09:32Z
Yifan Peng, Jiafei Fan, Xingda Wei, Sijie Shen, Rong Chen, Jianning Wang, Xiaojian Luo, Wenyuan Yu, Jingren Zhou, Haibo Chen

http://arxiv.org/abs/2603.04169v2
Efficient Query Rewrite Rule Discovery via Standardized Enumeration and Learning-to-Rank(extend)
2026-03-06T03:11:26Z
Query rewriting is essential for database performance optimization, but existing automated rule enumeration methods suffer from exponential search spaces, severe redundancy, and poor scalability, especially when handling complex query plans with five or more nodes, where a node represents an operator in the plan tree. We present SLER, a scalable system that enables efficient and effective rewrite rule discovery by combining standardized template enumeration with a learning-to-rank approach. SLER uses standardized templates, abstractions of query plans with operator structures preserved but data-specific details removed, to eliminate structural redundancies and drastically reduce the search space. A learning-to-rank model guides enumeration by pre-filtering the most promising template pairs, enabling scalable rule generation for large node templates. Evaluated on over 11000 real-world SQL queries from both open-source and commercial workloads, SLER has automatically constructed a rewrite rule repository exceeding 1 million rules - the largest empirically validated rewrite rule library to date. Notably, at the scale of one million rules, SLER supports query plan templates with complexity up to channel level depth. This unprecedented scale opens the door to discovering highly intricate transformations across diverse query patterns.
Critically, SLER's template-driven design and learned ranking mechanism are inherently extensible, allowing seamless integration of new and complex operators, paving the way for next-generation optimizers powered by comprehensive, adaptive rule spaces.
2026-03-04T15:25:20Z
Yuan Zhang, Yuxing Chen, Yuekun Yu, Jinbin Huang, Rui Mao, Anqun Pan, Lixiong Zheng, Jianbin Qin

http://arxiv.org/abs/2603.05704v1
Querying with Conflicts of Interest
2026-03-05T21:57:29Z
Conflicts of interest often arise between data sources and their users regarding how the users' information needs should be interpreted by the data source. For example, an online product search might be biased towards presenting certain products higher in its list of results to improve its revenue, which may not follow the user's desired ranking expressed in their query. The research community has proposed schemes for data systems to implement to ensure unbiased results. However, data systems and services usually have little or no incentive to implement these measures, e.g., these biases often increase their profits. In this paper, we propose a novel formal framework for querying in settings where the data source has incentives to return biased answers intentionally due to the conflict of interest between the user and the data source. We propose efficient algorithms to detect whether it is possible for users to extract relevant information from biased data sources. We propose methods to detect biased information in the results of a query efficiently. We also propose algorithms to reformulate input queries to increase the amount of relevant information in the returned results over biased data sources.
Using experiments on real-world datasets, we show that our algorithms are efficient and return relevant information over large data.
2026-03-05T21:57:29Z
Nischal Aryal, Arash Termehchy, Marianne Winslett

http://arxiv.org/abs/2603.04184v2
Publication and Maintenance of Relational Data in Enterprise Knowledge Graphs (Revised Version)
2026-03-05T20:53:48Z
Enterprise knowledge graphs (EKGs) are a novel paradigm for consolidating and semantically integrating large numbers of heterogeneous data sources into a comprehensive dataspace. The main goal of an EKG is to provide a data layer that is semantically connected to enterprise data, so that applications can have integrated access to enterprise data sources through that semantic layer. To make legacy relational data sources accessible through the organization's knowledge graph, it is necessary to create an RDF view of the underlying relational data (RDB2RDF view). An RDB2RDF view can be materialized to improve query performance and data availability. However, a materialized RDB2RDF view must be continuously maintained to reflect updates over the relational database. This article proposes a formal framework for constructing the materialized data graph for an RDB2RDF view and for incrementally maintaining the view's data graph. The article also presents an architecture and algorithms for implementing the proposed framework.
2026-03-04T15:37:45Z
Vânia Maria Ponte Vidal (Departamento de Computação, UFC, Fortaleza, Brazil)
Valéria Magalhães Pequeno (TechLab, Departamento de Ciências e Tecnologias, UAL, Lisboa, Portugal)
Marco Antonio Casanova (Instituto Tecgraf, Puc-Rio, Rio de Janeiro, Brazil)
Narciso Arruda (Departamento de Computação, UFC, Fortaleza, Brazil)
Carlos Brito (Departamento de Computação, UFC, Fortaleza, Brazil)

http://arxiv.org/abs/2504.20047v3
HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
2026-03-05T20:10:34Z
Tabular data embedded in PDF files, web pages, and other types of documents is prevalent in various domains.
These tables, which we call human-centric tables (HCTs for short), are dense in information but often exhibit complex structural and semantic layouts. To query these HCTs, some existing solutions focus on transforming them into relational formats. However, they fail to handle the diverse and complex layouts of HCTs, making them not amenable to easy querying with SQL-based approaches. Another emerging option is to use Large Language Models (LLMs) and Vision Language Models (VLMs). However, there is a lack of standard evaluation benchmarks to measure and compare the performance of models to query HCTs using natural language. To address this gap, we propose the HumanCentric Tables Question-Answering extensive benchmark (HCT-QA) consisting of thousands of HCTs with several thousands of natural language questions with their respective answers. More specifically, HCT-QA includes 1,880 real-world HCTs with 9,835 QA pairs in addition to 4,679 synthetic HCTs with 67.7K QA pairs. Also, we show through extensive experiments the performance of 25 and 9 different LLMs and VLMs, respectively, in answering HCT-QA's questions. In addition, we show how finetuning an LLM on HCT-QA improves F1 scores by up to 25 percentage points compared to the off-the-shelf model. Compared to existing benchmarks, HCT-QA stands out for its broad complexity and diversity of covered HCTs and generated questions, its comprehensive metadata enabling deeper insight and analysis, and its novel synthetic data and QA generator.
2025-03-09T11:02:11Z
Mohammad S. Ahmad, Zan A. Naeem, Michaël Aupetit, Ahmed Elmagarmid, Mohamed Eltabakh, Xiaosong Ma, Mourad Ouzzani, Chaoyi Ruan, Hani Al-Sayeh

http://arxiv.org/abs/2603.05632v1
Space-efficient B-tree Implementation for Memory-Constrained Flash Embedded Devices
2026-03-05T19:42:50Z
Small devices collecting data for agricultural, environmental, and industrial monitoring enable Internet of Things (IoT) applications.
Given their critical role in data collection, there is a need for optimizations to improve on-device data processing. Edge device computing allows processing of the data closer to where it is collected and reduces the amount of network transmissions. The B-tree has been optimized for flash storage on servers and solid-state drives, but these optimizations often require hardware and memory resources not available on embedded devices. The contribution of this work is the development and experimental evaluation of multiple variants for B-trees on memory-constrained embedded devices. Experimental results demonstrate that even the smallest devices can perform efficient B-tree indexing, and there is a significant performance advantage for using storage-specific optimizations.
2026-03-05T19:42:50Z
Extended version of CoopIS 2024 paper. 19 pages
Nadir Ould-Khessal, Scott Fazackerley, Ramon Lawrence

http://arxiv.org/abs/2506.06541v3
KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
2026-03-05T19:25:53Z
Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their abilities to design and execute complex pipelines that solve these data-lake-to-insight challenges remain unclear. We introduce KramaBench which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capabilities of AI systems to solve challenges which require automated orchestration of different data tasks.
KramaBench also features a comprehensive evaluation framework assessing the pipeline design and individual data task implementation abilities of AI systems. We evaluate 8 LLMs using our single-agent reference framework DS-Guru, alongside both open- and closed-source single- and multi-agent systems, and find that while current agentic systems may handle isolated data-science tasks and generate plausible draft pipelines, they struggle with producing working end-to-end pipelines. On KramaBench, the best system reaches only 55% end-to-end accuracy in the full data-lake setting. Even with perfect retrieval, the accuracy tops out at 62%. Leading LLMs can identify up to 42% of important data tasks but can only fully implement 20% of individual data tasks. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.
2025-06-06T21:18:45Z
Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, Sivaprasad Sudhir, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska