https://arxiv.org/api/R1aZHo/A1bMbO0qQKr0DdmbHJT4 2026-03-20T22:08:03Z 11369 120 15 http://arxiv.org/abs/2603.08443v1 LLM-Driven Online Aggregation for Unstructured Text Analytics 2026-03-09T14:40:17Z

Large Language Models (LLMs) exhibit strong capabilities in text processing, and recent research has augmented SQL and DataFrame with LLM-powered semantic operators for data analysis. However, LLM-based data processing is hindered by slower token generation speeds compared to relational queries. To enhance real-time responsiveness, we propose OLLA, an LLM-driven online aggregation framework that accelerates semantic processing within relational queries. In contrast to batch-processing systems that yield results only after the entire dataset is processed, our approach incrementally transforms text into a structured data stream and applies online aggregation to provide progressive output. To enhance our online aggregation process, we introduce a semantic stratified sampling approach that improves data selection and expedites convergence to the ground truth. Evaluations show that OLLA reaches a 1% error bound relative to labeled ground truth using less than 4% of the full-data time. It achieves speedups ranging from 1.6$\times$ to 38$\times$ across diverse domains, measured by comparing the time to reach a 5% error bound with the full-data time. We release our code at https://github.com/olla-project/llm-online-agg.git.
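
The progressive-output idea can be illustrated with a minimal sketch of stratified online aggregation. It is not OLLA's implementation: the function name `online_stratified_mean`, the `label_fn` callback (a stand-in for an LLM call that turns a text row into a number), the batch size, and the normal-approximation interval are all assumptions made for illustration, and every stratum is assumed to be non-empty.

```python
import math
import random


def online_stratified_mean(strata, label_fn, batch=10, z=1.96, seed=0):
    """Yield (rows_processed, estimate, half_width) after each sampling round.

    strata:   dict mapping stratum name -> list of rows (assumed non-empty)
    label_fn: hypothetical stand-in for an LLM call, row -> float
    """
    rng = random.Random(seed)
    total = sum(len(rows) for rows in strata.values())
    # Shuffle each stratum once so that every prefix is a random sample.
    order = {k: rng.sample(rows, len(rows)) for k, rows in strata.items()}
    seen = {k: [] for k in strata}
    while any(len(seen[k]) < len(order[k]) for k in strata):
        for k in strata:
            start = len(seen[k])
            seen[k].extend(label_fn(r) for r in order[k][start:start + batch])
        estimate, variance = 0.0, 0.0
        for k in strata:
            n, size = len(seen[k]), len(order[k])
            weight = size / total
            mean = sum(seen[k]) / n
            s2 = sum((v - mean) ** 2 for v in seen[k]) / (n - 1) if n > 1 else 0.0
            fpc = 1 - n / size  # finite-population correction
            estimate += weight * mean
            variance += weight * weight * fpc * s2 / n
        yield sum(len(v) for v in seen.values()), estimate, z * math.sqrt(variance)
```

Each yielded triple is a usable intermediate answer. Once every stratum is exhausted, the finite-population correction drives the half-width to zero and the estimate equals the exact mean.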

2026-03-09T14:40:17Z DASFAA 2026 Chao Hui Weizheng Lu Yanjie Gao Lingfeng Xiong Yunhai Wang Yueguo Chen http://arxiv.org/abs/2603.08337v1 PRIME: Efficient Algorithm for Token Graph Routing Problem 2026-03-09T12:54:00Z

Optimizing asset exchanges on blockchain-driven platforms poses a novel and challenging graph query optimization problem. In this model, assets represent vertices and exchanges form edges, recasting the graph query task as a routing problem over a large-scale, dynamic graph. However, the existing solutions fail to solve the problem efficiently due to the non-linear nature of the edge weights defined by a concave swap function. To address the challenge, we propose PRIME, a two-stage iterative graph algorithm designed for the Token Graph Routing Problem (TGRP). The first stage employs a pruned graph search to efficiently identify a set of high-potential routing paths. The second stage formulates the allocation task as a strongly convex optimization problem, which we solve using our novel Adaptive Sign Gradient Method (ASGM) with a linear convergence rate. Extensive experiments on real-world Ethereum data confirm PRIME's advantages over industry baselines. PRIME consistently outperforms the widely-used Uniswap routing algorithm, achieving up to 8.42 basis points (bps) better execution prices on large trades while reducing computation by up to 96.7%. The practicality of PRIME is further validated by its deployment in hedge fund production environments, demonstrating its viability as a scalable graph query processing solution for high-frequency decentralized markets.
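
To see why a concave swap function makes allocation across paths matter, consider a minimal sketch. It assumes a constant-product pool with a proportional fee as the swap function (the abstract only says "concave"), and it uses a brute-force grid search in place of the paper's ASGM. The names `swap_out` and `best_split` are hypothetical.

```python
def swap_out(dx, x, y, fee=0.003):
    """Output amount for input dx against reserves (x, y) in an assumed
    constant-product pool; the result is concave in dx."""
    dx_eff = dx * (1 - fee)
    return y * dx_eff / (x + dx_eff)


def best_split(total, pool_a, pool_b, steps=1000):
    """Grid-search the fraction of `total` routed through pool_a.

    Returns (fraction, combined_output). This is a brute-force stand-in,
    not the Adaptive Sign Gradient Method named in the abstract.
    """
    best_fraction, best_output = 0.0, -1.0
    for i in range(steps + 1):
        fraction = i / steps
        output = (swap_out(fraction * total, *pool_a)
                  + swap_out((1 - fraction) * total, *pool_b))
        if output > best_output:
            best_fraction, best_output = fraction, output
    return best_fraction, best_output
```

For two identical pools, the best split is an even one, and the combined output exceeds what a single pool returns for the whole trade, because each pool's marginal price worsens as more is pushed through it.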

2026-03-09T12:54:00Z 16 pages, 6 figures. A short version of this paper will appear in the 42nd IEEE International Conference on Data Engineering (ICDE '26) Haotian Xu Yuqing Zhu Yuming Huang Jing Tang http://arxiv.org/abs/2411.10229v3 Optimally Rewriting Formulas and Database Queries: A Confluence of Term Rewriting, Structural Decomposition, and Complexity 2026-03-09T12:29:00Z

A central computational task in database theory, finite model theory, and computer science at large is the evaluation of a first-order sentence on a finite structure. In the context of this task, the \emph{width} of a sentence, defined as the maximum number of free variables over all subformulas, has been established as a crucial measure, where minimizing width of a sentence (while retaining logical equivalence) is considered highly desirable. An undecidability result rules out the possibility of an algorithm that, given a first-order sentence, returns a logically equivalent sentence of minimum width; this result motivates the study of width minimization via syntactic rewriting rules, which is this article's focus. For a number of common rewriting rules (which are known to preserve logical equivalence), including rules that allow for the movement of quantifiers, we present an algorithm that, given a positive first-order sentence $φ$, outputs the minimum-width sentence obtainable from $φ$ via application of these rules. We thus obtain a complete algorithmic understanding of width minimization up to the studied rules; this result is the first one -- of which we are aware -- that establishes this type of understanding in such a general setting. Our result builds on the theory of term rewriting and establishes an interface among this theory, query evaluation, and structural decomposition theory.
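
The width measure defined above can be made concrete with a small sketch. It is restricted, as a simplifying assumption, to formulas built from atoms, conjunction, disjunction, and existential quantification; the class and function names are hypothetical and do not reproduce the paper's algorithm.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    rel: str
    vars: Tuple[str, ...]


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Exists:
    var: str
    body: "Formula"


Formula = Union[Atom, And, Or, Exists]


def free_vars(f):
    """Set of variables occurring free in f."""
    if isinstance(f, Atom):
        return frozenset(f.vars)
    if isinstance(f, (And, Or)):
        return free_vars(f.left) | free_vars(f.right)
    return free_vars(f.body) - {f.var}


def width(f):
    """Maximum number of free variables over all subformulas of f."""
    own = len(free_vars(f))
    if isinstance(f, Atom):
        return own
    if isinstance(f, (And, Or)):
        return max(own, width(f.left), width(f.right))
    return max(own, width(f.body))
```

For example, the sentence ∃x ∃y (P(x) ∧ Q(y)) has width 2 because the conjunction has two free variables, while the equivalent sentence (∃x P(x)) ∧ (∃y Q(y)), obtained by moving quantifiers inward, has width 1.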

2024-11-15T14:41:47Z Hubie Chen Stefan Mengel http://arxiv.org/abs/2505.16635v3 WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos 2026-03-09T06:16:20Z

Relational databases are often fragmented across organizations, creating data silos that hinder distributed data management and mining. Collaborative learning (CL) -- techniques that enable multiple parties to train models jointly without sharing raw data -- offers a principled approach to this challenge. However, existing CL frameworks (e.g., federated and split learning) remain limited in real-world deployments. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases, and they typically neglect the end-to-end data management pipeline, especially preprocessing steps such as table joins and data alignment. In contrast, our analysis of the real-world corpus WikiDBs shows that databases are interconnected, unaligned, and sometimes unjoinable, exposing a significant gap between CL algorithm design and practical deployment. To close this evaluation gap, we build WikiDBGraph, a large-scale dataset constructed from 100{,}000 real-world relational databases linked by 17 million weighted edges. Each node (database) and edge (relationship) is annotated with 13 and 12 properties, respectively, capturing a hybrid of instance- and feature-level overlap across databases. Experiments on WikiDBGraph demonstrate both the effectiveness and limitations of existing CL methods under realistic conditions, highlighting previously overlooked gaps in managing real-world data silos and pointing to concrete directions for practical deployment of collaborative learning systems.

2025-05-22T13:07:06Z ICDE 2026 Zhaomin Wu Ziyang Wang Bingsheng He http://arxiv.org/abs/2603.07950v1 Decomposition-Driven Multi-Table Retrieval and Reasoning for Numerical Question Answering 2026-03-09T04:32:20Z

In this paper, we study the problem of numerical multi-table question answering (MTQA) over large-scale table collections (e.g., online data repositories). This task is essential in many analytical applications. Existing MTQA solutions, such as text-to-SQL or open-domain MTQA methods, are designed for databases and struggle when applied to large-scale table collections. The key limitations include: (1) limited support for complex table relationships; (2) ineffective retrieval of relevant tables at scale; (3) inaccurate answer generation. To overcome these limitations, we propose DMRAL, a Decomposition-driven Multi-table Retrieval and Answering framework for MTQA over large-scale table collections, which consists of: (1) a table relationship graph that captures complex relationships among tables; (2) a Table-Aligned Question Decomposer and a Coverage-Aware Retriever, which jointly enable the effective identification of relevant tables from large-scale corpora by enhancing the question decomposition quality and maximizing the question coverage of retrieved tables; and (3) a Sub-question Guided Reasoner, which produces correct answers by progressively generating and refining the reasoning program based on sub-questions. Experiments on two MTQA datasets demonstrate that DMRAL significantly outperforms existing state-of-the-art MTQA methods, with an average improvement of 24% in table retrieval and 55% in answer accuracy.

2026-03-09T04:32:20Z This is the technical report for the ICDE 2026 paper Feng Luo Hai Lan Hui Luo Zhifeng Bao Xiaoli Wang J. Shane Culpepper Shazia Sadiq http://arxiv.org/abs/2507.10934v3 Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models 2026-03-09T04:16:39Z

Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG exhibit superior pattern and distribution similarity compared to both rule-based methods and LLM-generated errors without fine-tuning. Furthermore, performance metrics on TableEG-generated errors closely align with those on real-world errors across nearly all datasets and detection algorithms, particularly for machine learning based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.

2025-07-15T02:58:25Z Xinyuan Liu Jiahui Chen Bocheng Hu Yu Sun Xinyang Chen Shaoxu Song Yongxin Tong http://arxiv.org/abs/2603.07916v1 Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases 2026-03-09T03:18:26Z

To enable a fully data-driven learning paradigm on relational databases (RDBs), relational deep learning (RDL) was recently proposed; it structures the RDB as a heterogeneous entity graph and adopts a graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing the minority entities, leading to an unusable model in practice. In this work, we investigate, for the first time, the class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS) to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding an average improvement of up to 2.46% and 4.00% in terms of Balanced Accuracy and G-Mean, compared with SOTA RDL methods and classic methods for handling class imbalance.
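
The two reported metrics reward performance on minority classes. The following sketch uses the common definitions (Balanced Accuracy as the mean of per-class recalls, G-Mean as their geometric mean), which are assumed here because the abstract does not spell them out. The function names are hypothetical.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; zero if any class is never recovered."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1 / len(recalls))
```

A classifier that always predicts the majority class on a 90/10 split has plain accuracy 0.9, but Balanced Accuracy 0.5 and G-Mean 0, which is why these metrics are preferred for imbalanced entity classification.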

2026-03-09T03:18:26Z Jun Yin Peng Huo Bangguo Zhu Hao Yan Senzhang Wang Shirui Pan Chengqi Zhang http://arxiv.org/abs/2506.05587v4 MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark 2026-03-09T00:32:06Z

Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69\% and 57\% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.

2025-06-05T21:05:03Z Full version of a paper accepted at NeurIPS 2025; Code and data available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU Junjie Xing Yeye He Mengyu Zhou Haoyu Dong Shi Han Lingjiao Chen Dongmei Zhang Surajit Chaudhuri H. V. Jagadish http://arxiv.org/abs/2510.19805v2 Next Generation Cloud-native In-Memory Stores: From Redis to Valkey and Beyond 2026-03-08T18:02:45Z

In-memory key-value datastores have become indispensable building blocks of modern cloud-native infrastructures, yet their evolution faces scalability, compatibility, and sustainability constraints. The current literature lacks an experimental evaluation of state-of-the-art tools in the domain. This study addressed this timely gap by benchmarking Redis alternatives and systematically evaluating Valkey, KeyDB, and Garnet under realistic workloads within Kubernetes deployments. The results demonstrate clear trade-offs among the benchmarked data systems. Our study presents a comprehensive performance and viability assessment of the emerging in-memory key-value stores. Metrics include throughput, tail latency, CPU and memory efficiency, and migration complexity. We highlight trade-offs between performance, compatibility, and long-term viability, including project maturity, community support, and sustained development.

2025-10-22T17:40:17Z The first author was neither informed nor did he give his consent to the publication of the paper on arXiv. Further, the submission contains multiple major errors in the reported numerical results, and the related conclusions are not supported by the underlying benchmark data. The first author does not stand behind the paper. Carl-Johan Fauvelle Munck af Rosenschöld Feras M. Awaysheh Ahmad Awad http://arxiv.org/abs/2603.07750v1 Structured Gossip: A Partition-Resilient DNS for Internet-Scale Dynamic Networks 2026-03-08T17:54:36Z

Network partitions pose fundamental challenges to distributed name resolution in mobile ad-hoc networks (MANETs) and edge computing. Existing solutions either require active coordination that fails to scale, or use unstructured gossip with excessive overhead. We present \textit{Structured Gossip DNS}, exploiting DHT finger tables to achieve partition resilience through \textbf{passive stabilization}. Our approach reduces message complexity from $O(n)$ to $O(n/\log n)$ while maintaining $O(\log^2 n)$ convergence. Unlike active protocols requiring synchronous agreement, our passive approach guarantees eventual consistency through commutative operations that converge regardless of message ordering. The system handles arbitrary concurrent partitions via version vectors, eliminating global coordination and enabling billion-node deployments.
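
The claim that commutative operations converge regardless of message ordering can be illustrated with a generic version-vector merge. This is a textbook sketch, not the paper's protocol; the function names `merge_vv` and `compare` are hypothetical.

```python
def merge_vv(a, b):
    """Element-wise maximum of two version vectors (dict: node id -> counter).

    The operation is commutative, associative, and idempotent, so replicas
    reach the same state whatever order the updates arrive in.
    """
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def compare(a, b):
    """Classify a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    a_ge_b = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
    if a_le_b and a_ge_b:
        return "equal"
    if a_le_b:
        return "before"
    if a_ge_b:
        return "after"
    return "concurrent"
```

Two sides of a partition that each advance their own counters produce vectors that compare as concurrent; merging them in either order after the partition heals yields the same vector.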

2026-03-08T17:54:36Z Rejected from ACM SIGMOD 2026 Demo Track Priyanka Sinha Dilys Thomas http://arxiv.org/abs/2603.12287v1 Context-Enriched Natural Language Descriptions of Vessel Trajectories 2026-03-08T15:17:25Z

We address the problem of transforming raw vessel trajectory data collected from AIS into structured and semantically enriched representations interpretable by humans and directly usable by machine reasoning systems. We propose a context-aware trajectory abstraction framework that segments noisy AIS sequences into distinct trips each consisting of clean, mobility-annotated episodes. Each episode is further enriched with multi-source contextual information, such as nearby geographic entities, offshore navigation features, and weather conditions. Crucially, such representations can support generation of controlled natural language descriptions using LLMs. We empirically examine the quality of such descriptions generated using several LLMs over AIS data along with open contextual features. By increasing semantic density and reducing spatiotemporal complexity, this abstraction can facilitate downstream analytics and enable integration with LLMs for higher-level maritime reasoning tasks.

2026-03-08T15:17:25Z Kostas Patroumpas Alexandros Troupiotis-Kapeliaris Giannis Spiliopoulos Panagiotis Betchavas Dimitrios Skoutas Dimitris Zissis Nikos Bikakis http://arxiv.org/abs/2505.05286v3 HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL 2026-03-08T12:02:06Z

Recent advances in agentic large language models (LLMs) have substantially improved Text-to-SQL, enabling users without database expertise to query databases intuitively. However, deploying agentic LLM-based Text-to-SQL systems in production remains challenging due to multi-stage dependencies, strict latency requirements, and deployment complexity across heterogeneous GPUs in enterprise clusters. Existing LLM serving frameworks are designed mainly for independent inference tasks, leading to suboptimal performance and frequent service-level objective (SLO) violations for Text-to-SQL workloads. In this paper, we introduce HEXGEN-FLOW, a framework for scheduling and executing agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters serving multi-tenant requests. HEXGEN-FLOW adopts a hierarchical scheduler that combines global workload-balanced task dispatching with an adaptive local priority queue, guided by a systematic analysis of agentic Text-to-SQL workflows. We also propose a lightweight simulation-based method to tune key scheduling hyperparameters, improving robustness and adaptability. Evaluations on realistic Text-to-SQL benchmarks show that HEXGEN-FLOW significantly outperforms state-of-the-art LLM serving frameworks. Across all traces, HEXGEN-FLOW reduces P95 tail latency by $1.42{\sim}1.56\times$ and increases throughput by $1.49{\sim}1.81\times$, demonstrating consistent gains under diverse workloads.

2025-05-08T14:28:47Z You Peng Youhe Jiang Wenqi Jiang Chen Wang Binhang Yuan http://arxiv.org/abs/2603.07517v1 GP-Tree: An in-memory spatial index combining adaptive grid cells with a prefix tree for efficient spatial querying 2026-03-08T07:59:03Z

Efficient spatial indexing is crucial for processing large-scale spatial data. Traditional spatial indexes, such as STR-Tree and Quad-Tree, organize spatial objects based on coarse approximations, such as their minimum bounding rectangles (MBRs). However, this coarse representation is inadequate for complex spatial objects (e.g., district boundaries and trajectories), limiting filtering accuracy and query performance of spatial indexes. To address these limitations, we propose GP-Tree, a fine-grained spatial index that organizes approximated grid cells of spatial objects into a prefix tree structure. GP-Tree enhances filtering ability by replacing coarse MBRs with fine-grained cell-based approximations of spatial objects. The prefix tree structure optimizes data organization and query efficiency by leveraging the shared prefixes in the hierarchical grid cell encodings between parent and child cells. Additionally, we introduce optimization strategies, including tree pruning and node optimization, to reduce search paths and memory consumption, further enhancing GP-Tree's performance. Finally, we implement a variety of spatial query operations on GP-Tree, including range queries, distance queries, and k-nearest neighbor queries. Extensive experiments on real-world datasets demonstrate that GP-Tree significantly outperforms traditional spatial indexes, achieving up to an order-of-magnitude improvement in query efficiency.
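
The shared-prefix idea can be sketched with a plain trie over hierarchical cell codes. The sketch assumes quadtree-style string codes in which a child cell's code is its parent's code plus one digit; the class `CellTrie` and its methods are hypothetical and omit GP-Tree's adaptive cells, pruning, and node optimizations.

```python
class CellTrie:
    """Prefix tree keyed by hierarchical grid-cell codes (assumed encoding:
    child code = parent code + one digit)."""

    def __init__(self):
        self.children = {}
        self.objects = set()

    def insert(self, code, obj_id):
        """Register obj_id as covering the cell identified by `code`."""
        node = self
        for digit in code:
            node = node.children.setdefault(digit, CellTrie())
        node.objects.add(obj_id)

    def candidates(self, code):
        """Objects stored on an ancestor of `code`, on `code` itself, or on
        any of its descendants; these are the cells that can overlap it."""
        found = set(self.objects)
        node = self
        for digit in code:
            node = node.children.get(digit)
            if node is None:
                return found
            found |= node.objects
        stack = [node]
        while stack:
            current = stack.pop()
            found |= current.objects
            stack.extend(current.children.values())
        return found
```

A lookup walks one root-to-node path and one subtree, so objects whose cells share no prefix relation with the query cell are never visited.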

2026-03-08T07:59:03Z Xiangyang Yang Xuefeng Guan Lanxue Dang Yi Xie Qingyang Xu Huayi Wu Jiayao Wang http://arxiv.org/abs/2603.07449v1 Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System 2026-03-08T03:56:15Z

Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built-in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt-based approaches tightly couple intent reasoning with dialect syntax, rule-based translators often degrade native operators into generic constructs, and multi-dialect fine-tuning suffers from cross-dialect interference. In this paper, we present Dial, a knowledge-grounded framework for dialect-specific NL2SQL. Dial introduces: (1) a Dialect-Aware Logical Query Planning module that converts natural language into a dialect-aware logical query plan via operator-level intent decomposition and divergence-aware specification; (2) HINT-KB, a hierarchical intent-aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution-driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state-of-the-art baselines. The code is at https://github.com/weAIDB/Dial.

2026-03-08T03:56:15Z Xiang Zhang Hongming Xu Le Zhou Wei Zhou Xuanhe Zhou Guoliang Li Yuyu Luo Changdong Liu Guorun Chen Jiang Liao Fan Wu http://arxiv.org/abs/2603.07382v1 Enhancing OLAP Resilience at LinkedIn 2026-03-07T23:48:25Z

Real-time OLAP datastores are critical infrastructure for modern enterprises, powering interactive analytics on petabyte-scale datasets with subsecond latency requirements. As these systems become integral to service architectures, maintaining strict SLAs under failures, load spikes, and cluster changes is as important as raw performance. We present a set of resiliency mechanisms developed for Apache Pinot at LinkedIn, applicable to modern OLAP systems broadly. We introduce Query Workload Isolation (QWI), which provides workload-level CPU and memory budgeting across Pinot's broker and server tiers via fine-grained resource accounting and sub-millisecond enforcement, delivering predictable tail latency and fairness with under 1% overhead. We present Impact-Free Rebalancing for SLA-safe data movement during routine operations (e.g., upgrades, scale-out, and recovery), and Maintenance Zone Awareness to place replicas across fault domains and mitigate correlated failures. We also describe Adaptive Server Selection, which routes queries using real-time load and performance signals to avoid slow or failing nodes while preserving balanced utilization. Together, these mechanisms form a holistic resiliency framework deployed in production at LinkedIn, enabling stable query latency and high availability at scale.

2026-03-07T23:48:25Z 14 pages, 12 figures Praveen Chaganlal Jia Guo Vivek Vaidyanathan Dino Occhialini Sonam Mandal Subbu Subramaniam Siddharth Teotia Tianqi Li Xiaxuan Gao Florence Zhang