https://arxiv.org/api/iY0toizNFC1O0+o6zv7ZirVmC7U 2026-03-21T08:18:58Z 11369 225 15 http://arxiv.org/abs/2602.23286v1 SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables 2026-02-26T17:59:51Z

Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.

2026-02-26T17:59:51Z 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: https://sparta-projectpage.github.io/ The Fourteenth International Conference on Learning Representations (ICLR), 2026 Sungho Park Jueun Kim Wook-Shin Han http://arxiv.org/abs/2602.22805v1 Optimizing SSD-Resident Graph Indexing for High-Throughput Vector Search 2026-02-26T09:42:20Z

Graph-based approximate nearest neighbor search (ANNS) methods (e.g., HNSW) have become the de facto state of the art for their high precision and low latency. To scale beyond main memory, recent out-of-memory ANNS systems leverage SSDs to store large vector indexes. However, they still suffer from severe CPU underutilization and read amplification (i.e., storage stalls) caused by limited access locality during graph traversal. We present VeloANN, which mitigates storage stalls through a locality-aware data layout and a coroutine-based asynchronous runtime. VeloANN utilizes hierarchical compression and affinity-based data placement scheme to co-locate related vectors within the same page, effectively reducing fragmentation and over-fetching. We further design a record-level buffer pool, where each record groups the neighbors of a vector; by persistently retaining hot records in memory, it eliminates excessive page swapping under constrained memory budgets. To minimize CPU scheduling overheads during disk I/O interruptions, VeloANN employs a coroutine-based asynchronous runtime for lightweight task scheduling. On top of this, it incorporates asynchronous prefetching and a beam-aware search strategy to prioritize cached data, ultimately improving overall search efficiency. Extensive experiments show that VeloANN outperforms state-of-the-art disk-based ANN systems by up to 5.8x in throughput and 3.25x in latency reduction, while achieving 0.92x the throughput of in-memory systems using only 10% of their memory footprint.

2026-02-26T09:42:20Z Weichen Zhao Yuncheng Lu Yao Tian Hao Zhang Jiehui Li Minghao Zhao Yakun Li Weining Qian http://arxiv.org/abs/2602.22771v1 ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making 2026-02-26T09:03:41Z

Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models (LLMs), we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.

2026-02-26T09:03:41Z 17 pages, 3 figures, 10 tables Yusuke Watanabe Yohei Kobashi Takeshi Kojima Yusuke Iwasawa Yasushi Okuno Yutaka Matsuo http://arxiv.org/abs/2602.22721v1 Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA 2026-02-26T07:49:50Z

Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation pipelines in a multi-step manner offering state-of-the-art performance. However, these solutions rely on multiple LLM calls, resulting in prohibitive latencies and computational costs. We propose Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step. To train such an LLM, we first introduce a self-supervised rewarding mechanism to automatically obtain fine-grained pipeline-wise supervision signals for LLM training. We also propose variance-aware group resampling to mitigate training instability. To further enhance robustness of pipeline generation, we develop two complementary mechanisms: operation merge, which filters spurious operations through multi-candidate consensus, and adaptive rollback, which offers runtime protection against information loss in data transformation. Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79\% table compression and a 2.2$\times$ reduction in monetary cost.

2026-02-26T07:49:50Z Fengyu Li Junhao Zhu Kaishi Song Lu Chen Zhongming Yao Tianyi Li Christian S. Jensen http://arxiv.org/abs/2602.22699v1 DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule 2026-02-26T07:21:00Z

SQL is the de facto interface for exploratory data analysis; however, releasing exact query results can expose sensitive information through membership or attribute inference attacks. Differential privacy (DP) provides rigorous privacy guarantees, but in practice, DP alone may not satisfy governance requirements such as the \emph{minimum frequency rule}, which requires each released group (cell) to include contributions from at least $k$ distinct individuals. In this paper, we present \textbf{DPSQL+}, a privacy-preserving SQL library that simultaneously enforces user-level $(\varepsilon,δ)$-DP and the minimum frequency rule. DPSQL+ adopts a modular architecture consisting of: (i) a \emph{Validator} that statically restricts queries to a DP-safe subset of SQL; (ii) an \emph{Accountant} that consistently tracks cumulative privacy loss across multiple queries; and (iii) a \emph{Backend} that interfaces with various database engines, ensuring portability and extensibility. Experiments on the TPC-H benchmark demonstrate that DPSQL+ achieves practical accuracy across a wide range of analytical workloads -- from basic aggregates to quadratic statistics and join operations -- and allows substantially more queries under a fixed global privacy budget than prior libraries in our evaluation.

2026-02-26T07:21:00Z Tomoya Matsumoto Shokichi Takakura Shun Takagi Satoshi Hasegawa http://arxiv.org/abs/2602.22434v1 GetBatch: Distributed Multi-Object Retrieval for ML Data Loading 2026-02-25T21:45:18Z

Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch - a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.

2026-02-25T21:45:18Z 11 pages, 3 figures, 2 tables. Preprint Alex Aizman Abhishek Gaikwad Piotr Żelasko http://arxiv.org/abs/2504.17203v4 High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services 2026-02-25T18:55:05Z

The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low-fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that includes columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that lead to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (\textit{Gemini}) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.

2025-04-24T02:27:17Z Shivasankari Kannan Yeounoh Chung Amita Gondi Tristan Swadell Fatma Ozcan http://arxiv.org/abs/2505.09198v2 From RDF Graph Validation to RDF Dataset Validation with SHACL-DS 2026-02-25T18:53:48Z

The Shapes Constraint Language (SHACL) is the W3C Recommendation for validating a single RDF graph. This makes SHACL inadequate for validating data across (named) graphs in an RDF dataset. Existing workarounds, such as graph unions or bespoke preprocessing, either collapse the RDF dataset structure or compromise the declarative nature of SHACL validation. In the former, we lose track of where triples come from; in the latter, knowledge is hidden in the code, and the constraints are not self-contained nor fully declarative. We present SHACL-DS to address this problem. SHACL-DS proposes a vocabulary and an algorithm on top of SHACL for RDF dataset validation. SHACL-DS introduces the concepts of Shapes Datasets, Target Graph Declarations, and Target Graph Combinations, enabling declarative constraints to operate across multiple graphs in an RDF dataset. SHACL-DS also defines the behaviour of SPARQL-based constraints for validating RDF datasets. In this paper, we formalize SHACL-DS and provide a prototype implementation.

2025-05-14T07:21:17Z Accepted for ESWC2026 Davan Chiem Dao Christophe Debruyne http://arxiv.org/abs/2603.02248v1 HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval 2026-02-25T15:42:24Z

Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming "stars," which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with a significant improvement up to 42.6\% and 39.9\% in recall and nDCG, respectively, on the OTT-QA benchmark.

2026-02-25T15:42:24Z 9 pages, 6 figures. Accepted at ACL 2025 main. Project page: https://helios-projectpage.github.io/ Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32424-32444, July 2025 Sungho Park Joohyung Yun Jongwuk Lee Wook-Shin Han 10.18653/v1/2025.acl-long.1559 http://arxiv.org/abs/2602.21955v1 Detecting Logic Bugs of Join Optimizations in DBMS 2026-02-25T14:36:22Z

Generation-based testing techniques have shown their effectiveness in detecting logic bugs of DBMS, which are often caused by improper implementation of query optimizers. Nonetheless, existing generation-based debug tools are limited to single-table queries and there is a substantial research gap regarding multi-table queries with join operators. In this paper, we propose TQS, a novel testing framework targeted at detecting logic bugs derived by queries involving multi-table joins. Given a target DBMS, TQS achieves the goal with two key components: Data-guided Schema and Query Generation (DSG) and Knowledge-guided Query Space Exploration (KQE). DSG addresses the key challenge of multi-table query debugging: how to generate ground-truth (query, result) pairs for verification. It adopts the database normalization technique to generate a testing schema and maintains a bitmap index for result tracking. To improve debug efficiency, DSG also artificially inserts some noises into the generated data. To avoid repetitive query space search, KQE forms the problem as isomorphic graph set discovery and combines the graph embedding and weighted random walk for query generation. We evaluated TQS on four popular DBMSs: MySQL, MariaDB, TiDB and the gray release of an industry-leading cloud-native database, anonymized as X-DB. Experimental results show that TQS is effective in finding logic bugs of join optimization in database management systems. It successfully detected 115 bugs within 24 hours, including 31 bugs in MySQL, 30 in MariaDB, 31 in TiDB, and 23 in X-DB respectively.

2026-02-25T14:36:22Z Proceedings of the ACM on Management of Data (SIGMOD 2023) Xiu Tang Sai Wu Dongxiang Zhang Feifei Li Gang Chen 10.1145/3588909 http://arxiv.org/abs/2602.05674v2 Fast Private Adaptive Query Answering for Large Data Domains 2026-02-25T14:10:20Z

Privately releasing marginals of a tabular dataset is a foundational problem in differential privacy. However, state-of-the-art mechanisms suffer from a computational bottleneck when marginal estimates are reconstructed from noisy measurements. Recently, residual queries were introduced and shown to lead to highly efficient reconstruction in the batch query answering setting. We introduce new techniques to integrate residual queries into state-of-the-art adaptive mechanisms such as AIM. Our contributions include a novel conceptual framework for residual queries using multi-dimensional arrays, lazy updating strategies, and adaptive optimization of the per-round privacy budget allocation. Together these contributions reduce error, improve speed, and simplify residual query operations. We integrate these innovations into a new mechanism (AIM+GReM), which improves AIM by using fast residual-based reconstruction instead of a graphical model approach. Our mechanism is orders of magnitude faster than the original framework and demonstrates competitive error and greatly improved scalability.

2026-02-05T13:57:56Z Miguel Fuentes Brett Mullins Yingtai Xiao Daniel Kifer Cameron Musco Daniel Sheldon http://arxiv.org/abs/2602.21766v1 RAMSeS: Robust and Adaptive Model Selection for Time-Series Anomaly Detection Algorithms 2026-02-25T10:36:32Z

Time-series data vary widely across domains, making a universal anomaly detector impractical. Methods that perform well on one dataset often fail to transfer because what counts as an anomaly is context dependent. The key challenge is to design a method that performs well in specific contexts while remaining adaptable across domains with varying data complexities. We present the Robust and Adaptive Model Selection for Time-Series Anomaly Detection RAMSeS framework. RAMSeS comprises two branches: (i) a stacking ensemble optimized with a genetic algorithm to leverage complementary detectors. (ii) An adaptive model-selection branch identifies the best single detector using techniques including Thompson sampling, robustness testing with generative adversarial networks, and Monte Carlo simulations. This dual strategy exploits the collective strength of multiple models and adapts to dataset-specific characteristics. We evaluate RAMSeS and show that it outperforms prior methods on F1.

2026-02-25T10:36:32Z Mohamed Abdelmaksoud Sheng Ding Andrey Morozov Ziawasch Abedjan http://arxiv.org/abs/2507.21989v2 Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors 2026-02-25T10:13:41Z

Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, and others. Many of these applications require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item's attributes, a problem known as filtered approximate nearest neighbor search (FANNS). By performing an in-depth literature analysis on FANNS, we identify a key gap in the research landscape: publicly available datasets with embedding vectors from state-of-the-art transformer-based text embedding models that contain abundant real-world attributes covering a broad spectrum of attribute types and value distributions. To fill this gap, we introduce the arxiv-for-fanns dataset of transformer-based embedding vectors for the abstracts of over 2.7 million arXiv papers, enriched with 11 real-world attributes such as authors and categories. We benchmark eleven different FANNS methods on our new dataset to evaluate their performance across different filter types, numbers of retrieved neighbors, dataset scales, and query selectivities. We distill our findings into eight key observations that guide users in selecting the most suitable FANNS method for their specific use cases.

2025-07-29T16:39:54Z Patrick Iff Paul Bruegger Marcin Chrapek David Kochergin Maciej Besta Torsten Hoefler http://arxiv.org/abs/2602.21717v1 C$^{2}$TC: A Training-Free Framework for Efficient Tabular Data Condensation 2026-02-25T09:25:24Z

Tabular data is the primary data format in industrial relational databases, underpinning modern data analytics and decision-making. However, the increasing scale of tabular data poses significant computational and storage challenges to learning-based analytical systems. This highlights the need for data-efficient learning, which enables effective model training and generalization using substantially fewer samples. Dataset condensation (DC) has emerged as a promising data-centric paradigm that synthesizes small yet informative datasets to preserve data utility while reducing storage and training costs. However, existing DC methods are computationally intensive due to reliance on complex gradient-based optimization. Moreover, they often overlook key characteristics of tabular data, such as heterogeneous features and class imbalance. To address these limitations, we introduce C$^{2}$TC (Class-Adaptive Clustering for Tabular Condensation), the first training-free tabular dataset condensation framework that jointly optimizes class allocation and feature representation, enabling efficient and scalable condensation. Specifically, we reformulate the dataset condensation objective into a novel class-adaptive cluster allocation problem (CCAP), which eliminates costly training and integrates adaptive label allocation to handle class imbalance. To solve the NP-hard CCAP, we develop HFILS, a heuristic local search that alternates between soft allocation and class-wise clustering to efficiently obtain high-quality solutions. Moreover, a hybrid categorical feature encoding (HCFE) is proposed for semantics-preserving clustering of heterogeneous discrete attributes. Extensive experiments on 10 real-world datasets demonstrate that C$^{2}$TC improves efficiency by at least 2 orders of magnitude over state-of-the-art baselines, while achieving superior downstream performance.

2026-02-25T09:25:24Z Sijia Xu Fan Li Xiaoyang Wang Zhengyi Yang Xuemin Lin http://arxiv.org/abs/2602.21700v1 Maximal Biclique Enumeration with Improved Worst-Case Time Complexity Guarantee: A Partition-Oriented Strategy 2026-02-25T09:03:58Z

The maximal biclique enumeration problem in bipartite graphs is fundamental and has numerous applications in E-commerce and transaction networks. Most existing studies adopt a branch-and-bound framework, which recursively expands a partial biclique with a vertex until no further vertices can be added. Equipped with a basic pivot selection strategy, all state-of-the-art methods have a worst-case time complexity no better than $O(m\cdot (\sqrt{2})^n)$}, where $m$ and $n$ are the number of edges and vertices in the graph, respectively. In this paper, we introduce a new branch-and-bound (BB) algorithm \texttt{IPS}. In \texttt{IPS}, we relax the strict stopping criterion of existing methods by allowing termination when all maximal bicliques within the current branch can be outputted in the time proportional to the number of maximal bicliques inside, reducing the total number of branches required. Second, to fully unleash the power of the new termination condition, we propose an improved pivot selection strategy, which well aligns with the new termination condition to achieve better theoretical and practical performance. Formally, \texttt{IPS} improves the worst-case time complexity to $O(m\cdot α^n + n\cdot β)$, where $α(\approx 1.3954)$ is the largest positive root of $x^4-2x-1=0$ and $β$ represents the number of maximal bicliques in the graph, respectively. This result surpasses that of all existing algorithms given that $α$ is strictly smaller than $\sqrt{2}$ and $β$ is at most $(\sqrt{2})^n-2$ theoretically. Furthermore, we apply an inclusion-exclusion-based framework to boost the performance of \texttt{IPS}, improving the worst-case time complexity to $O(n\cdot γ^2\cdotα^γ+ γ\cdot β)$ for large sparse graphs ($γ$ is a parameter satisfying $γ\ll n$ for sparse graphs).

2026-02-25T09:03:58Z Accepted by SIGMOD 2026 Kaixin Wang Kaiqiang Yu Cheng Long