http://arxiv.org/abs/2605.15173v1 Hybrid Sketching Methods for Dynamic Connectivity on Sparse Graphs 2026-05-14T17:57:03Z Dynamic connectivity is a fundamental dynamic graph problem, and recent algorithmic breakthroughs on dynamic graph sketching have reshaped what is theoretically possible: by encoding the graph as per-vertex linear sketches, these algorithms solve dynamic connectivity in only $\Theta(V \log^2 V)$ space, independent of the number of edges, outperforming lossless $\Theta(V+E)$-space structures that grow as the graph becomes denser. Prior to this work, no practical dynamic connectivity algorithm has been able to translate these theoretical breakthroughs into space savings on real-world graphs. The main obstacle is that per-vertex sketches cost thousands of bytes per vertex, so sketching only pays off once the graph becomes extremely dense. We observe that sparse real-world graphs are often not uniformly sparse: they can contain dense cores on a small subset of vertices that account for a large fraction of edges. We exploit this structure via hybrid sketching: sketch only the dense core, and store the sparse periphery losslessly. We design new hybrid algorithms for fully-dynamic and semi-streaming connectivity with space $O(\min\{V+E, V \log V \log(2+E/V)\})$ w.h.p., simultaneously matching the lossless bound on sparse graphs, the sketching bound on dense graphs, and improving on both in an intermediate regime. A key component is BalloonSketch, a new l0-sampler reducing per-vertex sketch sizes by up to 8x. We implement HybridSCALE, a modular system treating the lossless and sketch-based components as subroutines. HybridSCALE is the first sketch-based dynamic connectivity system to save space on common real-world graphs. 
Compared to the state-of-the-art lossless baseline, HybridSCALE saves up to 15% space on sparse graphs (average degree < 100), up to 92% on intermediate density graphs (average degree ~ 100-1000), and up to 97% on dense graphs (average degree > 1000). 2026-05-14T17:57:03Z Quinten De Man Gilvir Gill Michael A. Bender Laxman Dhulipala David Tench http://arxiv.org/abs/2605.15079v1 Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets 2026-05-14T17:04:39Z Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains. 2026-05-14T17:04:39Z 23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker Rafi Al Attrach Rajna Fani Sebastian Lobentanzer Joan Giner-Miguelez Debanshu Das Varuni H. K. Nobin Sarwar Rajat Ghosh Anwai Archit Surbhi Motghare Christina Conrad Parry Luis Oala Lara Grosso Joaquin Vanschoren Steffen Vogler Sujata Goswami Eric S. 
Rosenthal Marzyeh Ghassemi Matthew McDermott Tom Pollard http://arxiv.org/abs/2605.03596v4 Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies 2026-05-14T13:14:02Z Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%. 
2026-05-05T10:17:06Z 30 pages, 16 figures Zirui Tang Xuanhe Zhou Yumou Liu Linchun Li Yukai Wu Weizheng Wang Hongzhang Huang Wei Zhou Jun Zhou Jiachen Song Shaoli Yu Jinqi Wang Zihang Zhou Hongyi Zhou Yuting Lv Jinyang Li Jiashuo Liu Ruoyu Chen Chunwei Liu GuoLiang Li Jihua Kang Fan Wu http://arxiv.org/abs/2605.14719v1 A Toolbox to Understand the Physics of Quantum Data Management 2026-05-14T11:40:34Z The application of quantum computing to data management has attracted growing interest, yet remains constrained by a limited understanding of how the physical behaviour of quantum devices relates to the structure and difficulty of database problems. In particular, evaluating quantum annealing approaches for combinatorial optimisation, which is central to many data management tasks, poses significant challenges beyond the scope of conventional empirical and complexity-theoretic methods. We present a computational toolbox for the systematic numerical analysis of quantum annealing processes derived from data management problem formulations. Adopting a physics-informed perspective, the toolbox enables the study of spectral and dynamical properties -- such as energy gaps and eigenstate structure -- that are inaccessible through direct hardware measurements, yet essential for understanding computational hardness and scaling behaviour. Our approach further provides derived quantities and visualisation techniques that support the interpretation of optimisation dynamics, the identification of structural similarities to canonical physical models, and the construction of reduced effective descriptions. By bridging methodological gaps between quantum computing and database systems research, this work establishes a principled foundation for evaluating quantum approaches and guiding future co-design efforts. 
2026-05-14T11:40:34Z To appear at Q-Data@IEEE SIGMOD'26 Wolfgang Mauerer Manuel Schönberger http://arxiv.org/abs/2601.04722v2 Toward Temporal Attribution Analytics in Dataflows 2026-05-14T07:01:58Z Data provenance (the process of determining the origin and derivation of data outputs) has applications across multiple domains, including explaining database query results and auditing scientific workflows. Despite decades of research, provenance tracing remains challenging due to its high computational cost and storage requirements. In streaming systems such as Apache Flink, fine-grained provenance graphs can grow super-linearly with data volume, posing significant scalability challenges. We define temporal attribution, a new lightweight form of provenance suited to tasks such as quantitatively monitoring dependencies between system components over time. Temporal attribution enables time-focused analysis that does not require fine-grained, tuple-level dependency metadata. Inspired by volume-based provenance tracking in Temporal Interaction Networks (TINs), we demonstrate TINs' applicability in succinctly modeling quantified data exchanges over time between dataflow operators, both in stream data processing systems and in processing workflows in general. We classify data into discrete and liquid types, define five temporal provenance query types, and propose a state-based indexing approach. Our vision outlines research directions toward making this new form of temporal attribution a practical tool for large-scale dataflow analytics. 2026-01-08T08:37:09Z Chrysanthi Kosyfaki Ruiyuan Zhang Nikos Mamoulis Xiaofang Zhou http://arxiv.org/abs/2605.14464v1 From Schema to Signal: Retrieval-Augmented Modeling for Relational Data Analytics 2026-05-14T06:59:32Z Relational data stored in RDBMS is foundational to many real-world applications across domains such as e-commerce, finance, and sociality. 
While deep neural networks (DNNs) have achieved strong performance on tabular data with a single table, extending these models to relational databases is challenging due to the normalized multi-table structure and complex inter-table relationships. Existing approaches often rely strictly on schema-defined graphs, which overlook implicit semantic signals embedded in tuple attributes and suffer from rigid connectivity. In this work, we propose Retrieval-Augmented Modeling (RAM), a novel framework that combines graph structure with attribute semantics for relational data analytics. RAM treats tuple attributes as tokens and uses random walks to construct contextual documents, enabling the use of information retrieval techniques to estimate semantic relevance between tuples. Building on these documents, we introduce two retrieval-based augmentations: ATRA, which leverages intra-table relevance for contrastive learning, and ETRA, which links semantically related tuples across tables to enhance graph connectivity. Then, we propose a layer-wise model architecture tailored for relational data, which involves attribute embedding, feature integration, and graph aggregation layers to enable expressive and flexible representation learning. Extensive experiments on five real-world relational databases demonstrate that RAM consistently outperforms existing baselines in diverse prediction tasks, establishing a new state of the art for relational data analytics. 2026-05-14T06:59:32Z 14 pages Lingze Zeng Shaofeng Cai Changshuo Liu Zhongle Xie Yuncheng Wu Beng Chin Ooi http://arxiv.org/abs/2604.23477v2 SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models 2026-05-14T05:59:30Z Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. 
While text-to-SQL systems enable natural language querying, they remain limited to relational operations and cannot leverage the semantic reasoning capabilities of modern large language models (LLMs). Conversely, recent semantic operator systems extend relational algebra with LLM-powered operations (e.g., semantic joins, mappings, aggregations), but require users to manually construct complex query pipelines. To address this gap, we present SEMA-SQL, a system that automatically answers natural language questions by generating efficient queries that combine relational operations with LLM semantic reasoning. We formalize Hybrid Relational Algebra (HRA), a declarative abstraction unifying traditional relational operators with LLM user-defined functions (UDFs). The system automates three critical aspects: (1) query generation via in-context learning that produces HRA queries with precise natural language specifications for LLM UDFs, (2) query optimization via cost-based transformations and UDF rewriting, and (3) efficient execution algorithms that reduce LLM invocations by an average of 93% in semantic joins through intelligent batching. Extensive experiments with known benchmarks, and extensions thereof, demonstrate the significant query capability improvements possible with our design. 2026-04-26T00:05:53Z Yin Lin Tianjing Zeng Zhongjun Ding Rong Zhu Bolin Ding H. V. Jagadish Jingren Zhou http://arxiv.org/abs/2602.23342v2 AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search 2026-05-14T05:09:37Z On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely recognized to be limited by prohibitive I/O costs. Interestingly, we observe that, as vector dimensionality rises (e.g., to hundreds or thousands of dimensions), the performance of on-disk graph-based index systems becomes compute-bound rather than I/O-bound. 
This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. In particular, we first analyze the performance of existing on-disk graph-based index systems with an adapted roofline model; we then devise a novel on-disk data layout in AlayaLaser that exploits SIMD instructions on modern CPUs to alleviate the compute bottleneck revealed by that analysis. Next, we design a suite of optimization techniques (e.g., degree-based node cache, cluster-based entry point selection, and early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experimental studies on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems. 2026-02-26T18:48:29Z The paper has been accepted by SIGMOD 2026 Weijian Chen Haotian Liu Yangshen Deng Long Xiang Liang Huang Gezi Li Bo Tang http://arxiv.org/abs/2604.16813v3 PersonalHomeBench: Evaluating Agents in Personalized Smart Homes 2026-05-13T21:16:49Z Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. 
The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning. 2026-04-18T03:53:22Z Please use and cite the V3 version of this work, which includes updated correct author ordering and expanded error analysis in the appendix Manasa Bharadwaj Yolanda Liu InJung Yang Sungil Kim Nikhil Verma KoKeun Kim Kevin Ferreira YoungJoon Kim http://arxiv.org/abs/2409.02038v3 BEAVER: An Enterprise Benchmark for Text-to-SQL 2026-05-13T15:02:07Z Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. 
Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions. 2024-09-03T16:37:45Z Dataset and code are available at https://beaverbench.github.io/ Peter Baile Chen Devin Yang Weiyue Li Fabian Wenz Yi Zhang Nesime Tatbul Michael Cafarella Çağatay Demiralp Michael Stonebraker http://arxiv.org/abs/2301.08178v4 Work-Efficient Query Evaluation in Constant Time with PRAMs 2026-05-13T14:41:27Z The article studies query evaluation in parallel constant time in the CRCW PRAM model. While it is well-known that all relational algebra queries can be evaluated in constant time on an appropriate CRCW PRAM model, this article is interested in the efficiency of evaluation algorithms, that is, in the number of processors or, asymptotically equivalent, in the work. 
Naive evaluation in the parallel setting results in huge (polynomial) bounds on the work of such algorithms and in presentations of the result sets that can be extremely scattered in memory. The article discusses some obstacles for constant-time PRAM query evaluation. It presents algorithms for relational operators and explores three settings, in which efficient sequential query evaluation algorithms exist: acyclic queries, semijoin algebra queries, and join queries -- the latter in the worst-case optimal framework. Under mild assumptions -- that data values are numbers of polynomial size in the size of the database or that the relations of the database are suitably sorted -- constant-time algorithms are presented that are weakly work-efficient in the sense that work $\mathcal{O}(T^{1+\varepsilon})$ can be achieved, for every $\varepsilon>0$, compared to the time $T$ of an optimal sequential algorithm. Important tools are the algorithms for approximate prefix sums and compaction from Goldberg and Zwick (1995). 2023-01-19T17:10:30Z Related/Previous versions are discussed in the introduction of the paper Jens Keppeler Thomas Schwentick Christopher Spinrath http://arxiv.org/abs/2604.20946v2 Common Foundations for Recursive Shape Languages 2026-05-13T13:20:04Z As schema languages for RDF data become more mature, we are seeing efforts to extend them with recursive semantics, applying diverse ideas from logic programming and description logics. While ShEx has an official recursive semantics based on greatest fixpoints (GFP), the discussion for SHACL is ongoing and seems to be converging towards least fixpoints (LFP). A practical study we perform shows that, indeed, ShEx validators implement GFP, whereas SHACL validators are more heterogeneous. This situation creates tension between ShEx and SHACL, as their semantic commitments appear to diverge, potentially undermining interoperability and predictability. 
We aim to clarify this design space by comparing the main semantic options in a principled yet accessible way, hoping to engage both theoreticians and practitioners, especially those involved in developing tools and standards. We present a unifying formal semantics that treats LFP, GFP, and supported model semantics (SMS), clarifying their relationships and highlighting a duality between LFP and GFP on stratified fragments. Next, we investigate to what extent the directions taken by SHACL and ShEx are compatible. We show that, although ShEx and SHACL seem to be going in different directions, they include large fragments with identical expressive power. Moreover, there is a strong correspondence between these fragments through the aforementioned principle of duality. Finally, we present a complete picture of the data and combined complexity of ShEx and SHACL validation under LFP, GFP, and SMS, showing that SMS comes at a higher computational cost under standard complexity-theoretic assumptions. 2026-04-22T17:05:28Z Shqiponja Ahmetaj Iovka Boneva Jan Hidders Maxime Jakubowski Jose-Emilio Labra-Gayo Wim Martens Fabio Mogavero Filip Murlak Cem Okulmus Ognjen Savković Mantas Šimkus Dominik Tomaszuk http://arxiv.org/abs/2603.23105v3 Spatial Analysis on Value-Based Quadtrees of Rasterized Vector Data 2026-05-13T12:37:28Z Mobility data science offers insights into the complex interconnections of spatial data of moving objects and their surroundings, often based on a combination of vector and raster data. For example, mobility traces are usually in vector format, whereas weather data are often in raster format. Yet, available spatial analysis tools for exploratory data science push data scientists towards one or the other, providing only limited support for the respective other. 
In this paper, we contribute to this problem space with a value-based quadtree index, which serves as a bridge to support joint spatial analysis on vector and raster data, leveraging their unique autocorrelation property. We achieve a 90% reduction in median Point-in-Polygon query latency, while keeping the accuracy of query responses at the same level. 2026-03-24T11:54:47Z Accepted for publication at the 1st Workshop on Secure and Intelligent Data Spaces (SIDS 2026) in the proceedings of the 27th IEEE International Conference on Mobile Data Management (MDM 2026) Diana Baumann Nils Japke Tim C. Rese David Bermbach http://arxiv.org/abs/2605.13398v1 FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration 2026-05-13T11:53:59Z Online Transaction Processing (OLTP) is a classic application with a growing business. CPU-based OLTP has low lock-serving efficiency. The main reason is that most locks are cold, and the lock agent must issue frequent memory accesses to retrieve the lock details to determine whether to grant it. This motivates us to propose dedicated hardware-based lock agents with integrated lock tables to remove the DRAM access overhead. In this paper, we propose hardware-accelerated lock management and transaction processing for database systems. First, we propose a low-latency lock agent optimized for both lock acquisition and release requests. Second, we design a scalable transaction agent that executes the full transaction lifecycle. We present the architecture, optimizations, and design-space exploration of the proposed lock management and transaction processing system. Experimental results show up to 51x higher transaction throughput over the CPU baseline on the TPC-C benchmark. 
2026-05-13T11:53:59Z 10 pages Shien Zhu Gustavo Alonso http://arxiv.org/abs/2605.13367v1 A Horn extension of DL-Lite with NL data complexity 2026-05-13T11:26:32Z The literature on ontology-mediated query answering (OMQA) has been shaped by two key results: first-order rewritability for DL-Lite, and PTime-hardness of data complexity for essentially every description logic beyond it. This has effectively positioned DL-Lite as the only practical choice for query rewriting, restricting OMQA solutions to first-order queries and ontologies that can be rewritten into them. This AC0 vs. PTime dichotomy is especially limiting if we consider that OMQA targets graph-structured data, and that standard graph query languages (including the recent ISO standards GQL and SQL/PGQ) are typically NL-complete. Towards identifying a rich Horn DL that can be rewritten into graph query languages and that can still express many ELI and DL-Lite ontologies, we introduce a stratification mechanism for ELI that controls the interaction between conjunction and recursion. In this way, we obtain ELbotpreceq, a description logic that strictly extends the core DL-Lite, supports reachability axioms and restricted conjunction, and allows for reasoning in NL. We establish the NL upper bound via a rewriting into nested two-way regular path queries, a fragment of GQL, providing initial evidence that our ontology language is a promising candidate for extending OMQA to graph query languages. 2026-05-13T11:26:32Z Submitted to Description Logic Workshop 2025. Full version in preparation Janos Arpasi Bartosz Jan Bednarczyk Magdalena Ortiz