https://arxiv.org/api/KB/KYBEzvZK34vISbghfoDNEsKk
2026-03-20T10:45:08Z

http://arxiv.org/abs/2603.17664v1
On the generic information capacity of relational schemas with a single binary relation
2026-03-18T12:29:32Z
We consider database schemas consisting of a single binary relation, with key constraints and inclusion dependencies. Over this space of 20 schemas, we completely characterize when one schema is generically dominated by another schema. Generic dominance, a classical notion for measuring information capacity, expresses that every instance of a schema can be uniquely represented in the dominating schema, through application of a deterministic, generic data transformation. Our investigation is motivated both by current interest in schema design for graph databases, as well as by intrinsic scientific interest. We also consider the ternary case, but without inclusion dependencies, and discuss how the notions change in the presence of object identifiers.
2026-03-18T12:29:32Z
Benoît Groz (Université Paris-Saclay, CNRS, LISN, France); Jan Hidders (Birkbeck, University of London, UK); Nina Pardal (University of Edinburgh, UK); Jan Van den Bussche (Hasselt University, Belgium); Piotr Wieczorek (University of Wrocław, Poland)

http://arxiv.org/abs/2603.15080v3
Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database
2026-03-18T11:46:32Z
Biomedical knowledge is fragmented across siloed databases -- Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug-gene interactions, SIDER for side effects. We present three open-source biomedical knowledge graphs -- Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources) -- built on Samyama, a high-performance graph database written in Rust.
Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch loading (Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross-KG federation: loading all three snapshots into a single graph tenant enables property-based joins across datasets. Third, we introduce schema-driven MCP server generation for LLM agent access, evaluated on a new BiomedQA benchmark (40 pharmacology questions): domain-specific MCP tools achieve 98% accuracy vs. 85% for schema-aware text-to-Cypher and 75% for standalone GPT-4o, with zero schema errors.
All data sources are open-license. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes on commodity cloud hardware, with single-KG queries completing in 80-100ms and cross-KG federation joins in 1-4s.
2026-03-16T10:36:13Z
12 pages, 7 tables, open-source code and data
Madhulatha Mandarapu, Sandeep Kunkunuru

http://arxiv.org/abs/2603.17573v1
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
2026-03-18T10:25:08Z
Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method that can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, so each type is applied or optimized in isolation. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. HeiSD includes a retrieval-based SD optimization method, which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we propose a kinematic-based fused metric in HeiSD to automatically determine the hybrid boundary.
Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x to 2.41x in real-world scenarios, while sustaining a high task success rate.
2026-03-18T10:25:08Z
Zihao Zheng, Zhihao Mao, Sicheng Tian, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen

http://arxiv.org/abs/2303.18142v2
Shirakami: A Hybrid Concurrency Control Protocol for Tsurugi Relational Database System
2026-03-18T05:01:40Z
Bill-of-materials and telecommunications billing applications need to process both short transactions and long read-write transactions simultaneously. Recent work rarely addresses such evolving workloads. To deal with these workloads, we propose a new concurrency control protocol, Shirakami. Shirakami is a hybrid protocol. The first protocol, Shirakami-LTX, is for long read-write transactions based on multiversion view serializability. The second protocol, Shirakami-OCC, is for short transactions based on Silo. Shirakami naturally integrates them with write-preservation and epoch-based synchronization. It does not require dynamic protocol switching and provides stable performance. We implemented Shirakami as the transaction processing module of the Tsurugi system, which is a production-grade relational database system. The experimental results demonstrated that Tsurugi exhibited 19.7 times lower latency than PostgreSQL, and Shirakami-LTX exhibited 680 times higher throughput than Shirakami-OCC.
2023-03-31T15:26:42Z
Takayuki Tanabe, Shinichi Umegane, Suguru Arakawa, Ryoji Kurosawa, Takashi Hoshino, Hideyuki Kawashima, Masahiro Tanaka, Takashi Kambayashi

http://arxiv.org/abs/2603.18067v1
DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment
2026-03-18T03:36:16Z
Low-light conditions are challenging for vision-centric perception systems for autonomous driving in dark environments.
In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposure in small-range, static scenes. The dark images in current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day and night aligned dataset in dynamic driving scenes has significantly limited research in this area. Using our proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial contents, with an alignment error of just several centimeters. For each pair, we also manually label the object 2D bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and for 3D detection of autonomous driving in the dark environment. The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving and that it also generalizes to enhancing dark images and improving detection in other low-light driving environments, such as nuScenes.
2026-03-18T03:36:16Z
8 pages, 8 figures. Accepted to ICRA 2026
Wuqi Wang, Haochen Yang, Baolu Li, Jiaqi Sun, Xiangmo Zhao, Zhigang Xu, Qing Guo, Haigen Min, Tianyun Zhang, Hongkai Yu

http://arxiv.org/abs/2603.17298v1
Efficient and Effective Table-Centric Table Union Search in Data Lakes
2026-03-18T02:45:46Z
In data lakes, information on the same subject is often fragmented across multiple tables.
Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or ground-truth labels. Existing methods are mainly column-centric: they focus on modeling column unionability scores using column embeddings, which are then used throughout the search process for column matching, filtering, and aggregation. However, this overlooks holistic table-level semantics, which may result in suboptimal rankings and inefficiencies. We introduce TACTUS, a novel table-centric method for table union search. Unlike prior work that searches from columns to tables, we search in a table-first way and examine columns only in the final step. During offline processing, we directly generate table embeddings for holistic, table-level unionability scoring by designing table-level representation techniques, including positive table pair construction to simulate unionable tables, two-pronged negative table sampling to avoid latent positives and mine hard negatives to enhance representation quality, and attentive table encoding for effective embeddings. During online search, we first develop a table-centric adaptive candidate retrieval method that efficiently selects a compact, high-quality candidate pool by leveraging the distribution of table-level unionability scores induced by table embeddings. We then inspect columns only within this compact candidate set and design a dual-evidence reranking technique that integrates table-level and column-level scores to refine the final top-k results. 
Extensive experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline and online processing, often by an order of magnitude.
2026-03-18T02:45:46Z
14 pages
Yongkang Sun, Zhihao Ding, Huiqiang Wang, Reynold Cheng, Jieming Shi

http://arxiv.org/abs/2508.21304v3
ORCA: ORchestrating Causal Agent
2026-03-18T01:09:34Z
Causal analysis on relational databases is challenging, as analysis datasets must be repeatedly queried from complex schemas. Recent LLM systems can automate individual steps, but they struggle to manage dependencies across analysis stages, making it difficult to preserve consistency between causal hypotheses. We propose ORCA (ORchestrating Causal Agent), an interactive multi-agent framework to enable coherent causal analysis on relational databases by maintaining shared state and introducing human checkpoints. In a controlled user study, participants using ORCA successfully completed end-to-end analysis more often than with a baseline LLM (GPT-4o-mini) assistant by 42 percentage points, achieved substantially lower ATE error, and reduced time spent on repetitive data exploration and query refinement by 76% on average. These results show that ORCA improves both how users interact with the causal analysis pipeline and the reliability of the resulting causal conclusions.
2025-08-29T01:59:34Z
35 pages, CHI EA 2026
Joanie Hayoun Chung, Sumin Lee, Sungbin Lim

http://arxiv.org/abs/2603.17223v1
ListK: Semantic ORDER BY and LIMIT K with Listwise Prompting
2026-03-18T00:08:12Z
Semantic operators abstract large language model (LLM) calls in SQL clauses. They are gaining traction as an easy method to analyze semi-structured, unstructured, and multimodal datasets. While a plethora of recent works optimize various semantic operators, existing methods for semantic ORDER BY (full sort) and LIMIT K (top-K) remain lackluster. Our ListK framework improves the latency of semantic ORDER BY ...
LIMIT K at no cost to accuracy. Motivated by the recent advance in fine-tuned listwise rankers, we study several sorting algorithms that best combine partial listwise rankings. These include: 1) deterministic listwise tournament (LTTopK), 2) Las Vegas and embarrassingly parallel listwise multi-pivot quickselect/sort (LMPQSelect, LMPQSort), and 3) a basic Monte Carlo listwise tournament filter (LTFilter). Of these, listwise multi-pivot quickselect/sort are studied here for the first time. The full framework provides a query optimizer for combining the above physical operators based on the target recall to minimize latency. We provide theoretical analysis to easily tune parameters and provide cost estimates for query optimizers. ListK empirically dominates the Pareto frontier, halving latency at virtually no cost to recall and NDCG compared to prior art.
2026-03-18T00:08:12Z
Jason Shin, Jiwon Chang, Fatemeh Nargesian

http://arxiv.org/abs/2603.17168v1
HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage
2026-03-17T21:59:59Z
Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM.
On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (<5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at lambda=0.50) and up to 2.6-9.4x over indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.
2026-03-17T21:59:59Z
15 pages, 12 figures
Haidong Rong, Jiashu Yao, Matthias Langer, Shijie Liu, Li Fan, Dongxin Wang, Jia He, Jinglin Chen, Jiaheng Rang, Julian Qian, Mengyao Xu, Fan Yu, Minseok Lee, Zehuan Wang, Even Oldridge

http://arxiv.org/abs/2603.14644v2
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
2026-03-17T16:50:59Z
Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method ("energy harmonization") that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models.
EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.
2026-03-15T22:41:40Z
This paper was accepted to CVPR 2026
Hongyi Pan, Gorkem Durak, Halil Ertugrul Aktas, Andrea M. Bejar, Baver Tutun, Emre Uysal, Ezgi Bulbul, Mehmet Fatih Dogan, Berrin Erok, Berna Akkus Yildirim, Sukru Mehmet Erturk, Ulas Bagci

http://arxiv.org/abs/2602.02335v3
Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents
2026-03-17T16:04:54Z
Lakehouses are now the default substrate for analytics and AI, but they remain fragile under concurrent, untrusted change: schema mismatches often surface only at runtime, development and production easily diverge, and multi-table pipelines can expose partial results after failure. We present Bauplan, a code-first lakehouse that aims to eliminate a broad class of these failures by construction. Bauplan builds on a storage substrate that already provides atomic single-table snapshot evolution, and adds three pipeline-level correctness mechanisms: typed table contracts to make transformation boundaries checkable, Git-like data versioning to support reproducible collaboration and review, and transactional runs that guarantee atomic publication of an entire pipeline execution. We describe the system design, show how these abstractions fit together into a unified programming model for humans and agents, and report early results from a lightweight Alloy model that both validates key intuitions and exposes subtle counterexamples around transactional branch visibility.
Our experience suggests that correctness in the lakehouse is best addressed not by patching failures after the fact, but by restricting the programming model so that many illegal states become unrepresentable.
2026-02-02T16:58:38Z
Submission pre-print, data conference
Weiming Sheng, Jinlang Wang, Manuel Barros, Aldrin Montana, Jacopo Tagliabue, Luca Bigon

http://arxiv.org/abs/2603.11391v2
BEACON: Budget-Aware Entity Matching Across Domains (Extended Technical Report)
2026-03-17T14:16:28Z
Entity Matching (EM)--the task of determining whether two data records refer to the same real-world entity--is a core task in data integration. Recent advances in deep learning have set a new standard for EM, particularly through fine-tuning Pretrained Language Models (PLMs) and, more recently, Large Language Models (LLMs). However, fine-tuning typically requires large amounts of labeled data, which are expensive and time-consuming to obtain. In the context of e-commerce matching, labeling scarcity varies widely across domains, raising the question of how to intelligently train accurate domain-specific EM models with limited labeled data. In this work we assume users have only a limited amount of labels for a specific target domain but have access to labeled data from other domains. We introduce BEACON, a distribution-aware, budget-aware framework for low-resource EM across domains. BEACON leverages the insight that embedding representations of pairwise candidate matches can guide the effective selection of out-of-domain samples under limited in-domain supervision. We conduct extensive experiments across multiple domain-partitioned datasets derived from established EM benchmarks, demonstrating that BEACON consistently outperforms state-of-the-art methods under different training budgets.
2026-03-12T00:09:50Z
This paper is the extended version of "BEACON: Budget-Aware Entity Matching Across Domains" to appear in Proc.
ACM SIGMOD International Conference on Management of Data (SIGMOD 2026)
Nicholas Pulsone, Roee Shraga, Gregory Goren
10.1145/3802021

http://arxiv.org/abs/2603.16474v1
Practical MCTS-based Query Optimization: A Reproducibility Study and new MCTS algorithm for complex queries
2026-03-17T13:01:29Z
Monte Carlo Tree Search (MCTS) has been proposed as a transformative approach to join-order optimization in database query processing, with recent frameworks such as AlphaJoin and HyperQO claiming to outperform traditional methods. However, the fact that these frameworks rely on learned cost models raises concerns related to generalizability and deployment readiness. This paper presents a comprehensive reproducibility study of these methods, revealing that they often fail to support the claimed performance gains when subjected to diverse workloads.
Through an ablation study, we diagnose the root cause of this instability: while the MCTS search strategy is effective, the accompanying learned cost models suffer from severe out-of-distribution generalization errors.
Addressing this, we propose a novel MCTS framework. Unlike prior methods that rely on unstable learned components, our approach utilizes the database's standard internal cost model, augmented by a new Extreme UCT (Upper Confidence Bound applied to Trees) selection policy to navigate the search space more robustly. We benchmark our method against the original AlphaJoin and HyperQO, as well as industry-standard baselines including Dynamic Programming (DP) and Genetic Query Optimization (GEQO), using the well-known Join Order Benchmark (JOB) and the new JOB-Complex benchmark. The results demonstrate that our approach outperforms learned MCTS methods and also outperforms a SOTA query optimizer in complex join scenarios on real-world data. We release the full implementation and experimental artifacts to support further research.
2026-03-17T13:01:29Z
Vladimir Burlakov, Alena Rybakina, Sergey Kudashev, Konstantin Gilev, Alexander Demin, Denis Ponomaryov, Yuriy Dorn

http://arxiv.org/abs/2603.16450v1
MFTune: An Efficient Multi-fidelity Framework for Spark SQL Configuration Tuning
2026-03-17T12:31:13Z
Apache Spark SQL is a cornerstone of modern big data analytics. However, optimizing Spark SQL performance is challenging due to its vast configuration space and the prohibitive cost of evaluating massive workloads. Existing tuning methods predominantly rely on full-fidelity evaluations, which are extremely time-consuming, often leading to suboptimal performance within practical budgets. While multi-fidelity optimization offers a potential solution, directly applying standard techniques, such as data volume reduction or early stopping, proves ineffective for Spark SQL as they fail to preserve performance correlations or represent true system bottlenecks. To address these challenges, we propose MFTune, an efficient multi-fidelity framework that introduces a query-based fidelity partitioning strategy, utilizing representative SQL subsets to provide accurate, low-cost proxies.
To navigate the huge search space, MFTune incorporates a density-based optimization mechanism for automated knob and range compression, alongside an adapted transfer learning approach and a two-phase warm start to further accelerate the tuning process. Experimental results on TPC-H and TPC-DS benchmarks demonstrate that MFTune significantly outperforms five state-of-the-art tuning methods, identifying superior configurations within practical time constraints.
2026-03-17T12:31:13Z
Beicheng Xu, Lingching Tung, Yuchen Wang, Yupeng Lu, Bin Cui

http://arxiv.org/abs/2603.16360v1
Work Sharing and Offloading for Efficient Approximate Threshold-based Vector Join
2026-03-17T10:47:35Z
Vector joins - finding all vector pairs between a set of query and data vectors whose distances are below a given threshold - are fundamental to modern vector and vector-relational database systems that power multimodal retrieval and semantic analytics. The existing state-of-the-art approach exploits work sharing among similar queries but still suffers from redundant index traversals and excessive distance computations. We propose a unified framework for efficient approximate vector joins that (1) introduces soft work sharing to reuse traversal results beyond the join results of previous queries, (2) builds a merged index over both query and data vectors to further speed up graph exploration, and (3) improves robustness for out-of-distribution queries through an adaptive hybrid search strategy. Experiments on eight datasets demonstrate substantial improvements in the efficiency-recall trade-off over the state of the art.
2026-03-17T10:47:35Z
Kyoungmin Kim, Lennart Roth, Liang Liang, Anastasia Ailamaki